TL;DR:
- Poor tool definitions are the leading cause of agent failures — precision in JSON Schema pays back immediately
- Parallel tool calls can cut latency by 40–70% on multi-step tasks with independent steps, and can reduce token costs
- Structured error handling in tool responses prevents agents from hallucinating when things go wrong
Tool calling is the connective tissue of every serious AI agent. The model decides what to do; tools do the actual work — reading files, querying APIs, running code, writing to databases. When tool calling is solid, agents feel autonomous and reliable. When it’s fragile, they loop, hallucinate, or silently produce wrong results.
Most agent failures trace back not to the model itself, but to how tools are defined and how errors are returned. Here’s what distinguishes production-grade tool use from the default approach.
Write Tool Descriptions Like Documentation for a Confused Junior Developer
The model has no prior knowledge of your system. Every tool definition should answer three questions unambiguously:
- When should I call this tool? (not just what it does)
- What are the exact constraints on each parameter?
- What will the response look like?
A weak description:
{
  "name": "get_user",
  "description": "Get user information",
  "parameters": {
    "type": "object",
    "properties": {
      "id": { "type": "string" }
    }
  }
}
A strong description:
{
  "name": "get_user",
  "description": "Fetch a user record by their unique ID. Use this when you need profile details, preferences, or account status. Returns null if the user does not exist. Do NOT call this speculatively to check existence; use user_exists instead.",
  "parameters": {
    "type": "object",
    "properties": {
      "user_id": {
        "type": "string",
        "description": "UUID v4 format, e.g. '550e8400-e29b-41d4-a716-446655440000'. Found in session context or previous user_search results."
      }
    },
    "required": ["user_id"]
  }
}
The second version tells the model when not to use it, what format the parameter must be in, and where to find the value. These constraints prevent the most common failure modes.
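The three questions above can be turned into a rough automated check. The following is a minimal sketch, not an established tool: the function name lint_tool_definition, the 40-character threshold, and the assumption that schemas live under a "parameters" key (as in the examples above; some APIs use a different key) are all illustrative choices.

```python
def lint_tool_definition(tool: dict, min_description_chars: int = 40) -> list[str]:
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    name = tool.get("name", "<unnamed>")

    # A very short description rarely says *when* to call the tool.
    description = tool.get("description", "")
    if len(description) < min_description_chars:
        problems.append(f"{name}: description is too short to say when to call the tool")

    schema = tool.get("parameters", {})
    properties = schema.get("properties", {})
    required = schema.get("required", [])

    if properties and not required:
        problems.append(f"{name}: no 'required' list, so every parameter is optional")

    # Each parameter should document its format and where the value comes from.
    for param, spec in properties.items():
        if not spec.get("description"):
            problems.append(f"{name}.{param}: missing description (format, source of the value)")

    for param in required:
        if param not in properties:
            problems.append(f"{name}: required parameter {param!r} is not defined in properties")

    return problems


# The weak definition from above, used as sample input.
weak_tool = {
    "name": "get_user",
    "description": "Get user information",
    "parameters": {"type": "object", "properties": {"id": {"type": "string"}}},
}

for problem in lint_tool_definition(weak_tool):
    print(problem)
```

A check like this only catches missing fields. Whether a description actually tells the model when not to call the tool still needs a human review.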
Use Parallel Tool Calls Aggressively
Most LLM APIs support returning multiple tool calls in a single response. This means the model can request several independent actions simultaneously rather than waiting for each to complete before deciding on the next.
Consider an agent asked to summarise three Jira tickets. The naive approach:
- Call get_ticket(ABC-1) → wait → get result
- Call get_ticket(ABC-2) → wait → get result
- Call get_ticket(ABC-3) → wait → get result
- Summarise all three
In the parallel approach, the model returns all three calls in one response and the client dispatches them concurrently. Wall-clock time for the tool calls drops from roughly 3× the per-call latency to roughly 1× (the latency of the slowest call).
import asyncio

# Claude can return several tool_use blocks in one response, mixed with text blocks.
# Filter first so that calls and results stay aligned when zipped below.
tool_calls = [block for block in response.content if block.type == "tool_use"]

# Execute all calls concurrently (this must run inside an async function)
results = await asyncio.gather(*[
    execute_tool(call.name, call.input)
    for call in tool_calls
])

# Return all results in a single user message
tool_results = [
    {"type": "tool_result", "tool_use_id": call.id, "content": str(result)}
    for call, result in zip(tool_calls, results)
]
The key constraint: only parallelise tool calls that are genuinely independent. If tool B needs the output of tool A, they must remain sequential.
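One practical detail when running independent calls concurrently: by default, asyncio.gather propagates the first exception, so a single failing tool can lose the results of the others. The sketch below passes return_exceptions=True and converts failures into error results instead. The execute_tool stub, the tool names, and the plain-dict call shape are assumptions for illustration. The is_error field follows the shape of Anthropic's tool_result block, and other APIs may differ.

```python
import asyncio


async def execute_tool(name: str, args: dict) -> str:
    """Hypothetical stand-in for a real tool dispatcher."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    if name == "broken_tool":
        raise RuntimeError("upstream service unavailable")
    return f"{name} ok: {args}"


async def run_tool_calls(calls: list[dict]) -> list[dict]:
    """Run independent tool calls concurrently; one failure does not discard the rest."""
    outcomes = await asyncio.gather(
        *(execute_tool(c["name"], c["input"]) for c in calls),
        return_exceptions=True,  # exceptions come back as values, in call order
    )
    results = []
    for call, outcome in zip(calls, outcomes):
        failed = isinstance(outcome, BaseException)
        results.append({
            "type": "tool_result",
            "tool_use_id": call["id"],
            "content": f"error: {outcome}" if failed else str(outcome),
            "is_error": failed,
        })
    return results


calls = [
    {"id": "call_1", "name": "get_ticket", "input": {"key": "ABC-1"}},
    {"id": "call_2", "name": "broken_tool", "input": {}},
    {"id": "call_3", "name": "get_ticket", "input": {"key": "ABC-3"}},
]
results = asyncio.run(run_tool_calls(calls))
for r in results:
    print(r["tool_use_id"], r["is_error"], r["content"])
```

Results come back in the same order as the calls, so each one can be matched to its tool_use_id even when the middle call fails.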
Make Tool Errors Informative, Not Fatal
The single worst thing a tool can return is an unhandled exception message. The model will either retry blindly, apologise and stop, or — worst — hallucinate a plausible-sounding result.
Return structured errors in tool responses instead:
def get_customer_orders(customer_id: str, limit: int = 10) -> dict:
    try:
        orders = db.query_orders(customer_id, limit)
        return {"success": True, "orders": orders, "total": len(orders)}
    except CustomerNotFoundError:
        return {
            "success": False,
            "error": "customer_not_found",
            "message": f"No customer with ID {customer_id!r}. Check the ID format: customer IDs begin with 'CUS_'.",
            "suggestion": "Use search_customers to find the correct ID first."
        }
    except RateLimitError as e:
        return {
            "success": False,
            "error": "rate_limited",
            "retry_after_seconds": e.retry_after,
            "message": "Database rate limit hit. Wait before retrying."
        }
A model that receives "error": "customer_not_found" with a corrective suggestion can self-correct. One that receives a raw Python traceback cannot.
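Handling known exceptions inside each tool still leaves the unexpected ones. One way to cover them is a decorator that catches anything else and returns the same structured shape, keeping the traceback in the logs rather than in the model's context. This is a sketch under assumptions: structured_errors, the "internal_error" code, and the get_exchange_rate tool are hypothetical names, not part of any library.

```python
import functools
import logging

logger = logging.getLogger("tools")


def structured_errors(func):
    """Wrap a tool so unexpected exceptions become a structured error dict."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # Keep the full traceback in the logs, not in the model's context.
            logger.exception("tool %s failed", func.__name__)
            return {
                "success": False,
                "error": "internal_error",
                "message": f"{func.__name__} failed unexpectedly ({type(exc).__name__}).",
                "suggestion": "Do not guess the result. Report the failure or try a different tool.",
            }
    return wrapper


@structured_errors
def get_exchange_rate(currency: str) -> dict:
    """Hypothetical tool with a bug: unknown currencies raise KeyError."""
    rates = {"EUR": 1.0, "USD": 1.1}  # illustrative values only
    return {"success": True, "rate": rates[currency]}


print(get_exchange_rate("USD"))
print(get_exchange_rate("XYZ"))
```

The second call returns an "internal_error" dict instead of raising, so the agent loop always receives a result it can pass back to the model.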
Validate Inputs Before Calling the Tool
JSON Schema can express many constraints, but not all. Wrap tool execution with a validation layer that catches bad inputs before they reach downstream services:
from datetime import datetime

def validate_date_range(start: str, end: str) -> str | None:
    """Returns an error message or None if valid."""
    try:
        s = datetime.fromisoformat(start)
        e = datetime.fromisoformat(end)
    except ValueError:
        return "Dates must be ISO 8601 format: YYYY-MM-DD"
    if e < s:
        return f"end_date ({end}) must not be before start_date ({start})"
    if (e - s).days > 365:
        return "Date range cannot exceed 365 days"
    return None

def get_analytics_report(start_date: str, end_date: str) -> dict:
    if error := validate_date_range(start_date, end_date):
        return {"success": False, "error": "invalid_parameters", "message": error}
    # proceed with valid inputs
This pattern catches mistakes the model makes due to ambiguous instructions before they cause downstream failures.
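The same idea can be applied once, at the dispatch layer, instead of inside every tool. The sketch below pairs each tool with a validator and runs the validator first. It uses a cross-field rule (a minimum that must not exceed a maximum) as an example of a constraint that plain JSON Schema does not express. All names here (TOOLS, dispatch, search_orders, validate_search_orders) are hypothetical.

```python
from typing import Optional


def validate_search_orders(args: dict) -> Optional[str]:
    """Return an error message, or None if the arguments are usable."""
    expected = {"min_total", "max_total"}
    if set(args) != expected:
        return f"Pass exactly these parameters: {sorted(expected)}; got {sorted(args)}"
    for key in sorted(expected):
        value = args[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return f"{key} must be a number"
    if args["min_total"] > args["max_total"]:
        return "min_total must not exceed max_total"
    return None


def search_orders(min_total: float, max_total: float) -> dict:
    """Hypothetical tool body; a real one would query a database."""
    return {"success": True, "orders": [], "range": [min_total, max_total]}


# Tool name -> (handler, validator). Both entries are illustrative.
TOOLS = {"search_orders": (search_orders, validate_search_orders)}


def dispatch(name: str, args: dict) -> dict:
    """Validate first, then call the tool."""
    if name not in TOOLS:
        return {"success": False, "error": "unknown_tool",
                "message": f"No tool named {name!r}. Available: {sorted(TOOLS)}"}
    handler, validator = TOOLS[name]
    if (error := validator(args)) is not None:
        return {"success": False, "error": "invalid_parameters", "message": error}
    return handler(**args)


print(dispatch("search_orders", {"min_total": 50, "max_total": 10}))
```

Keeping validators next to the handlers in one registry also gives a single place to reject tool names the model made up.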
Limit Tool Surface Area
Fewer tools with clearer purposes outperform many tools with overlapping scopes. If the model must choose between search_users, find_user_by_email, lookup_account, and get_member_info, it will sometimes choose wrong.
Consolidate where you can:
{
  "name": "find_user",
  "description": "Look up a user by any identifier. Pass exactly one of: user_id (UUID), email, or username. Returns null if not found.",
  "parameters": {
    "type": "object",
    "properties": {
      "user_id": { "type": "string" },
      "email": { "type": "string", "format": "email" },
      "username": { "type": "string" }
    },
    "minProperties": 1,
    "maxProperties": 1
  }
}
A single tool with a discriminated input is easier for models to reason about than four overlapping ones.
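Some providers may not enforce object-level keywords such as minProperties and maxProperties, so it is safer to repeat the "exactly one identifier" rule inside the handler. A minimal sketch, assuming an in-memory USERS list in place of a real data store (the records and the error wording are illustrative):

```python
IDENTIFIER_FIELDS = ("user_id", "email", "username")

# Hypothetical in-memory data standing in for a real user store.
USERS = [
    {
        "user_id": "550e8400-e29b-41d4-a716-446655440000",
        "email": "ada@example.com",
        "username": "ada",
    },
]


def find_user(**args) -> dict:
    """Look up a user by exactly one of user_id, email, or username."""
    provided = {k: v for k, v in args.items() if k in IDENTIFIER_FIELDS and v}
    unknown = sorted(set(args) - set(IDENTIFIER_FIELDS))
    if unknown or len(provided) != 1:
        return {
            "success": False,
            "error": "invalid_parameters",
            "message": (
                f"Pass exactly one of {list(IDENTIFIER_FIELDS)}; "
                f"got {sorted(args) or 'nothing'}."
            ),
        }
    # Exactly one identifier is present; unpack it.
    (field, value), = provided.items()
    user = next((u for u in USERS if u[field] == value), None)
    return {"success": True, "user": user}  # user is None when not found


print(find_user(email="ada@example.com"))
print(find_user(email="ada@example.com", username="ada"))
```

With this check in place, a call that passes two identifiers receives a corrective message rather than an arbitrary match on one of them.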
Log Every Tool Call and Result
Observability starts with complete tool call logs. At minimum, record: tool name, inputs, output (truncated if large), latency, and whether it succeeded. This data is essential for diagnosing why an agent behaved unexpectedly and for building evals.
Structured logging with a correlation ID that spans the full agent run makes it possible to replay any failure and understand exactly what the model received at each step.
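As a rough sketch of what such a log record could look like, the wrapper below times a tool call, truncates large outputs, and emits one JSON line tagged with a run ID. The function name run_logged, the field names, and the 500-character cutoff are assumptions, and the success flag assumes tools return dicts with a "success" key as in the error-handling examples.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("agent.tools")


def run_logged(run_id: str, name: str, func, args: dict, max_output_chars: int = 500):
    """Call a tool, emit one JSON log line, and return (result, record)."""
    start = time.perf_counter()
    result = func(**args)
    latency_ms = round((time.perf_counter() - start) * 1000, 2)

    output = json.dumps(result, default=str)
    record = {
        "run_id": run_id,  # correlation ID shared by every call in one agent run
        "tool": name,
        "inputs": args,
        "output": output[:max_output_chars],
        "output_truncated": len(output) > max_output_chars,
        "latency_ms": latency_ms,
        # Assumes structured results; anything else is treated as a success.
        "success": bool(result.get("success", True)) if isinstance(result, dict) else True,
    }
    logger.info(json.dumps(record, default=str))
    return result, record


def echo(text: str) -> dict:
    """Hypothetical tool used only to exercise the wrapper."""
    return {"success": True, "text": text}


logging.basicConfig(level=logging.INFO)
run_id = uuid.uuid4().hex  # generate once per agent run
result, record = run_logged(run_id, "echo", echo, {"text": "hello"})
```

Generating the run ID once at the start of the agent run and passing it to every call is what makes the log lines for one run easy to collect and replay.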
The reliability gap between demo agents and production agents is mostly an engineering problem, not a model capability problem. Tighten your tool definitions, handle errors explicitly, and run tool calls in parallel — the improvement in reliability is immediate and measurable.