A single agent call costs fractions of a penny. But an agent that runs 10 tool calls per task, handles 5,000 requests a day, and carries a 10,000-token system prompt can cost thousands of pounds a month before you’ve got any real users. At scale, LLM costs aren’t a rounding error — they’re a primary business constraint.
The surprising part: most agent systems are dramatically over-spending in ways that are straightforward to fix. This guide covers where costs actually accumulate in agent loops and the techniques that reliably reduce them, with real numbers from production systems.
Where the Money Goes in an Agent Loop
Before optimising, you need to understand the cost structure. Agent costs come from four places.
Input tokens per turn. Every agent call includes your system prompt, conversation history, tool definitions, and any retrieved context. A complex agent with a 4,000-token system prompt, 3,000 tokens of history, and 2,000 tokens of tool schemas is already at 9,000 input tokens before the agent says a word.
Output tokens. Chain-of-thought reasoning, verbose tool call arguments, and lengthy generated text all cost output tokens — which are typically 3–5x more expensive than input tokens.
Loop iterations. Each step in an agent loop is a separate API call with full context. A 10-step task at 9,000 input tokens per step means 90,000 input tokens for one task completion.
Model tier. Running every call on the most capable (and most expensive) model when simpler sub-tasks don’t need it is one of the most common and most fixable sources of overspend.
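The loop arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes the per-step context stays constant across the loop (in practice history usually grows); the function names are illustrative, and the $3.00/MTok price is only a placeholder to be replaced with your model's actual input rate.

```python
def input_tokens_per_task(system_prompt: int, history: int, tool_schemas: int, steps: int) -> int:
    """Total input tokens for one task, assuming a constant context size per step."""
    per_step = system_prompt + history + tool_schemas
    return per_step * steps


def input_cost_usd(tokens: int, price_per_mtok: float) -> float:
    """Cost of `tokens` input tokens at a given USD price per million tokens."""
    return tokens / 1_000_000 * price_per_mtok


# The example from the text: 4,000 + 3,000 + 2,000 tokens per step, 10 steps.
tokens = input_tokens_per_task(system_prompt=4_000, history=3_000, tool_schemas=2_000, steps=10)
print(tokens)  # 90000
print(round(input_cost_usd(tokens, price_per_mtok=3.00), 2))  # 0.27
```

Multiplying the per-task figure by daily volume gives a first-order budget before any optimisation is applied.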
Technique 1: Prompt Caching
Prompt caching delivers the best ROI of any optimisation for most agent systems. If your system prompt, tool definitions, and document context are consistent across calls — or within a session — the provider can cache that prefix and charge you far less for cache hits.
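With Anthropic's Claude at Sonnet-tier rates, cache-read tokens cost $0.30/MTok versus $3.00/MTok for standard input, a 90% reduction on cached content. A 10,000-token system prompt reused across 1,000 calls is 10 million prefix tokens: $30 uncached versus $3 cached, a saving of roughly $27 on that prefix alone. Cache writes are billed at a premium over standard input, so the saving only materialises once the prefix is actually reused.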
Implementation with the Anthropic SDK:
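import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,  # required by the Messages API; size it to your task
    system=[{
        "type": "text",
        "text": system_prompt,
        # Marks the end of the prefix that should be cached
        "cache_control": {"type": "ephemeral"}
    }],
    messages=conversation_history
)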
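Cache the largest stable prefix you have: system prompt + tool definitions + any static context. For a typical enterprise agent with a 6,000-token stable prefix processed 10,000 times per day, that is 60 million prefix tokens daily, so caching saves roughly $162/day at the $3.00/MTok input rate quoted above, or about $4,860 over a 30-day month. The saving scales with the model's input price, so higher-priced tiers save proportionally more.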
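Both caching figures in this section follow from the same formula, sketched below. It assumes the $3.00 and $0.30 per-MTok rates quoted earlier and ignores cache-write surcharges and cache misses, so treat the result as an upper bound; the function name is illustrative.

```python
def cache_saving_usd(prefix_tokens: int, calls: int, base_price: float, cache_read_price: float) -> float:
    """Saving from reading a cached prefix instead of paying the standard input rate.

    Prices are in USD per million tokens. Cache-write surcharges and cache
    misses are ignored, so this is an upper bound on the real saving.
    """
    total_mtok = prefix_tokens * calls / 1_000_000
    return total_mtok * (base_price - cache_read_price)


# 10,000-token prefix reused across 1,000 calls
print(round(cache_saving_usd(10_000, 1_000, 3.00, 0.30), 2))  # 27.0

# 6,000-token prefix processed 10,000 times per day
print(round(cache_saving_usd(6_000, 10_000, 3.00, 0.30), 2))  # 162.0
```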
OpenAI also supports prompt caching automatically for context prefixes over 1,024 tokens on GPT-4o and later models, so you get the discount without any code changes.
Technique 2: Model Routing
Not all agent steps need your most capable model. A research agent doing literature search, fact extraction, and final synthesis is performing three qualitatively different tasks — and they don’t all require frontier-model intelligence.
A sensible routing strategy: use a smaller model (Haiku, GPT-4o-mini, Gemini Flash) for tool selection from a short list, input formatting and validation, extracting structured data from clean text, and simple yes/no decision gates. Use a mid-tier model (Sonnet, GPT-4o) for summarisation, basic analysis, and well-defined code generation. Reserve the frontier model for final synthesis, complex reasoning, and tasks where errors are costly.
A real example: a content pipeline that previously ran all 8 steps on Claude Opus 4 was rerouted so 5 preprocessing steps used Haiku and 3 final steps used Sonnet. Input cost per full pipeline run dropped from $0.42 to $0.11 — a 74% reduction — with no measurable quality regression on the final output.
LangGraph routing implementation:
def route_to_model(state):
    task_type = state["current_task_type"]
    if task_type in ["format", "validate", "extract"]:
        return "haiku"
    elif task_type in ["summarise", "analyse"]:
        return "sonnet"
    else:
        return "opus"

workflow.add_conditional_edges("task_router", route_to_model, {
    "haiku": "haiku_node",
    "sonnet": "sonnet_node",
    "opus": "opus_node"
})
Technique 3: Prompt Compression
Long prompts cost money on every call. Several compression approaches reduce token count without significant quality loss.
Conversation summarisation: Instead of passing the full conversation history, periodically summarise older turns. A 20-turn conversation history of 8,000 tokens compresses to a 400-token summary — saving 7,600 tokens on every subsequent call.
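def compress_history(messages, keep_recent=4):
    # Nothing to compress until the history outgrows the recent window
    if len(messages) <= keep_recent:
        return messages
    older = messages[:-keep_recent]
    recent = messages[-keep_recent:]
    # summarise_with_cheap_model is your own helper, e.g. a single Haiku call
    summary = summarise_with_cheap_model(older)
    # OpenAI-style message list; with Anthropic, pass the summary via the
    # top-level `system` parameter instead of a "system" role message
    return [{"role": "system", "content": f"Earlier context: {summary}"}] + recent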
LLMLingua for aggressive compression: The llmlingua library uses a small model to identify and remove low-information tokens from prompts. It can reduce prompt length by 3–5x with less than 5% performance degradation on many tasks — particularly effective on retrieved RAG context.
Tool schema pruning: Remove tools from the system prompt that aren’t relevant to the current task phase. If your agent has 15 tools but is in a “data collection” phase where only 3 apply, include only those 3. Tool schemas are often 100–200 tokens each — pruning 12 saves 1,200–2,400 tokens per call.
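Tool schema pruning can be as simple as tagging each tool with the phases it belongs to and filtering before each call. The sketch below is one way to do that; the registry, tool names, phase labels, and the `phases` key are all hypothetical and not part of any provider's tool format, which is why the tag is stripped before the definitions are sent.

```python
from typing import Any, Dict, List

# Hypothetical registry: tool names, phases, and schemas are illustrative only.
TOOL_REGISTRY: List[Dict[str, Any]] = [
    {
        "name": "search_documents",
        "phases": {"data_collection"},
        "description": "Search the document store.",
        "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
    },
    {
        "name": "fetch_url",
        "phases": {"data_collection"},
        "description": "Download a web page.",
        "input_schema": {"type": "object", "properties": {"url": {"type": "string"}}},
    },
    {
        "name": "write_report",
        "phases": {"reporting"},
        "description": "Write the final report.",
        "input_schema": {"type": "object", "properties": {"outline": {"type": "string"}}},
    },
]


def tools_for_phase(registry: List[Dict[str, Any]], phase: str) -> List[Dict[str, Any]]:
    """Return only the tool definitions tagged for `phase`, minus the internal tag."""
    return [
        {key: value for key, value in tool.items() if key != "phases"}
        for tool in registry
        if phase in tool["phases"]
    ]


active_tools = tools_for_phase(TOOL_REGISTRY, "data_collection")
print([tool["name"] for tool in active_tools])  # ['search_documents', 'fetch_url']
```

One trade-off to weigh: tool definitions are part of the prompt prefix, so changing the tool list between phases is likely to invalidate a cached prefix. Pruning tends to pay off most when each phase spans many calls.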
Technique 4: Semantic Caching
Beyond prompt caching (caching the prefix), semantic caching stores full LLM responses and serves them for semantically similar future queries — no model call needed at all.
When a query comes in, you embed it and check whether a similar query (cosine similarity above a threshold, typically 0.92–0.95) has been answered before. If yes, return the cached response instantly.
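from gptcache import cache
from gptcache.adapter import openai  # drop-in wrapper around the pre-1.0 openai SDK
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation
onnx = Onnx()  # local embedding model used to vectorise each query
data_manager = get_data_manager(CacheBase("sqlite"), VectorBase("faiss", dimension=onnx.dimension))
cache.init(embedding_func=onnx.to_embeddings, data_manager=data_manager, similarity_evaluation=SearchDistanceEvaluation())
cache.set_openai_key()
response = openai.ChatCompletion.create(model="gpt-4o", messages=messages)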
GPTCache and Zep both provide semantic caching out of the box. For a customer support agent where 30–40% of queries are near-duplicates, this can eliminate model calls for that whole fraction — saving both cost and latency in one go.
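The embed-and-compare lookup described above can be sketched without any library. This is a toy illustration, not how GPTCache works internally: `SemanticCache` and `toy_embed` are hypothetical names, the word-bucket embedding is a stand-in for a real embedding model, and a production system would use a vector index rather than a linear scan.

```python
import math
import re
from typing import Callable, List, Optional, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_embed(text: str, buckets: int = 64) -> List[float]:
    """Stand-in for a real embedding model: counts words in deterministic buckets."""
    vector = [0.0] * buckets
    for word in re.findall(r"[a-z]+", text.lower()):
        vector[sum(map(ord, word)) % buckets] += 1.0
    return vector


class SemanticCache:
    def __init__(self, embed: Callable[[str], List[float]], threshold: float = 0.92):
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[List[float], str]] = []

    def get(self, query: str) -> Optional[str]:
        """Return the best cached response if its similarity clears the threshold."""
        vector = self.embed(query)
        best_score, best_response = 0.0, None
        for cached_vector, response in self.entries:
            score = cosine_similarity(vector, cached_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))


cache = SemanticCache(embed=toy_embed, threshold=0.92)
cache.put("How do I reset my password?", "Go to Settings > Security.")
hit = cache.get("how do I reset my password")     # same words, so this is served from cache
miss = cache.get("What are your opening hours?")  # unrelated, so this falls through to the model
```

The threshold is the main tuning knob: set it too low and users get answers to questions they did not ask, set it too high and the hit rate collapses.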
Technique 5: Batching
If your agent handles non-real-time tasks — data processing, report generation, overnight analysis — batching can cut costs by 50%.
Anthropic’s Message Batches API processes requests asynchronously at a 50% discount on all tokens. For an agent processing 10,000 documents nightly, batching that workload halves the cost with no change to output quality.
message_batch = client.messages.batches.create(
    requests=[
        {"custom_id": f"doc_{i}", "params": build_request(doc)}
        for i, doc in enumerate(documents)
    ]
)
# Results are available within 24 hours, which suits async workflows
OpenAI offers equivalent batch pricing at 50% off via their Batch API.
Compound Savings: A Real Example
A mid-market SaaS team running a document processing agent before optimisation:
- Daily volume: 8,000 tasks
- Previous cost: $1,840/day ($55,200/month)
- Stack: All calls on GPT-4o, no caching, full history in every prompt
After applying prompt caching, model routing (Haiku for 6/10 steps, GPT-4o for 4/10), conversation summarisation, and batching overnight tasks:
- New cost: $620/day ($18,600/month)
- Saving: 66% reduction ($1,220/day, or about $439,200/year on the same 30-day-month basis)
- Quality change: None measurable on their production eval set
The most impactful single change was model routing (saved $740/day). Prompt caching came second ($340/day). Together they accounted for 88% of total savings before any other technique was applied.
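The headline percentages in this example can be reproduced from the daily figures. The sketch below uses only the numbers quoted above; the variable names are illustrative.

```python
# Daily costs and per-technique savings as quoted in the example (USD).
before_per_day = 1_840
after_per_day = 620
routing_saving = 740
caching_saving = 340

saving_per_day = before_per_day - after_per_day
reduction_pct = saving_per_day / before_per_day * 100
top_two_share_pct = (routing_saving + caching_saving) / saving_per_day * 100

print(saving_per_day)            # 1220
print(round(reduction_pct))      # 66
print(round(top_two_share_pct))  # 89 (88.5% before rounding, quoted as 88% in the text)
```

The remaining $140/day comes from conversation summarisation and batching combined, which is why those sit later in the audit order.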
Where to Start
If you’re currently spending more than £400/month on LLM costs, audit in this order. First, add prompt caching to your system prompt and tool definitions — minimal code change, fastest return. Then map your agent loop steps and identify which genuinely require frontier-model quality. After that, implement conversation summarisation if your agents maintain long histories, add semantic caching if your queries have high repetition, and finally migrate batch-eligible workloads to the Batch API.
The 60% figure in this article’s title is conservative for many systems. Teams that haven’t looked at this problem yet routinely find 70–80% savings available through these techniques alone.