Reasoning Models in Agent Pipelines: When to Use Them (and When Not To)

TL;DR:

Reasoning models (extended thinking, o3, Gemini 2.5 Pro) are significantly better on multi-step logic, ambiguous instructions, and complex planning — but cost 5–15x more and take longer
Most agent pipelines should use fast models for the majority of steps and reasoning models selectively for the steps that actually need them
The practical routing heuristic: use reasoning models for planning, decomposition, and ambiguous intent; use fast models for execution, formatting, and tool call invocation

The arrival of genuinely capable reasoning models has created a question most agent developers are navigating right now: when should I use a thinking model versus a standard model in my pipeline?

The naive answer — use the best model everywhere — is expensive and often slower without being meaningfully better. The equally naive answer — use the cheapest model everywhere — produces pipelines that fail on exactly the cases that matter. The answer that actually works is more interesting.

What reasoning models actually do differently

Reasoning models don’t just output an answer. They generate a chain of thought — sometimes explicitly (like Claude’s extended thinking), sometimes internally before producing a response — that allows them to catch errors in their own logic, backtrack, and arrive at conclusions through structured deliberation rather than pattern matching.

On standard benchmarks this produces measurable improvements on tasks involving: multi-step maths and logic, interpreting ambiguous or underspecified instructions, planning sequences of actions where later steps depend on earlier ones, and synthesising conclusions from conflicting evidence.

On tasks like formatting output, executing a well-defined tool call with clear parameters, or generating text where the structure is already specified, the improvement is marginal at best. You’re paying for deep reasoning that the task doesn’t require.

The cost and latency reality

A typical comparison in mid-2026:

Model	Relative cost (input+output)	Typical latency
Claude Haiku 4.5	1x	~0.5–1s
Claude Sonnet 4.6 (standard)	~5x	~1–3s
Claude Sonnet 4.6 (extended thinking)	~10–20x depending on budget	~5–30s
Gemini 2.5 Flash	~2x	~0.5–1s
Gemini 2.5 Pro	~10x	~3–10s
o3-mini	~4x	~2–5s

For a pipeline making 50 LLM calls per user request, indiscriminate use of reasoning models can push cost per request above what’s viable for most products. Latency compounds similarly — a 10-step pipeline with 10s average step latency takes 100s to complete.

A practical routing framework

Think of your pipeline steps in three categories:

Planning and decomposition. When an agent receives a complex or ambiguous task and needs to break it into sub-tasks, decide which tools to use, or formulate a strategy — this is where reasoning models pay off. The improved planning quality reduces downstream errors that would otherwise require re-runs. Use a reasoning model here.

Execution steps. Invoking a specific tool with clear parameters, calling a well-defined API, extracting structured data from a known format — these are pattern-matching tasks. The quality improvement from a reasoning model is minimal. Use a fast model.

Synthesis and judgement. When the agent needs to combine results from multiple tool calls, resolve contradictions in retrieved information, or make a nuanced decision about next steps — this is another candidate for reasoning model involvement. How much it matters depends on the complexity of the synthesis required.

Output formatting. Generating the final response in a specific format, summarising content, or producing structured output for downstream consumption — standard model, no need for extended reasoning.

In code, this often looks like:

PLANNER_MODEL = "claude-sonnet-4-6"  # with extended thinking enabled
EXECUTOR_MODEL = "claude-haiku-4-5"
SYNTHESISER_MODEL = "claude-sonnet-4-6"  # standard, no thinking

async def run_task(task: str):
    plan = await plan_with_thinking(task, model=PLANNER_MODEL)
    results = await execute_steps(plan, model=EXECUTOR_MODEL)
    return await synthesise(results, model=SYNTHESISER_MODEL)

Signals that a step needs a reasoning model

Some practical indicators that a pipeline step warrants a reasoning model:

The step involves interpreting an instruction that could plausibly mean several different things
The step requires planning more than 3–4 sequential actions where order matters
Previous runs show the step frequently producing incorrect or partial results with a standard model
The step involves mathematical calculation, logical inference, or constraint satisfaction
The cost of the step failing downstream (re-runs, user-facing errors) is high relative to the marginal cost of a reasoning model

Extended thinking tuning

For Claude’s extended thinking specifically, you can control the thinking budget (how many tokens the model allocates to reasoning before responding). Higher budgets improve quality on harder problems; lower budgets reduce cost and latency.

A practical approach: start with a moderate budget (around 2,000–4,000 thinking tokens), test on a representative sample of real inputs, and adjust. On problems that are genuinely hard, additional thinking budget continues to help. On problems that are just moderately complex, the returns flatten quickly.

What doesn’t work

Routing based on task type labels (“this is a coding task, therefore use reasoning model”) is brittle — the same task type varies hugely in difficulty. Routing based on input length is similarly unreliable.

The most reliable signal is failure rate on a standard model. If a particular step in your pipeline fails or produces low-quality output more than ~15% of the time with a fast model, that’s a strong candidate for reasoning model replacement. If it’s working reliably, leave it alone.

The bottom line

Reasoning models are genuinely better on genuinely hard reasoning tasks. They’re not meaningfully better on execution tasks that a capable standard model handles reliably. Building a heterogeneous pipeline — planning with thinking, executing with fast models, synthesising with something in between — is the approach that gives you most of the quality benefit at a fraction of the cost.

Start with all-standard, measure where failure rates are highest, and introduce reasoning models surgically at those points.