TL;DR:
- Unstructured LLM output is a common cause of fragile agent pipelines; structured output addresses it at the model level
- JSON mode guarantees valid JSON but not your schema; function calling and tool use enforce the actual shape
- Libraries like Instructor and Pydantic AI make schema-enforced generation practical in Python without boilerplate
If you’ve built anything more than a demo with an LLM, you’ve hit the wall: the model returns something close to what you asked for, but not exactly. A field is missing. A string where you expected an integer. Markdown wrapped around the JSON you needed to parse. Your downstream code breaks, or worse, silently passes bad data along.
Structured output is how you fix this at the source.
Why “just prompt it” doesn’t scale
Prompting the model to return JSON is the first thing everyone tries. It works, until it doesn’t. Models drop closing braces, add explanatory text before the JSON block, or return a list when you asked for an object. At low request volumes you can paper over this with retry logic and a lenient parser. At scale, or in any workflow where correctness matters, it becomes a maintenance burden.
The problem isn’t prompt quality. It’s that natural language instructions are fundamentally soft constraints — the model trades them off against other objectives. You need hard constraints enforced at the generation level.
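The lenient-parser workaround tends to look something like the sketch below. The helper name and the sample response are made up for illustration. Note what it does not fix: the parse succeeds, but the score still arrives as a string.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence marker
FENCED_BLOCK = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL
)


def lenient_parse(raw: str) -> dict:
    """Best-effort extraction of a JSON object from chatty model output."""
    # Prefer the contents of a Markdown code fence if one is present.
    match = FENCED_BLOCK.search(raw)
    candidate = match.group(1) if match else raw
    # Drop any text outside the outermost pair of braces.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    return json.loads(candidate[start : end + 1])


# Hypothetical response: preamble text, a fenced block, and a wrongly typed field.
chatty = (
    "Sure! Here is the result:\n"
    + FENCE
    + 'json\n{"severity": "high", "score": "7"}\n'
    + FENCE
)
parsed = lenient_parse(chatty)
print(parsed)
```

Every new failure mode means another branch in a helper like this, which is where the maintenance burden comes from.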
The three levels of structured output
Level 1: JSON mode. Most frontier model APIs offer a JSON mode or an equivalent setting that constrains the output to valid JSON syntax, though the flag name and the exact guarantee vary between OpenAI, Anthropic, Google, and Mistral. It eliminates parse errors but says nothing about structure: you’ll get valid JSON that may have completely different keys than you expected.
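A small illustration of that gap, using a made-up response and a made-up set of expected keys: the payload parses cleanly, yet none of the keys match.

```python
import json

# Keys the downstream code expects (assumed for this example).
EXPECTED_KEYS = {"threat_actor", "confidence", "affected_sectors"}

# Hypothetical JSON-mode response: valid syntax, but the model picked its own keys.
raw = '{"actor": "unknown", "certainty": "high", "sectors": ["finance"]}'

payload = json.loads(raw)  # parses without error
missing = EXPECTED_KEYS - set(payload)
unexpected = set(payload) - EXPECTED_KEYS

print("missing:", sorted(missing))
print("unexpected:", sorted(unexpected))
```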
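Level 2: Function calling / tool use. When you define a tool schema (typically a JSON Schema object), the model returns its tool call arguments in the shape that schema describes. By default this is learned behaviour rather than a hard guarantee, so an occasional call can still omit a required field or get a type wrong. Where the provider offers a strict mode (OpenAI’s strict: true on function definitions, for example), the arguments are constrained to the schema during decoding. Most major model APIs support tool use, and it’s the most reliable mechanism available without custom inference infrastructure.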
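For a sense of what a tool schema looks like, here is a sketch of an Anthropic-style tool definition. The tool name and fields are invented for illustration, and the small checker is a stand-in that only covers required keys and top-level types, not a full JSON Schema validator.

```python
# Anthropic-style tool definition; the name and fields are hypothetical.
record_threat_tool = {
    "name": "record_threat_assessment",
    "description": "Record a structured assessment of a security incident.",
    "input_schema": {
        "type": "object",
        "properties": {
            "threat_actor": {"type": "string"},
            "confidence": {"type": "number"},
            "affected_sectors": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["threat_actor", "confidence", "affected_sectors"],
    },
}

JSON_TYPES = {"string": str, "number": (int, float), "array": list, "object": dict}


def check_arguments(arguments: dict, schema: dict) -> list[str]:
    """Return a list of problems found in a tool call's arguments."""
    problems = [
        f"missing field: {name}"
        for name in schema["required"]
        if name not in arguments
    ]
    for name, value in arguments.items():
        spec = schema["properties"].get(name)
        if spec is None:
            problems.append(f"unexpected field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(value, bool) or not isinstance(value, JSON_TYPES[spec["type"]]):
            problems.append(f"wrong type for {name}: expected {spec['type']}")
    return problems


schema = record_threat_tool["input_schema"]
good = {"threat_actor": "unknown", "confidence": 0.4, "affected_sectors": ["energy"]}
bad = {"threat_actor": "unknown", "confidence": "high"}

print(check_arguments(good, schema))
print(check_arguments(bad, schema))
```

In practice a library such as Pydantic does this validation for you. The point is that the schema you send to the API can also be used to check what comes back.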
Level 3: Constrained decoding. At inference time, the logits of any token that would violate your schema are masked out, so only schema-consistent tokens can be sampled. Libraries like Outlines and Guidance do this for locally hosted models. It gives the strongest guarantee, but doing it yourself requires control over the inference stack. With hosted APIs you only get it where the provider exposes it, typically as a strict structured-output mode.
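The masking idea can be shown with a toy example. This is not how Outlines or Guidance are implemented. The five-token vocabulary, the two permitted outputs, and the scoring function are all invented to make the mechanism visible.

```python
import math

# Toy setup: a tiny vocabulary and the only two outputs the "schema" permits.
VOCAB = ['{"ok":', "true", "false", "}", "maybe"]
VALID_OUTPUTS = ['{"ok":true}', '{"ok":false}']


def allowed(prefix: str) -> set[int]:
    """Token ids that keep prefix + token a prefix of some valid output."""
    return {
        i
        for i, tok in enumerate(VOCAB)
        if any(v.startswith(prefix + tok) for v in VALID_OUTPUTS)
    }


def constrained_greedy(score_fn) -> str:
    """Greedy decoding with disallowed tokens masked to -inf before the argmax."""
    out = ""
    while out not in VALID_OUTPUTS:
        scores = score_fn(out)  # one raw score per vocabulary entry
        mask = allowed(out)
        masked = [s if i in mask else -math.inf for i, s in enumerate(scores)]
        best = max(range(len(VOCAB)), key=masked.__getitem__)
        out += VOCAB[best]
    return out


def biased_scores(prefix: str) -> list[float]:
    """Stand-in for a model that would prefer to say "maybe" at every step."""
    return [0.1, 0.3, 0.2, 0.1, 0.9]


result = constrained_greedy(biased_scores)
print(result)
```

Even though the stand-in model always scores "maybe" highest, that token is never reachable, so the output is valid by construction.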
For most production agent work, you want Level 2: tool-use-based schema enforcement via the model API.
Instructor: the pragmatic choice for Python
Instructor wraps the OpenAI, Anthropic, and Gemini clients and uses function calling under the hood to enforce Pydantic model schemas. The developer experience is clean:
import instructor
from anthropic import Anthropic
from pydantic import BaseModel

client = instructor.from_anthropic(Anthropic())

class ThreatAssessment(BaseModel):
    threat_actor: str
    confidence: float  # 0.0–1.0
    affected_sectors: list[str]
    recommended_action: str

result = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Analyse this incident report: ..."}],
    response_model=ThreatAssessment,
)
print(result.threat_actor)  # always a string, always present
Instructor handles the tool schema generation from your Pydantic model, parses the response, validates it, and retries with validation error feedback if the model gets it wrong. The retry loop is configurable — you can set max_retries and get automatic correction behaviour.
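To make the retry-with-feedback idea concrete, here is a library-free sketch. It is not Instructor’s actual implementation: the validate function stands in for Pydantic validation, and stub_model stands in for a real API call.

```python
import json


def validate(payload: dict) -> list[str]:
    """Hypothetical checks standing in for Pydantic validation."""
    errors = []
    if not isinstance(payload.get("threat_actor"), str):
        errors.append("threat_actor must be a string")
    conf = payload.get("confidence")
    if (
        isinstance(conf, bool)
        or not isinstance(conf, (int, float))
        or not 0.0 <= conf <= 1.0
    ):
        errors.append("confidence must be a number between 0.0 and 1.0")
    return errors


def generate_validated(call_model, prompt: str, max_retries: int = 2) -> dict:
    """Call the model, validate, and re-ask with the errors appended on failure."""
    messages = [{"role": "user", "content": prompt}]
    errors: list[str] = []
    for _ in range(max_retries + 1):
        raw = call_model(messages)
        try:
            payload = json.loads(raw)
            errors = (
                validate(payload)
                if isinstance(payload, dict)
                else ["expected a JSON object"]
            )
        except json.JSONDecodeError as exc:
            errors = [f"invalid JSON: {exc}"]
        if not errors:
            return payload
        # Show the model its own answer plus the validation errors.
        messages.append({"role": "assistant", "content": raw})
        messages.append(
            {"role": "user", "content": "Fix these errors and answer again: " + "; ".join(errors)}
        )
    raise ValueError(f"no valid output after {max_retries} retries: {errors}")


def stub_model(messages: list[dict]) -> str:
    """Wrong on the first attempt, corrected once it has seen the feedback."""
    if len(messages) == 1:
        return '{"threat_actor": "unknown", "confidence": 87}'
    return '{"threat_actor": "unknown", "confidence": 0.87}'


result = generate_validated(stub_model, "Analyse this incident report: ...")
print(result)
```

The useful property is that the error text goes back to the model verbatim, so specific validation messages translate directly into better corrections.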
Pydantic AI: agents with typed I/O by default
Pydantic AI takes this further by making structured output a first-class concept in the agent framework itself. Agents have typed return types, and tool call results are validated on ingestion. If you’re building a multi-step agent rather than a single-turn extraction call, Pydantic AI is worth evaluating.
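import asyncio

from pydantic import BaseModel
from pydantic_ai import Agent

class ResearchSummary(BaseModel):
    key_findings: list[str]
    confidence: str
    next_steps: list[str]

agent = Agent(
    "anthropic:claude-sonnet-4-6",  # Pydantic AI expects a "provider:model" string
    output_type=ResearchSummary,  # named result_type in early releases
)

async def main() -> None:
    # await is only valid inside a coroutine, so the run is wrapped in main()
    result = await agent.run("Summarise the latest findings on...")
    print(result.output.key_findings)  # typed, validated (result.data in early releases)

asyncio.run(main())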
The framework wraps tool calls, handles retries, and surfaces validation errors as part of the agent run lifecycle. It plays well with LangGraph and other orchestration layers.
Practical patterns worth using
Discriminated unions for multi-step decisions. If your agent needs to return one of several possible action types, use a discriminated union with a type literal field. The model selects the action type and fills the appropriate fields — you get exhaustive type coverage in downstream handlers.
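A minimal Pydantic v2 sketch of this pattern follows. The three action types and the handler are hypothetical, chosen only to show the mechanics.

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter


# Hypothetical action types; each carries a literal "type" tag.
class SearchAction(BaseModel):
    type: Literal["search"]
    query: str


class EscalateAction(BaseModel):
    type: Literal["escalate"]
    reason: str
    severity: int = Field(ge=1, le=5)


class FinishAction(BaseModel):
    type: Literal["finish"]
    summary: str


# Pydantic reads the "type" field to decide which model to validate against.
AgentAction = Annotated[
    Union[SearchAction, EscalateAction, FinishAction],
    Field(discriminator="type"),
]
action_adapter = TypeAdapter(AgentAction)


def handle(action) -> str:
    """One branch per variant; a type checker can flag a missing case."""
    if isinstance(action, SearchAction):
        return f"searching for {action.query!r}"
    if isinstance(action, EscalateAction):
        return f"escalating at severity {action.severity}: {action.reason}"
    return f"done: {action.summary}"


action = action_adapter.validate_python(
    {"type": "escalate", "reason": "credential dump found", "severity": 4}
)
print(handle(action))
```

Because the tag selects the model up front, a validation error names the variant that failed instead of listing failures for every member of the union.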
Nested schemas for complex extraction. Don’t flatten everything into a single shallow object. Nested Pydantic models mirror domain structure and give the model clearer semantic context for each field.
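Optional fields with explicit null semantics. Use Optional[str] with a None default and document what None means in the field description. Avoid relying on field absence: providers differ in how they treat properties that aren’t listed as required, and some strict modes insist that every property is required, which leaves an explicit null as the only portable way to say “no value”.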
Validation with field validators. Add @field_validator methods to enforce domain constraints that the schema can’t express or that the provider doesn’t enforce: score ranges, membership in a list that changes at runtime, URLs restricted to domains you trust. Instructor will feed validation errors back to the model for self-correction.
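The nesting, explicit-null, and validator patterns can be combined in one small sketch, assuming Pydantic v2. The model names and fields are invented for illustration.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError, field_validator


class Indicator(BaseModel):
    value: str
    # None means "the report did not state a date", and the description says so.
    first_seen: Optional[str] = Field(
        default=None,
        description="ISO 8601 date the indicator was first observed; null if not stated.",
    )
    confidence: float

    @field_validator("confidence")
    @classmethod
    def confidence_in_unit_interval(cls, v: float) -> float:
        if not 0.0 <= v <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")
        return v


class IncidentReport(BaseModel):
    # Nested model: the indicators keep their own structure instead of being flattened.
    title: str
    indicators: list[Indicator]


report = IncidentReport.model_validate(
    {"title": "Example incident", "indicators": [{"value": "198.51.100.7", "confidence": 0.6}]}
)

try:
    Indicator(value="198.51.100.7", confidence=87)
    error_text = ""
except ValidationError as exc:
    # This message is the kind of feedback a retry loop can send back to the model.
    error_text = str(exc)

print(report.indicators[0].first_seen)
print(error_text)
```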
What structured output doesn’t solve
Schema enforcement gets you syntactic and type correctness. It doesn’t guarantee semantic correctness — the model can still fill a confidence: float field with 0.87 when 0.2 would be more accurate. For factual accuracy you still need evaluation pipelines, human review, or retrieval-augmented grounding.
Structured output is a reliability layer, not a correctness layer. Use it to eliminate a whole class of infrastructure failures, then invest separately in the harder problem of output quality.
Getting started
If you’re using a hosted API and Python, start with Instructor: it supports a broad range of providers and has a low barrier to entry. If you’re building a greenfield agent system and care about end-to-end typing, look at Pydantic AI. If you control your inference stack, Outlines gives you the strongest guarantees.
The investment pays off quickly. Removing parse errors and schema mismatches from your error logs is immediately measurable, and typed agent outputs make downstream code dramatically easier to reason about.