Testing AI Agents: Why Unit Tests Fail and What Actually Works

TL;DR:

Standard unit tests fail for AI agents because LLM outputs are non-deterministic: you need evaluation frameworks, not exact-match assertions alone
Golden datasets (curated input/expected output pairs) are the foundation of reliable regression testing
Production monitoring isn’t optional — LLM behaviour can drift with model updates even if your code hasn’t changed

Testing AI agents is one of the most consistently underinvested areas in AI development. It’s tempting to eyeball outputs, decide they’re “good enough,” and ship. Then a model update, a prompt change, or an edge case in production breaks something that was never tested — and you find out from a customer. Here’s how to build a testing discipline that actually catches problems.

Why Standard Unit Tests Break Down

The standard software test pattern (call function with input X, assert output equals expected Y) falls apart for LLMs because LLM outputs are non-deterministic. With temperature above zero, the same prompt can produce different outputs on different runs. Even at temperature 0, outputs are not guaranteed to be identical across runs, and model updates can change behaviour without touching your code.

What you actually need to measure is whether the output meets certain quality criteria (factually accurate, on-topic, correct format), whether the agent takes the right actions (calls the right tool, in the right sequence), whether it stays within bounds (no hallucination, no harmful content, no scope creep), and whether it degrades gracefully on edge cases.

Some of these checks (output format, which tool was called) can still be deterministic assertions, but judging quality usually requires an evaluator, often another LLM. Counterintuitive, but necessary.
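As a rough sketch of what criteria-based evaluation can look like, the Python below runs a cheap deterministic format check first and then hands the fuzzy quality question to a judge. Everything here is illustrative: evaluate_output, EvalResult, and stub_judge are hypothetical names, and the stub stands in for a real LLM call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List, Set

# A judge takes (rubric, output) and returns a score in [0, 1].
# In a real setup this would wrap an LLM call; here it is a pluggable callable.
Judge = Callable[[str, str], float]


@dataclass
class EvalResult:
    passed: bool
    reasons: List[str]


def evaluate_output(output: str, allowed_labels: Set[str], rubric: str,
                    judge: Judge, threshold: float = 0.7) -> EvalResult:
    """Apply a deterministic format check, then a rubric-based judge."""
    reasons: List[str] = []

    # Deterministic check: the output must be exactly one of the allowed labels.
    if output.strip().lower() not in allowed_labels:
        reasons.append(f"unexpected format: {output!r}")

    # Rubric check: delegate the quality question to the judge.
    score = judge(rubric, output)
    if score < threshold:
        reasons.append(f"judge score {score:.2f} below {threshold:.2f}")

    return EvalResult(passed=not reasons, reasons=reasons)


def stub_judge(rubric: str, output: str) -> float:
    """Hypothetical stand-in for an LLM judge so the sketch runs offline."""
    return 1.0 if "urgent" in output.lower() else 0.0


result = evaluate_output("Urgent", {"urgent", "non-urgent"},
                         "Response names an urgency level", stub_judge)
print(result.passed, result.reasons)
```

Keeping the judge behind a plain callable means the deterministic checks can run on every commit, while the more expensive LLM-graded checks can be swapped in or out per environment.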

Evaluation Frameworks Worth Using

LangSmith is the most mature evaluation and tracing tool for LangChain-based agents, but it works with any LLM application. It lets you trace every run with full step-by-step visibility, create datasets of test cases and run evals against them, define custom evaluators (including LLM-as-judge for quality scoring), and track metrics across experiments and model versions.

For teams not on LangChain, PromptFoo is the strongest open-source alternative. Define your test cases in YAML, specify assertions (including LLM-graded ones), and run evaluations locally or in CI:

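# promptfooconfig.yaml
prompts:
  - "Classify this email as urgent or non-urgent: {{email}}"
# Each provider is run against every test case
providers:
  - openai:gpt-4o
  - anthropic:messages:claude-sonnet-4-5
tests:
  - vars:
      email: "Server is down, production is affected, need immediate response"
    assert:
      # Case-insensitive match on the expected label
      - type: icontains
        value: "urgent"
      # "non-urgent" also contains "urgent", so rule it out explicitly
      - type: not-icontains
        value: "non-urgent"
      # LLM-graded check on the response format
      - type: llm-rubric
        value: "Response should be a single word: urgent or non-urgent"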

PromptFoo supports running evals across multiple models simultaneously — useful when you’re evaluating whether you can swap to a cheaper model without losing quality.

Building a Golden Dataset

A golden dataset is a curated collection of input/expected output pairs that represent the range of real cases your agent handles. It’s the foundation of all regression testing.

To build one, collect 50–100 real inputs from early users or internal testing. Include your hardest edge cases explicitly: ambiguous inputs, adversarial phrasing, rare but important scenarios. For each input, define what a good output looks like (either an exact expected output, or a rubric for evaluating quality), version the dataset, and treat it like code — commit it to your repo.
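One way to keep such a dataset in the repo is a JSONL file with one case per line. The sketch below assumes a hypothetical schema (case_id, input, expected, rubric, tags); none of these field names come from a particular tool, so adapt them to whatever your eval framework expects.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GoldenCase:
    case_id: str                    # stable ID so failures can be tracked over time
    input: str                      # the raw input sent to the agent
    expected: Optional[str] = None  # exact expected output, if one exists
    rubric: Optional[str] = None    # otherwise, a rubric for a grader
    tags: List[str] = field(default_factory=list)  # e.g. "edge-case", "adversarial"


def load_golden_set(jsonl_text: str) -> List[GoldenCase]:
    """Parse one JSON object per line; reject cases with nothing to grade against."""
    cases = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        case = GoldenCase(**json.loads(line))
        if case.expected is None and case.rubric is None:
            raise ValueError(f"line {line_no}: case needs 'expected' or 'rubric'")
        cases.append(case)
    return cases


# Illustrative sample data, not real user inputs.
SAMPLE = """
{"case_id": "email-001", "input": "Server is down, production is affected", "expected": "urgent"}
{"case_id": "email-002", "input": "fyi, lunch menu attached??", "rubric": "Classified as non-urgent", "tags": ["edge-case"]}
"""

golden = load_golden_set(SAMPLE)
print(len(golden), golden[1].tags)
```

Failing loudly on a case that has neither an expected output nor a rubric keeps ungradable entries from silently inflating the dataset.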

The dataset should grow over time. Every bug report that reaches production should result in a new test case added to the golden set. That’s how you prevent regressions.

For agentic workflows where the output is a sequence of tool calls rather than just text, log the full action trace for your “known good” examples. Test not just the final output but the path taken.

Regression Testing on Model Updates

LLM providers update models on their own schedules, not yours. When OpenAI releases a new GPT-4o checkpoint or Anthropic updates Claude Sonnet, behaviour can change for your specific prompts — sometimes better, sometimes worse.

To protect yourself: pin model versions where possible (e.g., gpt-4o-2024-11-20 rather than gpt-4o), run your golden dataset against new model versions before migrating, track a baseline score on your eval suite and alert when a new version drops below it, and for critical workflows, run old and new versions in parallel for a week before full cutover.
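The baseline check can be as simple as comparing pass rates. This is a minimal sketch under stated assumptions: check_regression and the 2-point tolerance are hypothetical choices, and the numbers are made up for illustration.

```python
from typing import Dict, List


def pass_rate(results: List[bool]) -> float:
    """Fraction of golden cases that passed (0.0 for an empty run)."""
    return sum(results) / len(results) if results else 0.0


def check_regression(baseline: float, candidate_results: List[bool],
                     tolerance: float = 0.02) -> Dict[str, object]:
    """Compare a candidate model's pass rate with the stored baseline."""
    score = pass_rate(candidate_results)
    return {
        "score": score,
        "baseline": baseline,
        # Allow a small tolerance so ordinary run-to-run noise does not alert.
        "regressed": score < baseline - tolerance,
    }


# Hypothetical numbers: baseline from the pinned model, results from the candidate.
report = check_regression(baseline=0.90,
                          candidate_results=[True] * 17 + [False] * 3)
print(report["regressed"])  # True: 17/20 = 0.85 is below 0.90 - 0.02
```

In CI, a regressed result would typically fail the job or post an alert. The tolerance is a judgment call: too tight and non-determinism triggers false alarms, too loose and real drops slip through.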

This sounds like overhead, but a single production incident from an unexpected model change costs more than the CI eval time.

Evaluating Multi-Step Agents

Single LLM calls are simpler to evaluate than multi-step agents. For agents taking sequences of actions, you’ve got a few useful approaches.

Trajectory evaluation asks whether the agent takes the correct sequence of steps. Define the expected tool-call sequence for your test cases and score how closely the actual trajectory matches.
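One possible way to score "how closely" is the longest common subsequence between the expected and actual tool-call lists, normalised by the longer list. This is just one metric among several reasonable ones, and the tool names below are hypothetical.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two tool-call lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def trajectory_score(expected: List[str], actual: List[str]) -> float:
    """Score in [0, 1]: in-order overlap, penalising missing and extra calls."""
    if not expected and not actual:
        return 1.0
    return lcs_length(expected, actual) / max(len(expected), len(actual))


# Hypothetical tool names for a refund workflow.
expected = ["search_orders", "get_order", "issue_refund"]
actual = ["search_orders", "get_order", "send_email", "issue_refund"]
print(trajectory_score(expected, actual))  # 0.75: all 3 expected calls in order, 1 extra
```

Dividing by the longer list means an agent that makes every expected call but pads the run with unnecessary ones still loses points, which is usually what you want when extra tool calls cost money or time.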

Outcome evaluation asks whether the agent completed the task correctly, regardless of path. For tasks where multiple paths are valid, this is more appropriate than strict trajectory matching.

Intermediate state checking is useful for long-running agents — check not just the final output but intermediate states. Did the agent correctly summarise a document before passing it to the next step?

LangSmith handles trajectory evaluation natively. For custom frameworks, log intermediate states explicitly and write evaluators against them.
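For a custom framework, evaluators over logged intermediate states might look roughly like the following. The trace format (a list of dicts with name and output keys) and the step names are assumptions made for this sketch, not a standard.

```python
from typing import Callable, Dict, List

Step = Dict[str, str]          # {"name": ..., "output": ...}
Check = Callable[[str], bool]  # returns True when the step output looks acceptable


def check_intermediate_states(trace: List[Step],
                              checks: Dict[str, Check]) -> List[str]:
    """Return the names of steps that are missing from the trace or fail their check."""
    # If a step name repeats in the trace, the last occurrence is the one checked.
    outputs = {step["name"]: step["output"] for step in trace}
    failures = []
    for name, check in checks.items():
        if name not in outputs or not check(outputs[name]):
            failures.append(name)
    return failures


# Hypothetical logged trace from a two-step agent run.
trace = [
    {"name": "summarise", "output": "Q3 revenue rose 12%; churn was flat."},
    {"name": "draft_reply", "output": ""},
]
checks = {
    "summarise": lambda out: 0 < len(out) <= 200,     # non-empty and concise
    "draft_reply": lambda out: len(out.strip()) > 0,  # must produce some text
}
print(check_intermediate_states(trace, checks))  # ['draft_reply']
```

The checks here are deliberately cheap. A step like summarisation could equally be handed to an LLM judge with a rubric, at the cost of slower and less repeatable runs.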

Production Monitoring

Testing before deployment is necessary, but it’s not sufficient. Once you’re live, log all inputs and outputs for sampling-based quality review. Set up automated anomaly detection on output length, response time, and error rate, since sudden changes often mean something’s broken. Run periodic eval sweeps on samples of production traffic, scored with the same evaluators you use for your golden dataset (with any personal data stripped, which matters for GDPR compliance). And monitor hallucination rate on factual claims if your agent makes assertions about real-world data.
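Anomaly detection doesn’t have to start sophisticated. A simple z-score against recent history, sketched below, is often enough to flag a sudden shift in output length; the threshold of 3 and the sample numbers are assumptions for illustration only.

```python
from statistics import mean, stdev
from typing import List


def is_anomalous(history: List[float], latest: float,
                 z_threshold: float = 3.0) -> bool:
    """Flag `latest` if it is more than z_threshold standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu  # any change from a perfectly flat history is notable
    return abs(latest - mu) / sigma > z_threshold


# Hypothetical daily average output lengths, in tokens.
daily_avg_length = [212, 198, 205, 220, 201, 209, 215]
print(is_anomalous(daily_avg_length, 207))  # False: within normal variation
print(is_anomalous(daily_avg_length, 540))  # True: far outside recent history
```

The same function can be pointed at response time or error rate. For metrics with strong daily or weekly cycles, a z-score over a flat window will raise false alarms, so compare like with like (for example, this Monday against previous Mondays).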

Tools like LangSmith, Arize AI, and Traceloop all support production monitoring. For simpler setups, structured logging to a data warehouse with a weekly eval job covers the basics.

Bottom Line

Testing AI agents requires a fundamentally different approach than testing deterministic software — but the engineering discipline is the same. Define what good looks like, measure against it systematically, and don’t let model updates or code changes ship without running your eval suite. The teams that do this ship faster because they catch regressions before users do.