TL;DR:

  • Observability for AI agents means capturing traces of every LLM call, tool use, and decision — not just the final output
  • Langfuse is a widely used open-source option; it integrates with LangChain, LlamaIndex, and raw API calls, often in under 10 minutes
  • Without tracing, debugging multi-step agent failures is essentially guesswork

Your agent pipeline ran successfully for two weeks. Then, last Tuesday, it started returning incorrect answers and you have no idea why. The prompt hasn’t changed. The model hasn’t changed. But something is wrong, and your only diagnostic tool is print statements.

This is the observability problem for AI agents. Unlike a traditional API that either returns a correct response or throws an error, agents make multiple decisions, call multiple tools, and can fail silently — producing plausible-sounding wrong answers rather than obvious errors.

Proper observability is the difference between debugging in minutes and debugging in days.

What AI Agent Observability Actually Means

For traditional software, observability means logs, metrics, and traces. For AI agents, the same three pillars apply, but the content is different:

Traces capture the full execution path of a single agent run: every LLM call made, every tool invoked, the input and output at each step, latency, and token usage. A trace for a research agent might span five LLM calls, three web searches, and a document retrieval step — all linked under a single parent trace.
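To make the shape of a trace concrete, here is a minimal sketch of a trace tree in plain Python. The Span class, its fields, and the example run are illustrative assumptions, not the data model of Langfuse or any other tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """One step in an agent run: an LLM call, a tool call, or a retrieval."""
    name: str
    kind: str                 # e.g. "llm", "tool", "retrieval", "agent"
    latency_ms: float = 0.0
    tokens: int = 0
    children: List["Span"] = field(default_factory=list)

    def total_tokens(self) -> int:
        # Sum this span's tokens with those of every descendant
        return self.tokens + sum(c.total_tokens() for c in self.children)

    def count(self, kind: str) -> int:
        # Count spans of a given kind in this subtree, including this span
        own = 1 if self.kind == kind else 0
        return own + sum(c.count(kind) for c in self.children)


# A made-up research run: two LLM calls, one web search, one retrieval
trace = Span("research_agent", "agent", children=[
    Span("plan", "llm", latency_ms=820.0, tokens=450),
    Span("web_search", "tool", latency_ms=1300.0),
    Span("fetch_document", "retrieval", latency_ms=240.0),
    Span("synthesise", "llm", latency_ms=1900.0, tokens=1200),
])
```

Calling trace.total_tokens() or trace.count("llm") walks the whole tree, which is roughly what a tracing dashboard does when it rolls child spans up into a per-run summary.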

Metrics aggregate across many runs: average latency per step, token consumption by agent type, error rate, cost per session. These tell you whether your agent is getting slower, more expensive, or less reliable over time.
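As a rough illustration, the aggregation can be as simple as the sketch below. The run records and the summarise helper are hypothetical; a real backend would compute these from stored traces.

```python
from statistics import mean

# Hypothetical per-run summaries, as might be exported from a tracing backend
runs = [
    {"agent": "research", "latency_s": 4.0, "tokens": 1500, "error": False},
    {"agent": "research", "latency_s": 6.0, "tokens": 2500, "error": True},
    {"agent": "support", "latency_s": 2.0, "tokens": 500, "error": False},
]


def summarise(runs, agent):
    """Aggregate latency, token use, and error rate for one agent type."""
    selected = [r for r in runs if r["agent"] == agent]
    if not selected:
        return None
    return {
        "runs": len(selected),
        "avg_latency_s": mean(r["latency_s"] for r in selected),
        "total_tokens": sum(r["tokens"] for r in selected),
        "error_rate": sum(r["error"] for r in selected) / len(selected),
    }


research = summarise(runs, "research")
```

Tracking the same summary week over week is what surfaces a slow drift in latency or error rate that no single trace would reveal.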

Logs capture structured events — errors, fallbacks, retries — that don’t fit neatly into the trace hierarchy but matter for debugging specific incidents.
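One way to keep such events useful is to emit them as JSON lines that carry the trace ID, so they can be joined back to the trace later. This is a sketch using only the standard library; format_event, the logger name, and the field names are assumptions made for illustration.

```python
import json
import logging
from typing import Any


def format_event(trace_id: str, event: str, **fields: Any) -> str:
    """Render one agent event as a JSON line keyed by trace_id."""
    record = {"trace_id": trace_id, "event": event, **fields}
    return json.dumps(record, sort_keys=True)


logger = logging.getLogger("agent.events")

# A retry and a fallback from the same (made-up) agent run
line = format_event("trace-123", "retry", tool="web_search", attempt=2)
logger.warning(line)
logger.warning(format_event("trace-123", "fallback", reason="search_timeout"))
```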

Langfuse: The Practical Choice for Most Teams

Langfuse is an open-source LLM observability platform that has emerged as the default choice for agent developers who want a self-hostable, framework-agnostic option. It supports LangChain, LlamaIndex, LiteLLM, CrewAI, and raw API calls. There’s a managed cloud offering and a Docker Compose setup for self-hosting.

The core concept is simple: every LLM call and tool invocation generates a span, which is nested inside a parent trace representing one full agent run.

Getting Started: LangChain Integration

Langfuse’s LangChain callback handler is the fastest on-ramp:

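# Import path for the Langfuse Python SDK v2; newer releases may expose the
# handler under a different module, so check the docs for your version
from langfuse.callback import CallbackHandler

langfuse_handler = CallbackHandler(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="https://cloud.langfuse.com"
)

# agent_executor is your existing LangChain agent executor (for example, one
# built around create_tool_calling_agent) and user_query is the incoming
# question. Pass the handler when invoking the agent.
agent_executor.invoke(
    {"input": user_query},
    config={"callbacks": [langfuse_handler]}
)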

Every invocation now appears in your Langfuse dashboard with the full trace tree: which prompts were sent, what the model returned, which tools were called, token counts, and wall-clock time for each step.
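Hardcoding keys is fine for a first test, but most teams will want to read them from the environment. The sketch below assumes the LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST variable names; load_langfuse_config is a hypothetical helper, not part of the SDK.

```python
import os


def load_langfuse_config(env=os.environ):
    """Collect Langfuse credentials from environment variables.

    load_langfuse_config is a hypothetical helper, not part of the SDK.
    """
    config = {
        "public_key": env.get("LANGFUSE_PUBLIC_KEY", ""),
        "secret_key": env.get("LANGFUSE_SECRET_KEY", ""),
        "host": env.get("LANGFUSE_HOST", "https://cloud.langfuse.com"),
    }
    # Fail early with a clear message rather than silently sending no traces
    missing = [k for k in ("public_key", "secret_key") if not config[k]]
    if missing:
        raise RuntimeError(f"Missing Langfuse settings: {', '.join(missing)}")
    return config


# Example with an explicit mapping instead of the real environment
config = load_langfuse_config({
    "LANGFUSE_PUBLIC_KEY": "pk-lf-example",
    "LANGFUSE_SECRET_KEY": "sk-lf-example",
})
```

The resulting dictionary uses the same keyword names as the handler call shown earlier, so it could be passed along as CallbackHandler(**config).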

Manual Instrumentation for Custom Agents

If you’re building a custom agent with direct API calls, use the @observe decorator:

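from langfuse.decorators import observe, langfuse_context
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-6"

# as_type="generation" marks this observation as an LLM call (SDK v2 decorator API)
@observe(name="research_step", as_type="generation")
def research_step(query: str) -> str:
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": query}]
    )
    # The decorator captures the function's input and output; for a raw
    # Anthropic call, model and token usage are attached explicitly
    langfuse_context.update_current_observation(
        model=MODEL,
        usage={
            "input": response.usage.input_tokens,
            "output": response.usage.output_tokens,
        },
    )
    return response.content[0].text

@observe(name="full_research_agent")
def run_research_agent(user_query: str) -> str:
    plan = research_step(f"Plan research steps for: {user_query}")
    findings = research_step(f"Execute research plan: {plan}")
    return research_step(f"Synthesise findings into answer: {findings}")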

The @observe decorator creates nested spans automatically. The outer call run_research_agent becomes the parent trace; each research_step call becomes a child span. No manual span management required.

What to Look For in Traces

Once traces are flowing, the most valuable things to examine:

Prompt drift — your prompt template renders differently than you expected. Check the actual rendered prompt in the trace, not your template string. Variable interpolation errors, truncated context, and missing system prompts all appear here.
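A cheap guard against the interpolation case is to check templates and rendered prompts before they reach the model. The sketch assumes str.format-style templates; missing_variables and has_unfilled_placeholder are hypothetical helpers, and the regex is a heuristic rather than a full parser.

```python
import re
import string

# Matches a simple unfilled marker such as "{context}" in rendered text
PLACEHOLDER = re.compile(r"\{[A-Za-z_]\w*\}")


def missing_variables(template: str, values: dict) -> list:
    """Return placeholder names in `template` that `values` does not supply."""
    names = [
        field_name
        for _, field_name, _, _ in string.Formatter().parse(template)
        if field_name  # skip literal-only segments and empty "{}" fields
    ]
    return sorted({n for n in names if n not in values})


def has_unfilled_placeholder(rendered: str) -> bool:
    """Heuristic: does the rendered prompt still contain a '{name}' marker?"""
    return PLACEHOLDER.search(rendered) is not None


template = "Context: {context}\nQuestion: {question}"
missing = missing_variables(template, {"question": "What changed?"})
```

The same has_unfilled_placeholder check can be run over prompts pulled from stored traces to find runs where a variable was never substituted.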

Tool call patterns — which tools are called most often, which fail, which are called redundantly. A tool being called five times in a loop is usually a sign of a missing termination condition in your agent logic.
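Redundant calls are easy to flag once tool calls are available as data. The following sketch assumes each call has been exported as a dictionary with "tool" and "args" keys, which is an invented format; the threshold of three is arbitrary.

```python
import json
from collections import Counter


def repeated_tool_calls(calls, threshold=3):
    """Return {(tool, args_json): count} for calls repeated >= threshold times."""
    counts = Counter(
        # Serialise args with sorted keys so identical calls compare equal
        (c["tool"], json.dumps(c["args"], sort_keys=True)) for c in calls
    )
    return {key: n for key, n in counts.items() if n >= threshold}


# A made-up run where the agent repeats the same search five times
calls = (
    [{"tool": "web_search", "args": {"q": "langfuse pricing"}}] * 5
    + [{"tool": "fetch_url", "args": {"url": "https://example.com"}}]
)
suspicious = repeated_tool_calls(calls)
```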

Latency bottlenecks — which steps are slow? Often the culprit is a retrieval step fetching too many chunks or a model being called with an enormous prompt. The trace view shows timing for each span, so you can see which step dominates the total run time.
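The same question can be asked of exported span data. This is a minimal sketch; the span dictionaries and the slowest_spans helper are assumptions for illustration.

```python
def slowest_spans(spans, top=3):
    """Rank spans by duration and report each one's share of total time."""
    total = sum(s["seconds"] for s in spans)
    if total == 0:
        return []
    ranked = sorted(spans, key=lambda s: s["seconds"], reverse=True)[:top]
    return [
        (s["name"], s["seconds"], round(100 * s["seconds"] / total, 1))
        for s in ranked
    ]


# Made-up timings for one agent run
spans = [
    {"name": "plan", "seconds": 1.0},
    {"name": "retrieve_chunks", "seconds": 6.0},
    {"name": "synthesise", "seconds": 3.0},
]
report = slowest_spans(spans, top=2)
```

Here the retrieval step accounts for 60% of the run, which is the kind of signal that points at chunk count rather than the model.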

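Token cost attribution — Langfuse calculates cost per trace from the recorded token usage and the price defined for each model, so costs are only as accurate as the model name and usage attached to each LLM call. If your traces carry a user ID or session ID, you can filter by them to understand cost per customer or per workflow.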
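The arithmetic behind cost attribution is simple enough to sketch by hand. The model name and per-million-token prices below are placeholders, not real pricing, and the trace dictionaries are an invented export format.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens; real prices vary by model
PRICES = {"example-model": {"input": 3.00, "output": 15.00}}


def trace_cost(trace):
    """Cost of one trace from its token counts and the model's price."""
    price = PRICES[trace["model"]]
    return (
        trace["input_tokens"] * price["input"]
        + trace["output_tokens"] * price["output"]
    ) / 1_000_000


def cost_by_user(traces):
    """Sum trace costs per user_id."""
    totals = defaultdict(float)
    for t in traces:
        totals[t["user_id"]] += trace_cost(t)
    return dict(totals)


traces = [
    {"user_id": "alice", "model": "example-model",
     "input_tokens": 1_000_000, "output_tokens": 0},
    {"user_id": "alice", "model": "example-model",
     "input_tokens": 0, "output_tokens": 1_000_000},
    {"user_id": "bob", "model": "example-model",
     "input_tokens": 500_000, "output_tokens": 0},
]
totals = cost_by_user(traces)
```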

Other Options Worth Knowing

Arize Phoenix is a strong alternative, particularly for teams already using Arize for ML model monitoring. It supports OpenTelemetry natively, which makes it a good fit if you want vendor-neutral instrumentation.

Weave (by Weights & Biases) integrates tightly with W&B’s experiment tracking and is the natural choice for teams already using W&B for model training.

Helicone focuses on LLM gateway-level observability — useful if you want to instrument all API calls at the request level without modifying application code.

Braintrust combines evaluation and observability, which is useful once you’re beyond basic tracing and want to run systematic quality evaluations.

Integrating Observability Into Your Development Workflow

Tracing is most valuable when it’s part of your regular workflow, not a retrospective debugging tool:

  • Run traces during development and review them for every new agent capability before shipping
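  • Set up cost alerts for when a single session exceeds a threshold (check whether your Langfuse version offers this natively; otherwise a scheduled job over exported trace costs does the same job)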
  • Log user feedback (thumbs up/down) as Langfuse score events, linked to the trace — this lets you correlate feedback with specific execution patterns
  • Use trace data to build evaluation datasets: good traces become positive examples, bad traces become cases to fix
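Two of the habits above, turning feedback into evaluation data and flagging expensive sessions, can be sketched over exported trace records. The record fields (id, session_id, input, output, feedback, cost_usd) and both helper functions are hypothetical; adapt them to whatever your tracing backend exports.

```python
from collections import defaultdict


def build_eval_sets(traces):
    """Split traces with feedback into positive examples and cases to fix."""
    positives, to_fix = [], []
    for t in traces:
        score = t.get("feedback")
        if score is None:
            continue  # no user feedback recorded for this trace
        example = {"trace_id": t["id"], "input": t["input"], "output": t["output"]}
        (positives if score > 0 else to_fix).append(example)
    return positives, to_fix


def sessions_over_budget(traces, limit_usd):
    """Return session IDs whose summed trace cost exceeds limit_usd."""
    totals = defaultdict(float)
    for t in traces:
        totals[t["session_id"]] += t["cost_usd"]
    return sorted(s for s, cost in totals.items() if cost > limit_usd)


# Made-up records: feedback is 1 (thumbs up), 0 (thumbs down), or absent
traces = [
    {"id": "t1", "session_id": "s1", "input": "q1", "output": "a1",
     "feedback": 1, "cost_usd": 0.25},
    {"id": "t2", "session_id": "s1", "input": "q2", "output": "a2",
     "feedback": 0, "cost_usd": 1.0},
    {"id": "t3", "session_id": "s2", "input": "q3", "output": "a3",
     "cost_usd": 0.5},
]
positives, to_fix = build_eval_sets(traces)
flagged = sessions_over_budget(traces, limit_usd=1.0)
```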

The agents that work reliably in production aren’t the ones written by the best prompt engineers — they’re the ones with owners who can actually see what’s happening inside them.