LangGraph: Building Stateful, Controllable AI Agents That Don't Go Off the Rails

TL;DR:

LangGraph models agent workflows as directed graphs (nodes = actions, edges = transitions), giving you precise control over state and execution flow
Built-in persistence and checkpointing let you pause, resume, and branch agent runs — essential for long-running tasks and human approval gates
Best suited for complex workflows where ReAct-style agents are too unpredictable; pairs well with LangChain but works standalone

Most developers’ first experience with LLM agents is a ReAct loop: the model reasons, calls a tool, reasons again, calls another tool, and so on until it has an answer or hits the token limit. For simple tasks, this works fine. For anything complex — multi-step pipelines, workflows requiring human sign-off, agents that need to remember context across many turns — it falls apart fast.

LangGraph, built by the LangChain team and now widely used independently, solves this by treating agent workflows as what they actually are: graphs. Nodes do work; edges decide what happens next. State flows through the graph and persists between steps. The result is agent architectures that are understandable, testable, and controllable.

The Core Model

In LangGraph, every workflow is a StateGraph. You define:

State — a typed dictionary (using Python TypedDict or Pydantic) that carries data through the workflow
Nodes — Python functions or LLM chains that read state, do work, and return updates
Edges — connections between nodes, which can be unconditional or conditional (branching based on state)

A minimal example:

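from typing import TypedDict

from langgraph.graph import StateGraph, END

# Assumption: `llm` is a LangChain chat model created elsewhere,
# for example a ChatAnthropic instance from langchain-anthropic.

class AgentState(TypedDict):
    messages: list
    next_step: str

def call_llm(state: AgentState) -> dict:
    # Call the LLM with the current messages and append its reply
    response = llm.invoke(state["messages"])
    return {"messages": state["messages"] + [response]}

def call_tool(state: AgentState) -> dict:
    # Hypothetical placeholder: run the tool the LLM asked for and append its output
    tool_output = "tool output goes here"
    return {"messages": state["messages"] + [tool_output]}

def route(state: AgentState) -> str:
    # Decide the next node based on the latest LLM output
    if "DONE" in state["messages"][-1].content:
        return "end"
    return "call_tool"

graph = StateGraph(AgentState)
graph.add_node("llm", call_llm)
graph.add_node("tool", call_tool)
# Map the router's return values to node names (or END)
graph.add_conditional_edges("llm", route, {"end": END, "call_tool": "tool"})
graph.add_edge("tool", "llm")
graph.set_entry_point("llm")

app = graph.compile()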

This gives you a loop where the LLM runs, a routing function decides whether to call a tool or finish, and the tool result feeds back to the LLM. It’s essentially a ReAct agent — but now you can see exactly where it is in the graph at any point, and you can interrupt it.

Why State Management Changes Everything

The killer feature isn’t the graph model itself — it’s what that model enables: persistent, inspectable state.

LangGraph’s checkpointing system can save the full state of a graph run after every node execution. You can use SQLite for local development or Postgres in production. This unlocks several things that are genuinely hard to do with standard agent loops:

Long-running workflows: An agent researching a competitive landscape might take 30 minutes and dozens of tool calls. With checkpointing, you can resume a run that was interrupted, or replay from any step.

Human-in-the-loop: You can pass interrupt_before or interrupt_after, each a list of node names, when compiling the graph. When a run reaches one of those nodes, it pauses and waits. A human reviews the state, optionally edits it, then resumes the run by invoking the graph again on the same thread. This is how you build approval flows for agents that might take consequential actions.

Time travel debugging: Because every step is persisted, you can roll back to an earlier checkpoint, change something in the state, and replay forward. Invaluable for debugging complex agent behaviours.

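# Compile with checkpointing
# (SqliteSaver ships in the separate langgraph-checkpoint-sqlite package)
from langchain_core.messages import HumanMessage
from langgraph.checkpoint.sqlite import SqliteSaver

user_message = HumanMessage(content="Summarise our top three competitors")  # example input

with SqliteSaver.from_conn_string(":memory:") as memory:
    app = graph.compile(checkpointer=memory, interrupt_before=["tool"])

    # Run until the graph pauses before the "tool" node
    config = {"configurable": {"thread_id": "run-1"}}
    result = app.invoke({"messages": [user_message]}, config)

    # Human reviews, then continues
    app.invoke(None, config)  # Passing None resumes from the saved checkpoint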

Multi-Agent Patterns

LangGraph’s graph model composes naturally into multi-agent architectures. The most common pattern is a supervisor node that routes work to specialist subgraphs:

A supervisor LLM reads the task and decides which specialist to call (researcher, coder, writer)
Each specialist is its own graph with its own state
Results flow back to the supervisor, which decides whether the task is complete or needs another specialist

Each subgraph can be compiled separately and added as a node to the parent graph. This means you can test specialists in isolation before wiring them together — a significant improvement over monolithic agent systems.

LangGraph vs Raw LangChain vs Other Frameworks

vs LangChain LCEL: LangChain’s Expression Language is excellent for linear chains: prompt → LLM → parser → next prompt. Once you need loops, non-trivial branching, or state that persists across steps, LangGraph is the better fit.

vs CrewAI: CrewAI offers a higher-level abstraction focused on role-playing agents with task delegation. It’s faster to prototype but gives you less control over execution flow. LangGraph is lower-level and more flexible.

vs raw Python: You could implement state machines manually. LangGraph provides the persistence, streaming, and debugging infrastructure so you don’t have to.
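For comparison, a hand-rolled version of the same loop fits in a few lines of plain Python. This sketch uses only the standard library and is not LangGraph code; run_machine and the node functions are hypothetical names. It shows the easy part. Checkpointing, interrupts, and streaming are what you would still have to build on top.

```python
from typing import Callable, Optional

State = dict
Router = Callable[[str, State], Optional[str]]


def run_machine(nodes: dict, router: Router, start: str, state: State, max_steps: int = 25):
    """Run nodes until the router returns None; return the final state and the path taken."""
    current = start
    path = []
    for _ in range(max_steps):
        state = {**state, **nodes[current](state)}  # merge the node's update into state
        path.append(current)
        next_node = router(current, state)
        if next_node is None:
            return state, path
        current = next_node
    raise RuntimeError("step limit reached without finishing")


def llm_node(state: State) -> dict:
    # Stand-in for a model call: finish once a tool result is present.
    return {"done": "tool_result" in state}


def tool_node(state: State) -> dict:
    # Stand-in for a tool call.
    return {"tool_result": "42"}


def router(current: str, state: State) -> Optional[str]:
    if current == "tool":
        return "llm"
    return None if state["done"] else "tool"


final_state, path = run_machine(
    {"llm": llm_node, "tool": tool_node}, router, "llm", {"question": "6 * 7?"}
)
print(path, final_state)
```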

When to Use LangGraph

LangGraph is the right choice when:

Your workflow has conditional branching (different paths based on LLM output or tool results)
You need human-in-the-loop approval at specific points
The task runs long enough that failure recovery matters
You want to stream intermediate steps to a UI
You’re building a multi-agent system where coordination matters

It’s probably overkill for simple question-answering agents or single-tool lookups. For those, a direct API call or a simple ReAct chain is less code and easier to maintain.

Getting Started

pip install langgraph langgraph-checkpoint-sqlite langchain-anthropic

LangGraph Studio, the visual debugging environment, runs locally and shows you the graph structure alongside state at each checkpoint — highly recommended for development. The LangChain team also provides LangGraph Cloud for managed deployment.

The official tutorials cover the main patterns well. Start with the basic ReAct agent example, then add a checkpointer, then experiment with interrupt_before. The concepts build on each other quickly.