Getting Started with the Claude API: Setup, Models, and First API Call

TL;DR:

The Claude API is production-ready, with clean Python and TypeScript SDKs; you can make your first call in under 10 minutes.
System prompts and the max_tokens parameter are where most of the control lives; understand both before going further.
Claude Haiku handles high-volume tasks cheaply; Claude Sonnet is the production workhorse; Claude Opus is for complex reasoning where quality justifies cost.

If you’ve been evaluating LLM providers and want to run your first Claude API call, this guide gets you from zero to a working integration — with context on the decisions that will actually matter in production.

Authentication and First Call

Start at console.anthropic.com — create an account, verify, and generate an API key. Store it as an environment variable; never hardcode it.
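
To fail fast when the key is missing, a small startup check can help. This is a minimal sketch: load_api_key is a hypothetical helper, not part of the SDK, and ANTHROPIC_API_KEY is the variable name the client reads by default.

```python
import os


def load_api_key(env=None):
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of the anthropic SDK.
    """
    env = os.environ if env is None else env
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ANTHROPIC_API_KEY is not set; export it in your shell or "
            "load it from a secrets manager instead of hardcoding it."
        )
    return key
```

Calling this once at startup surfaces a configuration problem immediately, rather than on the first request.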

Install the SDK:

pip install anthropic

Your first call:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Summarise the key risks in this contract: [contract text]"}
    ]
)
print(message.content[0].text)

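Two things to know upfront. First, max_tokens is required, and it is a hard cap rather than a target: Claude won’t infer how long a response should be, and a reply that reaches the cap is cut off with stop_reason set to "max_tokens". Second, the response text lives in message.content[0].text, not message.text. The content field is a list of content blocks because a single response can contain more than one block (for example, text alongside a tool-use request), so check each block’s type before reading .text.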
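
Reading content[0] works for simple replies, but a slightly more defensive reader is easy to write. The sketch below assumes only the response fields named above (content, each block's type and text, and stop_reason); collect_text and was_truncated are hypothetical helper names, and the fake response is a stand-in so the example runs without an API call.

```python
from types import SimpleNamespace


def collect_text(message):
    """Join the text of every text block in a Messages API response."""
    parts = [block.text for block in message.content if block.type == "text"]
    return "".join(parts)


def was_truncated(message):
    """True when generation stopped because it hit the max_tokens cap."""
    return message.stop_reason == "max_tokens"


# Stand-in for an SDK response object (illustrative data only).
fake = SimpleNamespace(
    content=[
        SimpleNamespace(type="text", text="Risk 1: uncapped liability. "),
        SimpleNamespace(type="tool_use", name="lookup", input={}),
        SimpleNamespace(type="text", text="Risk 2: auto-renewal."),
    ],
    stop_reason="max_tokens",
)

print(collect_text(fake))
print(was_truncated(fake))
```

Checking for truncation before parsing or displaying a reply avoids treating a cut-off answer as a complete one.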

System Prompts and Key Parameters

System prompts are the single most impactful lever for shaping Claude’s behaviour. Pass them in the system parameter, not as a user message:

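response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    # The system prompt is a top-level parameter, separate from the messages list.
    system="You are a senior financial analyst. Respond only with structured analysis. Never speculate beyond the provided data.",
    messages=[{"role": "user", "content": user_input}],  # user_input: the end user's text
)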

A well-written system prompt can replace dozens of lines of output parsing. Define the output format, persona, and constraints here.

Key parameters that affect production behaviour:

Parameter	Default	Notes
`max_tokens`	Required	Hard cap on output length; you pay only for tokens actually generated, so size it to the longest response you expect
`temperature`	1.0	Range 0.0–1.0; lower (0.2–0.4) for structured outputs, higher for creative tasks
`top_p`	—	Alternative to temperature; don’t set both
`stop_sequences`	—	Useful for structured extraction
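
One way to keep these settings consistent across call sites is to build the request arguments in one place. This is a minimal sketch under stated assumptions: extraction_request is a hypothetical helper, and the prompt wording, tag names, and numeric values are illustrative choices rather than recommendations from the API documentation.

```python
def extraction_request(document, model="claude-sonnet-4-5"):
    """Build keyword arguments for client.messages.create() for a JSON extraction task."""
    return {
        "model": model,
        "max_tokens": 512,               # cap sized for a short JSON object
        "temperature": 0.2,              # low temperature for structured output
        "stop_sequences": ["</json>"],   # stop generating at the closing tag
        "system": (
            "Extract the parties and dates. "
            "Reply with JSON wrapped in <json></json> tags."
        ),
        "messages": [{"role": "user", "content": document}],
    }


kwargs = extraction_request("Agreement between Acme Ltd and Beta GmbH dated 2024-01-15.")
# With a real client: response = client.messages.create(**kwargs)
print(sorted(kwargs))
```

Note that the helper sets temperature and leaves top_p out entirely, in line with the table's advice not to set both.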

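Streaming doesn’t make generation faster, but it lets users see the first words almost immediately instead of waiting for the full response, so it is worth enabling for any user-facing application where perceived latency matters: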

with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
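
In practice you usually want the full text as well as the live output, for logging or post-processing. A minimal sketch, assuming a hypothetical helper render_stream that accepts any iterable of text chunks (in real use you would pass stream.text_stream from the block above):

```python
def render_stream(chunks, write=print):
    """Echo text chunks as they arrive and return the full response text."""
    received = []
    for text in chunks:
        write(text, end="", flush=True)  # show each chunk immediately
        received.append(text)
    return "".join(received)


# Stand-in for stream.text_stream so the sketch runs offline.
full_text = render_stream(iter(["Hello", ", ", "world"]))
```

The write argument exists so the same loop can feed a web socket or UI callback instead of the terminal.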

Model Selection: Haiku, Sonnet, Opus

Anthropic’s model lineup in 2026 follows a tiered structure, and the right choice depends on your task type and volume.

Claude Haiku is the speed and cost tier. It’s suitable for classification, extraction, routing, and any task where latency matters more than nuanced reasoning. Priced at a small fraction of Opus (roughly 25x cheaper is the figure to sanity-check against the current pricing page, since the ratio shifts between model generations), it’s the right default for high-volume pipelines where heavier models would blow your budget.

Claude Sonnet is the production workhorse. It hits the performance-cost sweet spot for most real tasks: customer support, document analysis, code generation, RAG pipelines. This is where most teams land after initial experimentation.

Claude Opus is the reasoning tier. Deploy it for tasks that genuinely require extended reasoning — complex code review, multi-step analysis, cases where Sonnet makes mistakes you can’t accept. The cost difference is real, so benchmark against your actual task before defaulting to it.

A useful rule of thumb: start with Sonnet, test the failure cases, and upgrade to Opus only for the subset of tasks where quality gaps are measurable.
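
That rule of thumb can be written down as a small routing function. This is a sketch, not a prescribed pattern: pick_model and the task labels are hypothetical, and the model IDs are placeholders to replace with the IDs available in your account.

```python
# Placeholder model IDs; substitute the ones listed in your console.
MODEL_BY_TIER = {
    "haiku": "claude-haiku-4-5",
    "sonnet": "claude-sonnet-4-5",
    "opus": "claude-opus-4-1",
}

# Task types the article assigns to the speed and cost tier.
HIGH_VOLUME_TASKS = {"classification", "extraction", "routing"}


def pick_model(task_type, needs_opus=frozenset()):
    """Haiku for high-volume tasks, Opus only where measured, Sonnet otherwise."""
    if task_type in needs_opus:
        return MODEL_BY_TIER["opus"]
    if task_type in HIGH_VOLUME_TASKS:
        return MODEL_BY_TIER["haiku"]
    return MODEL_BY_TIER["sonnet"]


print(pick_model("classification"))
print(pick_model("code_review", needs_opus={"code_review"}))
```

The needs_opus set is meant to be filled from your own benchmarks, so the upgrade to Opus stays an explicit, measured decision.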

Prompt Caching for Cost Control

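If your prompt carries a long static prefix (detailed instructions, reference documents, tool schemas), prompt caching can cut costs significantly. Put cache_control on the last block of the static content; everything up to and including that block is cached. The example caches a reference document in the user turn, and the same marker works on system prompt blocks: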

messages=[{
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": large_reference_document,
            "cache_control": {"type": "ephemeral"}
        },
        {"type": "text", "text": user_question}
    ]
}]

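Cache hits are billed at 10% of the normal input token price, while the call that writes the cache pays a premium over the normal rate, so caching pays off from the second call onward, provided the calls arrive before the cache entry expires. For applications with a fixed large context (codebase analysis, document Q&A), this can reduce input costs by 70–80% on repeated calls. If you’re running anything at scale, this is worth implementing from day one.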
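
A quick back-of-the-envelope calculation shows where a figure in that range comes from. The sketch below works in price-relative token units, so no dollar prices are needed. The 10% read multiplier is the one quoted above; the 1.25x write multiplier, the token counts, and the assumption that every call lands while the cache entry is still live are illustrative assumptions.

```python
CACHE_READ_MULTIPLIER = 0.10   # cache hits: 10% of the normal input price
CACHE_WRITE_MULTIPLIER = 1.25  # assumption: premium for writing the cache entry


def input_savings(cached_tokens, fresh_tokens, calls):
    """Fraction of input cost saved by caching, in price-relative token units."""
    without_cache = calls * (cached_tokens + fresh_tokens)
    # First call writes the cache; the question itself is billed normally.
    first_call = cached_tokens * CACHE_WRITE_MULTIPLIER + fresh_tokens
    # Every later call reads the cached prefix at the discounted rate.
    later_calls = (calls - 1) * (cached_tokens * CACHE_READ_MULTIPLIER + fresh_tokens)
    return 1 - (first_call + later_calls) / without_cache


savings = input_savings(cached_tokens=50_000, fresh_tokens=200, calls=10)
print(f"{savings:.1%}")  # prints 78.2%
```

With a single call the result is negative, which reflects the write premium: caching only helps when the cached prefix is actually reused.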

Rate Limits and Production Readiness

The Claude API enforces requests-per-minute (RPM) and tokens-per-minute (TPM) limits that vary by usage tier; check yours at console.anthropic.com. For production, plan for three things. First, exponential backoff on 429 responses: the SDK retries automatically, and the retry count is configurable through the client’s max_retries option. Second, request queuing if you run parallel workloads near your TPM ceiling. Third, structured output validation: Claude is reliable at following JSON schemas, but validate before you trust.
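
For the validation step, even a few explicit checks catch most malformed outputs before they reach downstream code. A minimal sketch using only the standard library; parse_invoice and its schema (a string vendor and a numeric total) are made up for illustration.

```python
import json


def parse_invoice(raw_text):
    """Parse model output as JSON and check the fields we rely on.

    The schema here (vendor: str, total: number) is a made-up example.
    """
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("vendor"), str):
        raise ValueError("missing or non-string 'vendor'")
    total = data.get("total")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        raise ValueError("missing or non-numeric 'total'")
    return data


print(parse_invoice('{"vendor": "Acme Ltd", "total": 1250.5}'))
```

Raising a single exception type makes it straightforward to retry the request or route the input to a fallback path.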

The API has strong uptime reliability, but build defensive retry logic anyway. Transient errors at 3 AM are the ones that cause incidents.
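
If you need retry behaviour beyond what the SDK provides, for instance around a whole pipeline step, exponential backoff with jitter is short to write by hand. This is a generic sketch, not the SDK's implementation: RetryableError is a hypothetical stand-in for whatever rate-limit or transient error your client raises, and the delay values are arbitrary defaults.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical stand-in for a 429 or transient server error."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying on RetryableError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: a random wait avoids synchronized retries across workers.
            sleep(random.uniform(0, ceiling))


# Demo with a function that fails twice, then succeeds; sleep is stubbed out.
state = {"calls": 0}


def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RetryableError("rate limited")
    return "ok"


delays = []
result = call_with_backoff(flaky, sleep=delays.append)
print(result, len(delays))
```

Injecting sleep as a parameter keeps the retry logic testable without real waiting.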

Bottom Line

The Claude API is one of the cleaner LLM integrations to work with — the SDK is well-maintained, the documentation is thorough, and Claude’s instruction-following is reliable enough that system prompts alone can replace a lot of prompt engineering gymnastics. Start with Sonnet, use Haiku for volume, and treat prompt caching as a first-class cost optimisation from day one.