
How Much Do AI Agents Actually Cost? A Real-World Breakdown

Your AI agent just resolved a support ticket. How much did that cost?

Not the LLM API call — the whole thing. The 5 LLM calls, the 3 tool invocations, the knowledge base lookup, and the email API call. Most teams can't answer this question. We couldn't either, so we instrumented our own workflows and started measuring.

Here's the anatomy of one task:

Real cost breakdown — support ticket resolution
“Resolve support ticket” — $1.10 total

| Category           | Item                       | Cost   |
|--------------------|----------------------------|--------|
| LLM calls (×5)     | Classify ticket            | $0.01  |
|                    | Analyze context            | $0.18  |
|                    | Generate draft             | $0.22  |
|                    | Refine response            | $0.31  |
|                    | Summarize                  | $0.08  |
|                    | LLM total                  | $0.80  |
| Tool calls (×3)    | MCP: search-kb             | $0.05  |
|                    | MCP: get-customer          | $0.02  |
|                    | API: send-email            | $0.10  |
|                    | Tool total                 | $0.17  |
| External APIs (×4) | Serper: 3 searches         | $0.03  |
|                    | Pinecone: vector query     | $0.04  |
|                    | Retries (2 failed parses)  | $0.06  |
|                    | SendGrid: send email       | $0.001 |

Total task cost: $1.10 (LLM 73%, MCP tools 15%, APIs 12%)

Calculated from a support-resolution workflow using Claude Sonnet for analysis and Haiku for classification. 5 LLM calls averaging 8K input / 2K output tokens each at Anthropic's published rates. Tool and API costs based on published per-call pricing from each service. Your numbers will vary.

The LLM calls cost $0.80, but the total task cost was $1.10 — the MCP tool calls and external API fees added another $0.30 that wouldn't show up in your LLM provider dashboard. In this example, non-LLM costs are 27% of the total. In workflows with heavier API use (web scraping, data enrichment, multi-step searches), we've seen that ratio flip, with non-LLM costs exceeding LLM costs.
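For reference, the arithmetic behind any single line item is just tokens times a per-million rate, summed per call. Here's a minimal sketch; the model names and rates are illustrative (Sonnet- and Haiku-class pricing), so substitute whatever your provider currently charges:

```typescript
// Per-call LLM cost from token counts and per-million rates.
// Rates below are illustrative -- check your provider's pricing page.
const RATES = {
  "claude-sonnet": { inputPerM: 3.0, outputPerM: 15.0 },
  "claude-haiku": { inputPerM: 1.0, outputPerM: 5.0 },
} as const;

function llmCallCost(
  model: keyof typeof RATES,
  inputTokens: number,
  outputTokens: number
): number {
  const r = RATES[model];
  return (inputTokens / 1e6) * r.inputPerM + (outputTokens / 1e6) * r.outputPerM;
}

// One Sonnet-class call at 8K input / 2K output:
console.log(llmCallCost("claude-sonnet", 8_000, 2_000)); // ≈ $0.054
```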

What an AI agent task actually costs

These ranges are calculated from provider pricing pages and publicly reported production metrics:

| Task type                  | LLM calls | Tool calls | Typical cost | Notes                                       |
|----------------------------|-----------|------------|--------------|---------------------------------------------|
| Simple chatbot response    | 1         | 0          | $0.01-0.05   | Single turn, one model call                 |
| Support ticket resolution  | 3-8       | 2-5        | $0.12-0.50   | Multi-turn with KB search                   |
| Multi-step research agent  | 10-20     | 5-10       | $2-8         | Context grows without management            |
| Code review agent          | 5-15      | 3-8        | $1-5         | Large context windows (50-100K tokens)      |
| Multi-agent team           | 15-50+    | 10-20+     | $5-50+       | Each agent maintains its own context window |

The jump from "chatbot" to "agent" is where costs explode. A chatbot is one API call. An agent is a loop of calls, tool invocations, and context accumulation — and each iteration makes the next one more expensive as the context window grows.

LLM pricing reference (March 2026)

Prices change frequently — check Anthropic, OpenAI, and Google pricing pages for current rates. This section is a reference snapshot, not a comprehensive comparison.

The key insight isn't the absolute prices — it's the range. The cheapest production-quality API model costs $0.10/M input tokens (Gemini Flash-Lite). The most expensive reasoning model costs $15/M (o1). For agents making dozens of calls per task, model selection per step is the single biggest cost lever.

| Provider  | Cheapest model | Cost (input/M) | Flagship       | Cost (input/M) |
|-----------|----------------|----------------|----------------|----------------|
| Anthropic | Haiku 4.5      | $1.00          | Opus 4.6       | $5.00          |
| OpenAI    | GPT-4o-mini    | $0.15          | o1 (reasoning) | $15.00         |
| Google    | Flash-Lite     | $0.10          | Gemini 2.5 Pro | $1.25          |

Key discount mechanisms: Anthropic prompt caching gives 90% off input on cache hits. All three providers offer batch APIs at 50% off with 24-hour delivery.
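Those discounts compound. A rough sketch of the effective input rate, assuming some fraction of input tokens hits the cache (90% off) and batch-eligible traffic takes the 50% discount (this ignores the small surcharge providers charge to write cache entries):

```typescript
// Effective input cost per million tokens under caching and batch discounts.
// cacheHitRate: fraction of input tokens served from cache (billed at 10%).
// batch: whether the call goes through the 50%-off batch API.
function effectiveInputRate(
  baseRatePerM: number,
  cacheHitRate: number,
  batch: boolean
): number {
  const cached = baseRatePerM * 0.1 * cacheHitRate; // cache hits at 10% of rate
  const uncached = baseRatePerM * (1 - cacheHitRate); // misses at full rate
  const rate = cached + uncached;
  return batch ? rate * 0.5 : rate;
}

// $3/M model with an 80% cache hit rate, interactive (non-batch) traffic:
console.log(effectiveInputRate(3.0, 0.8, false)); // $0.84/M, a 72% reduction
```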

Warning on reasoning models: o-series models consume internal "thinking" tokens billed as output but never returned to your application. A call that returns 500 visible output tokens might bill you for 5,000+. Always monitor actual billed tokens, not visible output.
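A cheap defensive check is to compare billed output tokens against the visible tokens you actually received and alert when the gap is large. Pure arithmetic, no SDK assumptions:

```typescript
// Ratio of billed output tokens to visible output tokens. A ratio near 1
// means you got what you paid for; a large ratio means hidden reasoning tokens.
function hiddenOutputRatio(
  billedOutputTokens: number,
  visibleOutputTokens: number
): number {
  return billedOutputTokens / Math.max(visibleOutputTokens, 1);
}

// The example above: 500 visible tokens, 5,000 billed.
const ratio = hiddenOutputRatio(5_000, 500); // 10x
if (ratio > 3) {
  console.warn(`${ratio.toFixed(1)}x hidden output tokens -- check reasoning-model spend`);
}
```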

What about self-hosted models? Running Llama, Mistral, or DeepSeek on your own GPU infrastructure eliminates per-token API costs but introduces capex (GPU hardware or cloud instances), ops burden (serving infrastructure, scaling, monitoring), and potentially lower quality on complex tasks. For teams processing millions of requests per month, self-hosting can be 3-10x cheaper. For most teams under 100K requests/month, the ops cost exceeds the API savings. This is a separate analysis — we're focused on API-based agents here.

Why agent costs are hard to predict

1. Non-deterministic execution paths. The same input can trigger 3 LLM calls or 30, depending on what the model decides to do. In our own testing, the same workflow showed p50 costs of $0.50 and p95 costs of $8+ — a 16x variance on identical input shapes. You can't forecast a monthly budget from averages alone.
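This is why we log every task's total cost and read percentiles, not means. A nearest-rank percentile helper over those logs (the sample costs are hypothetical):

```typescript
// Nearest-rank percentile over a list of per-task costs (USD).
function percentile(costs: number[], p: number): number {
  const sorted = [...costs].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(idx, 0)];
}

// Hypothetical per-task costs: mostly cheap, with a long expensive tail.
const taskCosts = [0.42, 0.48, 0.51, 0.55, 0.47, 6.9, 0.52, 8.3];
console.log(`p50: $${percentile(taskCosts, 50)}  p95: $${percentile(taskCosts, 95)}`);
```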

2. Context accumulation compounds costs. Each step adds to the context window. Step 1 sends 2K tokens. By step 10, you're sending 20K+ tokens (all previous context plus the new instruction). Without context management — summarization, sliding windows, or RAG — costs grow quadratically. Most production agents use some form of context management, but the default behavior of "append everything" is expensive.
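To see why "append everything" is quadratic rather than linear: step n re-sends everything from steps 1 through n-1 plus its own addition, so total billed input grows with the sum 1 + 2 + … + n. A sketch assuming a flat ~2K tokens added per step:

```typescript
// Cumulative input tokens for an append-everything agent loop.
// Assumes each step appends roughly tokensPerStep to the context.
function totalInputTokens(steps: number, tokensPerStep: number): number {
  let total = 0;
  for (let step = 1; step <= steps; step++) {
    total += step * tokensPerStep; // step n re-sends all n accumulated chunks
  }
  return total; // = tokensPerStep * steps * (steps + 1) / 2, quadratic in steps
}

console.log(totalInputTokens(10, 2_000)); // 110,000 tokens billed, not 20,000
```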

3. Hidden costs beyond LLM tokens. Tool calls cost money: a paid MCP server charges per invocation, an external API (geocoding, search, email) has its own pricing. In workflows with heavy tool use, we've seen LLM inference account for under half of total task cost.

Here's how retries compound these hidden costs: if your agent fails to parse a JSON response and retries, it's not just the extra LLM call. It's the extra API hit ($0.10), the extra MCP tool invocation ($0.05), and the extra context tokens from the failed attempt making the next LLM call more expensive. Three retries on a $0.15 tool call turns it into $0.60 — plus the growing LLM costs.
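Modeled directly, with all per-attempt fees hypothetical:

```typescript
// Cost of a step retried on parse failure. Each attempt re-pays the tool/API
// fee, and each failed attempt bloats the context, making the next LLM call
// more expensive. All dollar amounts below are illustrative.
function retriedStepCost(
  attempts: number,
  toolFee: number,
  llmBaseCost: number,
  contextGrowthCost: number
): number {
  let total = 0;
  for (let i = 0; i < attempts; i++) {
    total += toolFee; // tool/API fee paid on every attempt
    total += llmBaseCost + i * contextGrowthCost; // prompt grows with each failure
  }
  return total;
}

// 1 original try + 3 retries on a $0.15 tool call:
// tool fees alone are 4 x $0.15 = $0.60, plus the growing LLM costs.
console.log(retriedStepCost(4, 0.15, 0.05, 0.02)); // $0.92 for a "fifteen-cent" step
```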

The horror stories

The $47K infinite loop. A multi-agent research tool slipped into a recursive loop. Two agents talked to each other non-stop for 11 days before anyone noticed, burning $47,000 in compute. The post-mortem identified no per-agent budget caps and no anomaly detection on API spend. (Teja Kusireddy, Towards AI)

The production database wipe. Alexey Grigorev was using Claude Code to update the DataTalks.Club platform. A misconfiguration on a new laptop confused the agent about what was "real" vs. safe to delete, and it erased the production database — 2.5 years of student submissions, gone. He wrote in his post-mortem that he had "over-relied on the AI agent" and removed safety checks. (Alexey Grigorev post-mortem, Fortune coverage)

The Replit database deletion during a code freeze. Jason Lemkin (SaaStr founder) documented an experiment where Replit's AI agent made unauthorized changes to live infrastructure during a period when it was explicitly told not to. It wiped data for 1,200+ executives, created 4,000 fake user accounts, generated false system logs, and admitted to "panicking" when confronted. (Fortune, July 2025)

Three different tools (open-source agents, Claude Code, Replit), three different companies, same root cause: unbounded autonomy with no cost guardrails and no kill switch.
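None of these required sophisticated tooling to prevent. A minimal per-task budget cap, sketched below; the class and limits are our own illustration, not any framework's API:

```typescript
// Hard per-task spend cap: every cost-incurring step checks in before proceeding.
class BudgetGuard {
  private spent = 0;
  constructor(private readonly capUsd: number) {}

  charge(costUsd: number): void {
    this.spent += costUsd;
    if (this.spent > this.capUsd) {
      // Kill switch: abort the task instead of looping forever.
      throw new Error(
        `Budget exceeded: $${this.spent.toFixed(2)} > cap $${this.capUsd}`
      );
    }
  }
}

const guard = new BudgetGuard(5.0); // $5 hard cap per task
// ...inside the agent loop, after each LLM or tool call:
guard.charge(0.31);
```

Pair a cap like this with an anomaly alert on aggregate daily spend, and an 11-day loop likely gets caught within hours instead of weeks.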

How to cut agent costs: a worked example

Rather than listing optimization techniques in isolation, here's how they stack on a concrete workflow.

Starting point: A support ticket resolution agent running Claude Sonnet for all 5 steps. No caching. No routing. Cost per task: $1.60 (all LLM).

Step 1: Model routing. Route the classification step (step 1) and summarization step (step 5) to Haiku instead of Sonnet. These are simple tasks where Haiku performs equivalently.

| Step            | Before (Sonnet) | After (routed)      |
|-----------------|-----------------|---------------------|
| Classify ticket | $0.03           | $0.001 (Haiku)      |
| Analyze context | $0.45           | $0.45 (keep Sonnet) |
| Generate draft  | $0.52           | $0.52 (keep Sonnet) |
| Refine response | $0.52           | $0.52 (keep Sonnet) |
| Summarize       | $0.08           | $0.003 (Haiku)      |
| Total           | $1.60           | $1.49               |

Savings: 7%. Modest — because only the cheap steps were routed.
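The routing itself can be as simple as a static step-to-model map; the model names here are illustrative:

```typescript
// Static model routing: cheap model for mechanical steps, strong model
// for the steps that need judgment. Model names are illustrative.
type Step = "classify" | "analyze" | "draft" | "refine" | "summarize";

const MODEL_FOR_STEP: Record<Step, string> = {
  classify: "claude-haiku",  // simple labeling: cheap model suffices
  analyze: "claude-sonnet",  // needs judgment: keep the strong model
  draft: "claude-sonnet",
  refine: "claude-sonnet",
  summarize: "claude-haiku", // mechanical compression: cheap model suffices
};

function modelFor(step: Step): string {
  return MODEL_FOR_STEP[step];
}
```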

Step 2: Prompt caching. The system prompt (tool definitions, knowledge base excerpt, response guidelines) is 6,000 tokens and identical across all 5 calls. With Anthropic's prompt caching (90% off on cache hits), those 6,000 tokens bill at 10% of the normal input rate on every call after the first.

Before caching: 5 calls × 6,000 system tokens × $3/M = $0.09 in system prompt costs. After caching: 1 full price + 4 cache hits = $0.018 + $0.0072 = $0.025.

Applied to the full workflow: $1.49 → $1.42. Another 5%.
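Enabling this is one extra field on the shared prefix. A sketch using the Anthropic TypeScript SDK's documented cache_control option; the model ID and prompt contents are illustrative:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// The 6K-token shared prefix: tool definitions, KB excerpt, guidelines.
const SYSTEM_PROMPT = "...shared system prompt text...";

async function callWithCachedSystem(userText: string) {
  return client.messages.create({
    model: "claude-sonnet-4-5", // illustrative model ID
    max_tokens: 1024,
    // Marking the shared prefix with cache_control bills these tokens at the
    // cached rate on subsequent calls instead of full price.
    system: [
      { type: "text", text: SYSTEM_PROMPT, cache_control: { type: "ephemeral" } },
    ],
    messages: [{ role: "user", content: userText }],
  });
}
```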

Step 3: Prompt optimization. The response guidelines section has verbose examples and redundant instructions. Cutting it from 2,000 tokens to 800 tokens reduces every call's input size.

Applied to the full workflow: $1.42 → $1.05.

Combined savings: $1.60 → $1.05 (34%). Not the 60-80% you see in headlines — but that's an honest number for a workflow that was already reasonably efficient. The 60-80% numbers come from workflows that start with bloated prompts, no caching, and frontier models on every step. If your baseline is wasteful, the savings are dramatic. If your baseline is already lean, the gains are incremental.

Where the big savings actually come from: If you switch the analysis and generation steps (the expensive ones) to a budget model that handles them at 90%+ quality — say GPT-4o-mini or a self-hosted model — the workflow drops from $1.05 to under $0.20. But that's a quality tradeoff, not a free optimization.

The missing piece: non-LLM cost visibility

You can't optimize what you can't measure. Today's tools (Helicone, Portkey, LangSmith, Langfuse) track per-API-call costs and can aggregate LLM token costs into traces. That's valuable.

What most tools don't track: the non-LLM costs. MCP tool call fees, third-party API charges, compute costs for sandbox execution. In workflows where agents use external tools and services, these can account for over half of total task cost. You see "$0.80 in LLM tokens" but the actual task cost was $1.10 — and in heavier workflows, the gap is much wider.

Without full-stack cost attribution, you can't price your AI features per customer, detect broken retry loops bleeding money on API fees, or know which step in a 15-step workflow is actually expensive.
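Closing that gap starts with a record shape that treats all three cost sources uniformly. A minimal sketch (our own shape, not a standard):

```typescript
// One cost event, attributable to a task and a step: LLM, tool, or API alike.
interface CostEvent {
  taskId: string;
  step: string; // e.g. "generate-draft", "search-kb"
  source: "llm" | "mcp_tool" | "external_api";
  costUsd: number;
}

// Full-stack cost of one task, across every source.
function taskTotal(events: CostEvent[], taskId: string): number {
  return events
    .filter((e) => e.taskId === taskId)
    .reduce((sum, e) => sum + e.costUsd, 0);
}

// Which step in the workflow is actually expensive?
function costByStep(events: CostEvent[]): Map<string, number> {
  const byStep = new Map<string, number>();
  for (const e of events) {
    byStep.set(e.step, (byStep.get(e.step) ?? 0) + e.costUsd);
  }
  return byStep;
}
```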

We're building an open-source tool to solve this — AgentMeter (npm). It tracks per-task costs across LLM calls, tool fees, and API charges. We're early and would love feedback.


Prices sourced from official provider pricing pages as of March 23, 2026. Actual costs vary based on usage patterns, caching, batch discounts, and enterprise agreements.

Track your AI agent costs

Join the waitlist for early access to the hosted dashboard.