Why your AI agent costs 10× what you expected
Agents look cheap in the demo and expensive in production. The gap is almost always one of four things: context bloat, tool-call cascades, retries, or the wrong model. Here's the math.
The first time I shipped an LLM agent feature, the demo cost about three cents per run. The first thousand real users hit it and the bill came back at $340. Per day.
This pattern is common enough that it has a name: agent cost bloat. The gap between the demo cost and the production cost is rarely about the model alone. It's the four things below, and only the last of them is model choice.
1. Context bloat
A demo agent gets a 200-token system prompt and a 50-token user message: 250 input tokens. A production agent gets a 2K-token system prompt, 3K tokens of tool definitions, 8K tokens of retrieved context, and 200 tokens of user message: 13.2K input tokens. Same model, same task, roughly 53× the input cost per call.
Fix: audit your prompt with our token counter. Anything above 5K input tokens for a single-turn agent call needs a justification you can defend.
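As a rough illustration of that audit, here is a minimal sketch that sums the prompt components from the example above and flags a call that exceeds the 5K budget. The component names and the `over_budget` helper are illustrative assumptions, not fields or functions from any SDK or from our token counter.

```python
# Token counts from the example above; the keys are illustrative labels.
DEMO = {"system_prompt": 200, "user_message": 50}
PRODUCTION = {
    "system_prompt": 2_000,
    "tool_definitions": 3_000,
    "retrieved_context": 8_000,
    "user_message": 200,
}

INPUT_BUDGET = 5_000  # single-turn budget suggested in the text


def input_tokens(components: dict[str, int]) -> int:
    """Total input tokens billed for one call."""
    return sum(components.values())


def over_budget(components: dict[str, int], budget: int = INPUT_BUDGET) -> list[str]:
    """Component names, largest first, returned only when the call exceeds the budget."""
    if input_tokens(components) <= budget:
        return []
    return sorted(components, key=lambda name: components[name], reverse=True)


ratio = input_tokens(PRODUCTION) / input_tokens(DEMO)
print(f"demo={input_tokens(DEMO)} production={input_tokens(PRODUCTION)} ratio={ratio:.1f}x")
print("justify, largest first:", over_budget(PRODUCTION))
```

Sorting largest first puts the component that most needs a justification at the top of the list.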
2. Tool-call cascades
You wrote a handful of tools. The agent calls the first. It returns 500 tokens of JSON. The agent reasons about that JSON, then calls a second tool. That tool returns 1,200 tokens. The agent considers, then calls a third. By the time the agent answers, you've made four model calls and accumulated 8K tokens of intermediate state (tool outputs plus the agent's own reasoning), all of it billed as input on every subsequent call.
A typical multi-tool agent run on a real task isn't one call. It's six to twelve. Each one re-reads the entire conversation including all prior tool outputs.
Fix: cap tool calls per session (most agent frameworks support a limit). Compress tool outputs before they go back to the model. For tools that return verbose JSON, post-process to a 1-2 sentence summary before the model sees it.
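A minimal sketch of both fixes, under stated assumptions: `ToolBudget`, `compact_tool_output`, and the two limits are hypothetical names and values, not part of any agent framework. The compaction here is a plain field whitelist plus truncation, a cheaper stand-in for the 1-2 sentence summary described above.

```python
import json

MAX_TOOL_CALLS = 6       # hypothetical per-session cap
MAX_OUTPUT_CHARS = 400   # hypothetical size limit for what the model sees


def compact_tool_output(raw_json: str, keep_fields: list[str]) -> str:
    """Keep only whitelisted top-level fields, then truncate the result."""
    data = json.loads(raw_json)
    if isinstance(data, dict):
        data = {k: data[k] for k in keep_fields if k in data}
    compact = json.dumps(data, separators=(",", ":"))
    if len(compact) > MAX_OUTPUT_CHARS:
        compact = compact[:MAX_OUTPUT_CHARS] + "...[truncated]"
    return compact


class ToolBudget:
    """Counts tool calls in a session and refuses once the cap is reached."""

    def __init__(self, limit: int = MAX_TOOL_CALLS) -> None:
        self.limit = limit
        self.used = 0

    def allow(self) -> bool:
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


raw = json.dumps({"status": "ok", "price": 129.0, "debug": "x" * 2000, "trace_id": "abc"})
print(compact_tool_output(raw, keep_fields=["status", "price"]))
# prints {"status":"ok","price":129.0}
```

In an agent loop, you would check `budget.allow()` before dispatching each tool call and pass only the compacted string back into the conversation.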
3. Retries (silent and otherwise)
Production traffic hits rate limits, timeouts, malformed-JSON parser errors. Your agent framework retries — sometimes once, sometimes with exponential backoff up to five times. Each retry is a full call. Each call bills the full input prompt.
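A 2% retry rate looks fine until you realize each retry burns the same 8K input tokens as the original call. If every retry repeats a single call, that's 1.02× the baseline. But a run is six to twelve calls, so at 2% per call roughly 11% to 22% of runs hit at least one failure, and when a failure restarts the run rather than the call, tokens billed per successful run climb toward 1.10×-1.20×.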
Fix: log every retry as a separate event with its own token count. If your retry rate is above 1%, the underlying error (rate limit, prompt structure, tool schema validation) is the real cost driver — fix that, not the retry policy.
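One way to log this, sketched under assumptions: the `Attempt` record and both helper functions are hypothetical, and the example treats every failed attempt as billed. Whether a failure is actually billed depends on the provider and the failure type, so check that against your own invoices.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attempt:
    run_id: str
    call_index: int            # which model call within the run
    attempt: int               # 0 = first try, 1+ = retry
    input_tokens: int
    succeeded: bool
    error: Optional[str] = None  # e.g. "timeout", "bad_json"


def retry_rate(attempts: list[Attempt]) -> float:
    """Retries divided by first attempts."""
    first_tries = sum(a.attempt == 0 for a in attempts)
    retries = sum(a.attempt > 0 for a in attempts)
    return retries / first_tries


def billed_multiplier(attempts: list[Attempt]) -> float:
    """Input tokens across all attempts divided by input tokens on successful ones."""
    billed = sum(a.input_tokens for a in attempts)
    useful = sum(a.input_tokens for a in attempts if a.succeeded)
    return billed / useful


log = [
    Attempt("r1", 0, 0, 8_000, True),
    Attempt("r1", 1, 0, 8_800, False, "bad_json"),  # failed attempt, assumed billed
    Attempt("r1", 1, 1, 8_800, True),
    Attempt("r2", 0, 0, 8_000, True),
]
print(f"retry rate: {retry_rate(log):.1%}, billed multiplier: {billed_multiplier(log):.2f}x")
```

Grouping the same records by `error` shows which underlying failure is driving the retries.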
4. The wrong model for the task
GPT-4o costs ~17× more than GPT-4o-mini per input token at list prices ($2.50 vs. $0.15 per million). Claude Opus 4.7 costs ~5× Claude Sonnet 4.6. Most agent tasks don't need the flagship.
The smell: you reached for the most capable model because the agent flow was hard to get right. The fix wasn't the model. The fix was the prompt structure, the tool design, or the orchestration. Now you're paying flagship rates for a workload that would run identically on a mid-tier.
Fix: build the agent with a mid-tier model from the start (Sonnet 4.6, GPT-4o-mini, Gemini 2.5 Flash). Upgrade only after you've documented a specific task where the mid-tier fails and the flagship doesn't.
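A minimal sketch of "documenting where the mid-tier fails": run the same cases through both tiers and keep only those the flagship gets right and the mid-tier gets wrong. The `Model` callables and the two stand-in functions are placeholders for illustration, not real API clients, and exact string match is a deliberately crude pass criterion.

```python
from typing import Callable

# A model is represented as a plain callable: prompt in, answer out.
# In real use each would wrap an API client; that wiring is assumed, not shown.
Model = Callable[[str], str]


def flagship_only_cases(
    cases: list[tuple[str, str]],
    mid_tier: Model,
    flagship: Model,
) -> list[str]:
    """Prompts where the mid-tier answer is wrong and the flagship answer is right."""
    needs_flagship = []
    for prompt, expected in cases:
        mid_ok = mid_tier(prompt).strip() == expected
        flagship_ok = flagship(prompt).strip() == expected
        if flagship_ok and not mid_ok:
            needs_flagship.append(prompt)
    return needs_flagship


# Stand-in models, for illustration only.
def fake_mid(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "unknown")


def fake_flagship(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris", "17*23": "391"}.get(prompt, "unknown")


cases = [("2+2", "4"), ("capital of France", "Paris"), ("17*23", "391")]
print(flagship_only_cases(cases, fake_mid, fake_flagship))  # ['17*23']
```

An empty result is the evidence that the workload does not need the upgrade.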
The math, with numbers
A real agent feature we shipped:
- System prompt: 1.8K tokens
- Tool definitions (4 MCP servers' worth): 3.2K tokens
- Retrieved context (RAG): 6.5K tokens
- Per-call tool output added to conversation: avg 800 tokens
- Average model calls per run: 5 (four tool calls, then the final answer)
Tokens billed per run, on Claude Sonnet 4.6 ($3/M input, $15/M output):
- Initial call: 11.5K input + 200 output = $0.038
- Call 2 (after tool 1): 12.3K input + 200 output = $0.040
- Call 3: 13.1K input + 200 output = $0.042
- Call 4: 13.9K input + 200 output = $0.045
- Call 5: 14.7K input + 200 output = $0.047
- Total per run: $0.212
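The per-call figures above can be reproduced with a short script. This is a sketch using only the numbers stated in this section; the constant and function names are ours. It assumes every model call re-reads the full prefix plus all tool outputs so far.

```python
SONNET_INPUT_PER_M = 3.00    # USD per million input tokens, from the text
SONNET_OUTPUT_PER_M = 15.00  # USD per million output tokens, from the text

PREFIX_TOKENS = 1_800 + 3_200 + 6_500  # system prompt + tool definitions + RAG
TOOL_OUTPUT_TOKENS = 800               # average added per tool result
OUTPUT_TOKENS = 200                    # per model call
MODEL_CALLS = 5


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one model call at the rates above."""
    return (
        input_tokens * SONNET_INPUT_PER_M + output_tokens * SONNET_OUTPUT_PER_M
    ) / 1_000_000


def run_costs() -> list[float]:
    """Cost of each call; call n re-reads the prefix plus n earlier tool outputs."""
    return [
        call_cost(PREFIX_TOKENS + n * TOOL_OUTPUT_TOKENS, OUTPUT_TOKENS)
        for n in range(MODEL_CALLS)
    ]


costs = run_costs()
for i, cost in enumerate(costs, start=1):
    print(f"call {i}: ${cost:.4f}")
print(f"total: ${sum(costs):.4f}")
```

Changing `TOOL_OUTPUT_TOKENS` or `MODEL_CALLS` shows how quickly the total grows when tool results are verbose or the run is long.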
With prompt caching enabled on the stable prefix (system + tool defs + RAG = 11.5K cacheable tokens):
- Initial call: 11.5K cache-write ($3.75/M) + 200 output = $0.046
- Calls 2-5: 11.5K cache-read ($0.30/M) + accumulated tool output at the normal input rate + 200 output = $0.009 to $0.016 each, avg $0.012
- Total per run: $0.096, about 55% cheaper
Drop to Claude Haiku 4.5 ($1/M input, $5/M output) where it works, with caching still on:
- Total per run: ~$0.032, another ~67% off, assuming cache rates scale with the base price
Combine: caching + right-sized model on a multi-tool agent = ~6.6× cost reduction from the naive implementation. That's the gap most teams leave on the table.
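To rerun this comparison with your own numbers, here is a hedged sketch. The Sonnet rates are the ones quoted above. The Haiku cache rates are an assumption (the same 1.25× write and 0.1× read ratios applied to its input price). The script also assumes only the 11.5K prefix is cached, so accumulated tool output is billed at the normal input rate; totals shift if you cache more or less than that.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens."""
    input_per_m: float
    output_per_m: float
    cache_write_per_m: float
    cache_read_per_m: float


SONNET = Pricing(3.00, 15.00, 3.75, 0.30)
# ASSUMPTION: Haiku cache rates derived from the Sonnet ratios, not from the text.
HAIKU = Pricing(1.00, 5.00, 1.25, 0.10)

PREFIX, TOOL_OUT, OUTPUT, CALLS = 11_500, 800, 200, 5


def run_cost(p: Pricing, cached: bool) -> float:
    """USD cost of one run of CALLS model calls."""
    total = 0.0
    for n in range(CALLS):
        if not cached:
            prefix_rate = p.input_per_m
        elif n == 0:
            prefix_rate = p.cache_write_per_m   # first call writes the cache
        else:
            prefix_rate = p.cache_read_per_m    # later calls read it
        total += PREFIX * prefix_rate
        total += n * TOOL_OUT * p.input_per_m   # accumulated tool output, uncached
        total += OUTPUT * p.output_per_m
    return total / 1_000_000


naive = run_cost(SONNET, cached=False)
for label, cost in [
    ("Sonnet, no cache", naive),
    ("Sonnet, cached", run_cost(SONNET, cached=True)),
    ("Haiku, cached", run_cost(HAIKU, cached=True)),
]:
    print(f"{label}: ${cost:.4f} ({naive / cost:.1f}x cheaper than naive)")
```

Keeping the rates in a `Pricing` record makes it a one-line change to test another model or an updated price list.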
What to actually measure
- Tokens per successful run (not per call). Includes all tool calls, all retries.
- Cache hit rate on the stable prefix. Aim for 70%+.
- Tool calls per run distribution. The long tail is where the cost is.
- Cost per business outcome (per booking, per ticket resolved, per workflow shipped). Tokens are a means.
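The four metrics above can be computed from per-run logs. This is a minimal sketch, assuming you already record one row per run. `RunRecord` and its fields are hypothetical names, not the schema of any logging tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RunRecord:
    succeeded: bool
    input_tokens: int         # summed over all calls and retries in the run
    output_tokens: int
    model_calls: int
    cached_prefix_reads: int  # model calls that read the prefix from cache
    tool_calls: int
    cost_usd: float
    outcomes: int             # e.g. bookings made or tickets resolved


def summarize(runs: list[RunRecord]) -> dict:
    successes = sum(r.succeeded for r in runs)
    total_tokens = sum(r.input_tokens + r.output_tokens for r in runs)
    total_calls = sum(r.model_calls for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    outcomes = sum(r.outcomes for r in runs)
    return {
        # Failed runs still cost tokens, so they stay in the numerator.
        "tokens_per_successful_run": total_tokens / successes if successes else None,
        "cache_hit_rate": (
            sum(r.cached_prefix_reads for r in runs) / total_calls if total_calls else None
        ),
        "tool_calls_distribution": Counter(r.tool_calls for r in runs),
        "cost_per_outcome": total_cost / outcomes if outcomes else None,
    }


runs = [
    RunRecord(True, 60_000, 1_000, 5, 4, 4, 0.10, 1),
    RunRecord(False, 30_000, 400, 2, 1, 1, 0.05, 0),
    RunRecord(True, 150_000, 2_400, 12, 10, 11, 0.25, 1),
]
for name, value in summarize(runs).items():
    print(name, value)
```

The tool-call distribution is kept as a full `Counter` rather than an average so the long tail stays visible.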
Our prompt caching post covers caching mechanics in detail. The five-element prompt checklist is the structural pattern that makes caching work.
If you're shipping an agent and the cost feels off, run it through the token counter end-to-end — including system prompt + tool definitions — before you assume the model is the problem.