7 min read · llms / cost / fundamentals

Prompt caching changes the cost math. Most teams aren't using it.

Anthropic and OpenAI ship prompt caching that cuts repeated input-token cost by 50-90%. What changes when you turn it on — and why most teams haven't.

Most LLM cost analysis still assumes every input token gets billed full freight. That assumption was correct in 2023. It isn't now.

Anthropic shipped prompt caching in August 2024. OpenAI followed in October 2024. By 2026 every major API provider supports some form of it. The economics: input that costs $3 per million tokens on Claude Sonnet costs $0.30 per million when it's a cache hit. That is a 90% discount on the part of the prompt that doesn't change between calls.

If you're running a feature that hits the API repeatedly with a stable prefix (a system prompt, a long context document, a set of tool definitions, a retrieval corpus) and you haven't turned on caching, you're paying up to 10× what you should be for those tokens.

How each provider does it

The mechanics differ, and the difference matters for how you structure prompts.

Anthropic (Claude)

Caching is opt-in via cache_control markers in the request body. You explicitly tag where the cacheable prefix ends:

{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "You are a code reviewer. ...long context...",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [{ "role": "user", "content": "Review this diff: ..." }]
}

The first call writes to the cache — billed at 125% of the normal input rate. Every subsequent call within the 5-minute TTL reads from the cache at 10% of normal. The break-even point is roughly two calls. Beyond that, you're saving.
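The break-even claim can be checked with a little arithmetic. The following is a minimal sketch, assuming the multipliers quoted above (1.25× for a cache write, 0.10× for a cache read) and that every call after the first lands inside the TTL. The function name is illustrative, not part of any SDK.

```python
def relative_prefix_cost(calls: int, write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Cost of the cached prefix over `calls` requests, as a fraction of the uncached cost.

    Assumes one cache write followed by (calls - 1) cache reads, all inside the TTL.
    """
    if calls < 1:
        raise ValueError("calls must be at least 1")
    cached = write_mult + (calls - 1) * read_mult
    uncached = float(calls)  # each uncached call pays 1.0x for the prefix
    return cached / uncached


for n in (1, 2, 3, 10):
    print(f"{n:>2} calls: {relative_prefix_cost(n):.3f} of uncached prefix cost")
```

A single call costs 25% more than it would uncached. By the second call the prefix has cost about two thirds of the uncached total, and the ratio keeps falling toward the 10% read price as calls accumulate.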

Anthropic also supports a 1-hour cache tier (introduced in 2025) for prompts that get re-used across longer time windows. Writes on that tier are billed at 2× the normal input rate instead of 1.25×; reads stay at 10%.

OpenAI (GPT-4o and newer models)

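Caching is automatic, with no opt-in flag. OpenAI matches on the prefix of every prompt of 1,024 tokens or more; if the same prefix appeared in a recent call (typically within 5-10 minutes), the cached tokens are billed at a discount: 50% of normal on GPT-4o, with the exact rate varying by model. There is no write surcharge.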

The discount is smaller than Anthropic's (cache hits bill at 50% of normal versus 10%) but the configuration is zero. You don't have to think about it; it just happens. The only requirement: keep the cacheable content at the start of the prompt, with dynamic content at the end.

Gemini, DeepSeek, others

Gemini supports explicit caching via a separate /cachedContents endpoint. DeepSeek bills cache hits at ~10% like Anthropic, but applies caching automatically, with no breakpoints to set. The pattern is converging.

What changes when you turn it on

Take a real workflow: a code-review assistant that runs against a 15K-token codebase context with a 500-token system prompt, asked to review a 200-token diff.

Without caching:

  • Per call: 15,700 input tokens × $3/M = $0.047
  • 100 calls/day: $4.70/day

With Anthropic caching (5-min tier):

  • First call: 15,500 cached-write + 200 normal = (15,500 × $3.75/M) + (200 × $3/M) = $0.059
  • Calls 2-N within 5 min: (15,500 × $0.30/M) + (200 × $3/M) = $0.005
  • If 90% of calls hit cache: avg per call ≈ $0.011 (~4.4× cheaper)

For a feature shipping at scale, that's the difference between roughly $5K/month and a little over $1K/month on Claude alone. The savings grow as the cached prefix takes up more of the prompt and the hit rate rises, but they top out just under 10×, because a cache read still costs 10% of the normal rate.
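The worked example can be reproduced in a few lines. This is a sketch under the prices used in the example ($3/M input, $3.75/M cache write, $0.30/M cache read), with the added assumption that every cache miss pays the write price. The function names are illustrative.

```python
PRICE_INPUT = 3.00 / 1_000_000  # $ per normal input token (figure from the example)
PRICE_WRITE = 3.75 / 1_000_000  # $ per cache-write token (125% of input)
PRICE_READ = 0.30 / 1_000_000   # $ per cache-read token (10% of input)


def uncached_cost(prefix_tokens: int, dynamic_tokens: int) -> float:
    """Input cost of one call with caching off."""
    return (prefix_tokens + dynamic_tokens) * PRICE_INPUT


def cost_per_call(prefix_tokens: int, dynamic_tokens: int, hit_rate: float) -> float:
    """Average input cost per call when `hit_rate` of calls read the cached prefix.

    Assumption: every miss rewrites the cache and pays the write price.
    """
    dynamic = dynamic_tokens * PRICE_INPUT
    hit = prefix_tokens * PRICE_READ + dynamic
    miss = prefix_tokens * PRICE_WRITE + dynamic
    return hit_rate * hit + (1 - hit_rate) * miss


base = uncached_cost(15_500, 200)
cached = cost_per_call(15_500, 200, hit_rate=0.90)
print(f"uncached ${base:.4f}, cached ${cached:.4f}, ratio {base / cached:.1f}x")
```

Varying `hit_rate` and `prefix_tokens` shows how sensitive the ratio is to both.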

The non-obvious gotchas

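Cache invalidates on any change to the prefix. Add a timestamp to your system prompt and you defeat caching entirely. On Anthropic the prefix is assembled in the order tools, system, messages, so tool definitions sit ahead of the system prompt: they cache as long as their order and content are stable, and changing them invalidates everything after them.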
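To see why a timestamp in the wrong place is so costly, compare how much of two consecutive prompts is identical from the first character. This is only a rough illustration: real caches match on tokens rather than characters, and the prompt text and helper names here are hypothetical.

```python
import os.path

# Hypothetical stable system prompt, used only for this illustration.
SYSTEM = "You are a code reviewer. Follow the team style guide."


def shared_prefix_len(a: str, b: str) -> int:
    """Length in characters of the longest common prefix of two prompts."""
    return len(os.path.commonprefix([a, b]))


def timestamp_first(ts: str, diff: str) -> str:
    """Dynamic timestamp ahead of the stable text: the prefix changes every call."""
    return f"[{ts}] {SYSTEM}\n{diff}"


def timestamp_last(ts: str, diff: str) -> str:
    """Stable text first, everything dynamic after it."""
    return f"{SYSTEM}\n{diff}\n[{ts}]"


bad = shared_prefix_len(
    timestamp_first("2026-01-01T10:00:00", "diff A"),
    timestamp_first("2026-01-01T10:00:07", "diff B"),
)
good = shared_prefix_len(
    timestamp_last("2026-01-01T10:00:00", "diff A"),
    timestamp_last("2026-01-01T10:00:07", "diff B"),
)
print(f"timestamp first: {bad} shared chars; timestamp last: {good} shared chars")
print(f"stable text is {len(SYSTEM)} chars")
```

With the timestamp first, the two prompts diverge before the stable text is fully covered, so there is nothing useful to reuse. With it last, the whole stable text sits inside the shared prefix.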

Cache is scoped to your account (the organization or workspace, depending on the provider) and to the model, not to the end-user. Two different end-users hitting the same system prompt through your account benefit from each other's writes. This is great for SaaS workloads, neutral for single-user apps.

The 5-minute TTL is sliding, not fixed. Every cache hit resets the clock. So a moderately busy feature (at least one call every five minutes) keeps the cache warm indefinitely. A burst-then-idle pattern pays for a fresh write after every idle gap.
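The difference between steady and bursty traffic is easy to simulate. The sketch below models a sliding TTL in a simplified way: each call either reads a live cache entry or writes a new one, and both restart the clock. It is a toy model of the behaviour described here, not provider code, and the traffic patterns are invented for illustration.

```python
from typing import Iterable, List, Optional

TTL_SECONDS = 5 * 60  # the 5-minute tier


def simulate_sliding_ttl(call_times: Iterable[float], ttl: float = TTL_SECONDS) -> List[str]:
    """Label each call 'write' or 'read' under a sliding TTL.

    `call_times` are seconds since an arbitrary start, in ascending order.
    A call after expiry writes the cache again.
    """
    labels: List[str] = []
    expires_at: Optional[float] = None
    for t in call_times:
        if expires_at is not None and t <= expires_at:
            labels.append("read")
        else:
            labels.append("write")
        expires_at = t + ttl  # both reads and writes restart the clock
    return labels


steady = [i * 240 for i in range(6)]  # one call every 4 minutes
bursty = [0, 10, 20, 900, 910, 920]   # burst, 15 idle minutes, burst

print("steady:", simulate_sliding_ttl(steady))
print("bursty:", simulate_sliding_ttl(bursty))
```

The steady pattern pays for one write and then reads for as long as traffic continues. The bursty pattern pays for a second write after the idle gap.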

Token counters lie about cache cost. Off-the-shelf counters price every input token at the full rate. Our token counter shows the cached cost separately, so you can see what you're actually paying once caching is on. Worth running your highest-volume prompt through it before deploying.

MCP server tool definitions count. If you're using MCP via Claude Code or Claude Desktop, the tool schemas from each server get injected into every request. They're stable across calls and absolutely should be cached. Our MCPHub detail pages document the rough token weight of each server's tool definitions in the install notes.

When caching doesn't help

  • One-shot prompts with no repeated prefix. The whole point is amortising the write cost across many reads.
  • Prompts where the dynamic content comes first. Refactor before you enable caching: move the stable bits (system prompt, schema, examples) to the top and the dynamic input to the bottom.
  • Tiny prompts (under 1,024 tokens for most Claude models, with a higher minimum on some; OpenAI has the same 1,024-token floor). Below the minimum, caching doesn't apply.
  • Workloads with prefix variance. If your "stable" prefix changes per user (e.g., personalised context), you're not getting cache hits unless the variance is in the suffix.

What to actually do

  1. Audit your hottest API path. What's the system prompt + context length, and how often does it get re-used unchanged?
  2. For Anthropic: add cache_control: { type: "ephemeral" } at the boundary between stable and dynamic content. One-line change.
  3. For OpenAI: reorder your prompt so anything dynamic is at the end. Cache hits become automatic.
  4. Measure: log usage.cache_creation_input_tokens and usage.cache_read_input_tokens (Anthropic) or usage.prompt_tokens_details.cached_tokens (OpenAI). Calculate your actual hit rate. Aim for 70%+.
  5. Refactor prompts if hit rate is low. Most often it's because something dynamic — a timestamp, a session ID, a user-specific token count — leaked into the prefix.
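Step 4 can be sketched as two small helpers. They assume the usage object has been converted to a plain dict, use the field names listed in step 4 plus `input_tokens` and `prompt_tokens` for the totals, and compute a token-level hit rate (the share of input tokens served from cache) rather than a per-call one. The sample values are made up for illustration.

```python
from typing import Any, Mapping


def anthropic_hit_rate(usage: Mapping[str, Any]) -> float:
    """Share of input tokens read from cache in one Anthropic response.

    Total input is cache reads + cache writes + uncached `input_tokens`.
    """
    read = usage.get("cache_read_input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    uncached = usage.get("input_tokens", 0)
    total = read + written + uncached
    return read / total if total else 0.0


def openai_hit_rate(usage: Mapping[str, Any]) -> float:
    """Share of prompt tokens served from cache in one OpenAI response.

    `prompt_tokens` already includes the cached tokens.
    """
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    total = usage.get("prompt_tokens", 0)
    return cached / total if total else 0.0


# Made-up sample payloads, shaped like the usage fields named in step 4.
anthropic_usage = {
    "input_tokens": 200,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 15_500,
}
openai_usage = {"prompt_tokens": 15_700, "prompt_tokens_details": {"cached_tokens": 15_488}}

print(f"Anthropic: {anthropic_hit_rate(anthropic_usage):.1%}")
print(f"OpenAI:    {openai_hit_rate(openai_usage):.1%}")
```

Summing the numerators and denominators across a day of logged responses, rather than averaging per-call rates, gives a figure that lines up with what shows on the bill.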

The economic gap between teams that have implemented caching and teams that haven't can reach 5-10× on input costs at scale, wherever cache reads bill at 10% of the normal rate. By 2026 there's no good reason to be in the second bucket.

Further reading

  • Token counter — see input + cached cost side by side for any prompt
  • Prompt refiner — structure your prompt so the stable parts cache cleanly
  • Five-element prompt checklist — the role/context/task/constraints/output pattern naturally puts the stable prefix at the top
  • MCPHub — tool definitions from MCP servers add tokens to every request; cache them