Token counting in 2026: every tokenizer compared
GPT-4o, Claude, Gemini, Llama, DeepSeek — each uses a different tokenizer, and the same prompt costs different amounts on each. Here's the comparison that matters for budgeting.
The same paragraph of text tokenizes differently on every LLM you'd use in production. Sometimes the difference is 5%; sometimes it's 30%. For budgeting, prompt design, and context-window planning, knowing which models bias which way is useful.
Here's the comparison, with the numbers.
What tokenization actually does
Modern LLMs don't operate on characters or words. They operate on tokens — subword units learned from training data. A tokenizer reads input text and breaks it into the model's specific vocabulary of tokens, each of which maps to an integer.
Most tokenizers in production use byte-pair encoding (BPE) or its variants. BPE starts with a base alphabet (often raw bytes), then iteratively merges the most-frequent pair of adjacent symbols until reaching a target vocabulary size. The merges are learned from training data, so common words become single tokens; rare words become multiple tokens; URLs and code punctuation tokenize into many tokens.
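The merge loop described above can be sketched in a few lines. This is a toy illustration, not any provider's tokenizer: the function names (merge_pair, train_bpe, encode) and the tiny corpus are made up for the example, and it skips the regex pre-splitting that production tokenizers apply before merging.

```python
from collections import Counter


def merge_pair(symbols: list[bytes], left: bytes, right: bytes) -> list[bytes]:
    """Replace each adjacent (left, right) pair with the single symbol left+right."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(corpus: str, num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn up to num_merges merge rules, starting from raw UTF-8 bytes."""
    symbols = [bytes([b]) for b in corpus.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so there is nothing left worth merging
        merges.append((left, right))
        symbols = merge_pair(symbols, left, right)
    return merges


def encode(text: str, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Tokenize text by replaying the learned merges in the order they were learned."""
    symbols = [bytes([b]) for b in text.encode("utf-8")]
    for left, right in merges:
        symbols = merge_pair(symbols, left, right)
    return symbols


# Hypothetical toy corpus; real vocabularies are trained on far larger data.
merges = train_bpe("low low low lower lowest", num_merges=2)
print(encode("low", merges))   # frequent word: one token
print(encode("slow", merges))  # unseen prefix stays a separate byte
```

In this sketch the vocabulary size is 256 base bytes plus the number of merges, which is why "target vocabulary size" and "number of merges" are two views of the same knob.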
Different training corpora produce different merges. A tokenizer trained heavily on English text and code will tokenize function as one token; a tokenizer trained mostly on Chinese text may split the same 8-character word into 4-5 tokens.
The tokenizers in production (2026)
| Provider | Tokenizer | Vocab size | Notes |
|---|---|---|---|
| OpenAI GPT-4o family | o200k_base | 200,000 | Successor to cl100k_base; better non-English support |
| OpenAI o-series (reasoning) | o200k_base | 200,000 | Same as GPT-4o |
| Anthropic Claude 4.x | Custom BPE | ~100K-150K | Not publicly published; behaves close to cl100k_base |
| Google Gemini 2.x | SentencePiece (BPE flavor) | ~256,000 | Vocabulary heavily multilingual |
| Meta Llama 3.x | tiktoken-style BPE | ~128,000 | Public; tokenizer file ships with the model in tiktoken format |
| DeepSeek | tiktoken-style BPE | ~100,000 | Public |
Same text, different token counts
Consider this 44-character sentence: "The quick brown fox jumps over the lazy dog."
| Tokenizer | Tokens |
|---|---|
| o200k_base (GPT-4o) | 11 |
| cl100k_base (older GPT) | 11 |
| Claude (estimated) | 11 |
| Gemini SentencePiece | 10 |
| Llama 3 | 11 |
For plain English prose, the spread is tiny — within ~10%.
Now consider a piece of JSON: {"name":"Alice","age":30,"email":"alice@example.com"}
| Tokenizer | Tokens |
|---|---|
| o200k_base | 17 |
| cl100k_base | 18 |
| Claude (estimated) | 16 |
| Gemini | 21 |
| Llama 3 | 19 |
Wider spread (~30%). Quoted strings and punctuation tokenize differently.
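For emoji-heavy or non-Latin text the spread can be 2-3×. A Chinese character such as "你" is a single token in a vocabulary with broad CJK coverage, such as Gemini's, but a byte-level BPE vocabulary that never learned a merge for it falls back to one token per UTF-8 byte, which is 3 tokens for that character.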
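One way to see why non-Latin text and emoji cost more: a byte-level BPE tokenizer that has learned no merge for a character falls back to one token per UTF-8 byte. The sketch below assumes that byte-level fallback, so the byte count is a worst case for such tokenizers rather than a measured token count. The helper name utf8_byte_count is illustrative.

```python
def utf8_byte_count(text: str) -> int:
    """Number of UTF-8 bytes in text.

    Under the byte-level-fallback assumption, this is the most tokens the
    text can cost when the vocabulary has no merges covering it.
    """
    return len(text.encode("utf-8"))


for sample in ["a", "é", "你", "🙂"]:
    print(repr(sample), utf8_byte_count(sample))
```

ASCII letters are 1 byte, accented Latin letters 2, most CJK characters 3, and most emoji 4, which is where a 2-3× spread between vocabularies with and without good coverage can come from.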
What this means for budgeting
Three things to internalize:
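1. The model that's cheapest per token isn't always cheapest per task. If Llama 3 at $0.50/M input tokens uses 20% more tokens than GPT-4o-mini at $0.15/M input on your specific workload, its effective price is $0.60 per million GPT-4o-mini-equivalent tokens, so the real gap is wider than the list prices suggest. The same multiplier can also erase a small price advantage, so compare cost per task, not price per token.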
2. Long context windows aren't equally long. A 128K-token context window in GPT-4o holds different amounts of text than a 128K-token window in Gemini. For multilingual or code-heavy contexts, Gemini's vocabulary often packs more useful text per token. For dense English prose, GPT-4o and Claude are close to equivalent.
3. Approximations are good enough for budgeting; exact counts matter only at hard limits. A 4-char-per-token heuristic is within ~10% of true counts for English. Our token counter uses this approximation to give you cross-model estimates without shipping 10MB of vocab files to every browser visit.
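The first and third points can be sketched together: estimate tokens with the 4-characters-per-token rule of thumb, then scale by how many more tokens a given tokenizer needs. This is a rough budgeting sketch, not an exact counter; the function names and the 10,000-token baseline task are hypothetical, and the prices are the ones quoted above.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb from the text; reasonable for English prose only


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Approximate token count from character length."""
    return math.ceil(len(text) / chars_per_token)


def input_cost_per_task(
    baseline_tokens: int,
    token_multiplier: float,
    price_per_million_usd: float,
) -> float:
    """Input cost in USD for one task.

    token_multiplier is how many tokens this model's tokenizer needs
    relative to the baseline count (1.2 means 20% more tokens).
    """
    return baseline_tokens * token_multiplier * price_per_million_usd / 1_000_000


sentence = "The quick brown fox jumps over the lazy dog."
print(estimate_tokens(sentence))  # 44 characters / 4

# Hypothetical task that costs 10,000 input tokens on the baseline tokenizer.
baseline = 10_000
print(f"${input_cost_per_task(baseline, 1.0, 0.15):.4f}")  # $0.15/M, baseline count
print(f"${input_cost_per_task(baseline, 1.2, 0.50):.4f}")  # $0.50/M, 20% more tokens
```

For the sample sentence the heuristic gives 11, which matches the English-prose table above; for JSON, code, or non-Latin text the multiplier has to be measured on your own workload.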
The hidden cost: tool definitions
Most discussion of token counts focuses on user input. The bigger source of cost variance in production is tool definitions.
A typical MCP server defines 5-15 tools, each with a name, a description, and a JSON Schema for inputs. Across the popular MCP servers (GitHub, Postgres, Slack, Notion, Brave Search), tool definitions easily reach 4-6K tokens. Those tokens get injected into every API call before any real prompt content.
If you're running an agent that makes 100 calls per day with 5 MCP servers connected, that's 4K-6K tokens × 100 calls = 400K-600K tokens of tool definitions per day, all billed as input. On Claude Sonnet 4.6 at $3/M input, that's $1.20-1.80/day just for tool advertisement.
The fix is prompt caching. Tool definitions are stable across calls and sit at the top of the prompt — perfect cache candidates. With caching enabled, the per-call cost drops to ~10% (Anthropic) or 50% (OpenAI) of the uncached rate.
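The arithmetic for tool-definition overhead, with and without caching, fits in a short sketch. It uses the figures quoted above (4K-6K tokens, 100 calls per day, $3/M input, cached reads at roughly 10% of the uncached rate). The function name is illustrative, and the model is deliberately simplified: it ignores cache-write surcharges, cache expiry, and the first uncached call.

```python
def daily_tool_overhead_usd(
    tool_tokens: int,
    calls_per_day: int,
    price_per_million_usd: float,
    cached_price_ratio: float = 1.0,
) -> float:
    """Daily input cost of re-sending tool definitions on every call.

    cached_price_ratio is the fraction of the normal input price paid for
    cached tokens (1.0 = no caching, 0.10 = cached reads at 10%).
    Simplification: every call is treated as a cache hit when ratio < 1.
    """
    tokens_per_day = tool_tokens * calls_per_day
    return tokens_per_day / 1_000_000 * price_per_million_usd * cached_price_ratio


for tool_tokens in (4_000, 6_000):
    uncached = daily_tool_overhead_usd(tool_tokens, 100, 3.00)
    cached = daily_tool_overhead_usd(tool_tokens, 100, 3.00, cached_price_ratio=0.10)
    print(f"{tool_tokens} tokens: ${uncached:.2f}/day uncached, ${cached:.2f}/day cached")
```

At these volumes the absolute numbers are small; the same function shows how quickly they grow when calls_per_day moves from 100 to tens of thousands.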
What to measure
For any production LLM feature, log per call:
- input_tokens total
- output_tokens total
- cache_read_input_tokens (Anthropic) or cached_tokens (OpenAI) for cached portion
- cache_creation_input_tokens (Anthropic only) for cache-write events
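Your cache hit rate is the share of all prompt tokens that were served from cache. On Anthropic, input_tokens counts only the uncached portion, so divide cache_read_input_tokens by the sum of input_tokens, cache_read_input_tokens, and cache_creation_input_tokens; on OpenAI, cached_tokens is already a subset of the prompt token total. Aim for 70%+ on workflows with stable system prompts or RAG contexts. Anything below 50% means the prefix is varying where it shouldn't be.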
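A minimal sketch of the hit-rate calculation from logged usage numbers. It takes plain integers rather than SDK response objects, and it rests on one accounting assumption worth checking against each provider's current documentation: that Anthropic's input_tokens excludes cached tokens, while OpenAI's cached_tokens is included in the prompt token total. The function names are illustrative.

```python
def anthropic_cache_hit_rate(
    input_tokens: int,
    cache_read_input_tokens: int,
    cache_creation_input_tokens: int = 0,
) -> float:
    """Share of prompt tokens read from cache.

    Assumes input_tokens is the uncached remainder, so the full prompt is
    the sum of all three fields.
    """
    total = input_tokens + cache_read_input_tokens + cache_creation_input_tokens
    return cache_read_input_tokens / total if total else 0.0


def openai_cache_hit_rate(prompt_tokens: int, cached_tokens: int) -> float:
    """Share of prompt tokens read from cache.

    Assumes cached_tokens is already counted inside prompt_tokens.
    """
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0


# Hypothetical logged values for a call with a 5,000-token prompt.
print(anthropic_cache_hit_rate(input_tokens=500, cache_read_input_tokens=4_500))
print(openai_cache_hit_rate(prompt_tokens=5_000, cached_tokens=4_500))
```

Averaging this per-call ratio over a day, weighted by prompt size, gives the number to compare against the 70% target.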
Tools
Our token counter takes any prompt and shows per-model token counts and per-call cost for OpenAI (GPT-4o family), Anthropic (Claude 4.x), Google (Gemini 2.5), Meta (Llama 3.3), and DeepSeek. The counts are approximations within ~5-10% of true tokenizer output; for exact counts at the edge of context windows, use the model provider's official tokenizer library.
For prompt engineering that maximises cache hit rate, the five-element prompt checklist is the structural pattern that puts stable content first — exactly what caching needs.