March 2, 2026 · 5 min read · llms / tokens / cost

Token counting in 2026: every tokenizer compared

GPT-4o, Claude, Gemini, Llama, DeepSeek — each uses a different tokenizer, and the same prompt costs different amounts on each. Here's the comparison that matters for budgeting.

The same paragraph of text tokenizes differently on every LLM you'd use in production. Sometimes the difference is 5%; sometimes it's 30%. For budgeting, prompt design, and context-window planning, it helps to know which way each tokenizer leans.

Here's the comparison, with the numbers.

What tokenization actually does

Modern LLMs don't operate on characters or words. They operate on tokens — subword units learned from training data. A tokenizer reads input text and breaks it into the model's specific vocabulary of tokens, each of which maps to an integer.

Most tokenizers in production use byte-pair encoding (BPE) or its variants. BPE starts with a base alphabet (often raw bytes), then iteratively merges the most-frequent pair of adjacent symbols until reaching a target vocabulary size. The merges are learned from training data, so common words become single tokens; rare words become multiple tokens; URLs and code punctuation tokenize into many tokens.
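The merge loop described above can be sketched in a few lines. This is a toy illustration, not any provider's actual training code: it works on raw UTF-8 bytes, skips the regex pre-splitting that production tokenizers apply before merging, and the function names (`train_bpe`, `encode`, `apply_merge`) are made up for this example.

```python
from collections import Counter


def apply_merge(symbols: list[bytes], left: bytes, right: bytes) -> list[bytes]:
    """Replace each adjacent (left, right) pair with the single symbol left + right."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(text: str, num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn up to `num_merges` merges, starting from one symbol per UTF-8 byte."""
    symbols = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not shorten anything
        merges.append((left, right))
        symbols = apply_merge(symbols, left, right)
    return merges


def encode(text: str, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Tokenize `text` by replaying the learned merges in training order."""
    symbols = [bytes([b]) for b in text.encode("utf-8")]
    for left, right in merges:
        symbols = apply_merge(symbols, left, right)
    return symbols


corpus = "low low low lower lowest"
merges = train_bpe(corpus, num_merges=2)
print(encode("low", merges))  # the frequent word collapses into one token
print(encode("wol", merges))  # an unseen ordering stays as three byte tokens
```

Even at this scale the effect described above shows up: text that resembles the training corpus gets short encodings, and anything else falls back to many small tokens.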

Different training corpora produce different merges. A tokenizer trained heavily on English and code will likely encode function as one token; a tokenizer trained mostly on Chinese text may split the same 8-character word into 4-5 tokens.

The tokenizers in production (2026)

Provider	Tokenizer	Vocab size	Notes
OpenAI GPT-4o family	`o200k_base`	200,000	Successor to `cl100k_base`; better non-English support
OpenAI o-series (reasoning)	`o200k_base`	200,000	Same as GPT-4o
Anthropic Claude 4.x	Custom BPE	~100K-150K (estimate)	Not published; behaves close to `cl100k_base`
Google Gemini 2.x	SentencePiece (BPE flavor)	~256,000	Vocabulary heavily multilingual
Meta Llama 3.x	tiktoken-style BPE	128,000	Public via `tiktoken`
DeepSeek	tiktoken-style BPE	~100,000	Public

Same text, different token counts

Consider this 44-character snippet: "The quick brown fox jumps over the lazy dog."

Tokenizer	Tokens
`o200k_base` (GPT-4o)	11
`cl100k_base` (older GPT)	11
Claude (estimated)	11
Gemini SentencePiece	10
Llama 3	11

For plain English prose, the spread is tiny — within ~10%.

Now consider a piece of JSON: {"name":"Alice","age":30,"email":"alice@example.com"}

Tokenizer	Tokens
`o200k_base`	17
`cl100k_base`	18
Claude (estimated)	16
Gemini	21
Llama 3	19

Wider spread (~30%). Quoted strings and punctuation tokenize differently.

For emoji-heavy or non-Latin text the spread can be 2-3×. The Chinese character "你" can be 1 token in a heavily multilingual vocabulary such as Gemini's, but up to 3 tokens (one per UTF-8 byte) in a byte-level BPE vocabulary that learned no merge for it.
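A quick way to see why such text is expensive, assuming a byte-level BPE tokenizer: a character with no learned merge falls back to its raw bytes, so its UTF-8 byte count is an upper bound on its token count. The helper name below is illustrative only.

```python
def utf8_byte_counts(text: str) -> list[tuple[str, int]]:
    """Return each character paired with the number of UTF-8 bytes it occupies."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


# An ASCII letter, a Chinese character, and an emoji.
for ch, n_bytes in utf8_byte_counts("a你🙂"):
    print(f"{ch!r}: {n_bytes} byte(s)")
```

An ASCII letter is one byte, while the Chinese character is three and the emoji four, so the worst case per character grows accordingly.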

What this means for budgeting

Three things to internalize:

1. Price per token isn't the same as price per task. If Llama 3 at $0.50/M input tokens uses 20% more tokens than GPT-4o-mini at $0.15/M input on your specific workload, the effective price gap is 4×, not the 3.3× the list prices suggest. The same effect can run the other way: a model that is cheaper per token can lose part of its advantage by needing more tokens for the same text.

2. Long context windows aren't equally long. A 128K-token context window in GPT-4o holds different amounts of text than a 128K-token window in Gemini. For multilingual or code-heavy contexts, Gemini's vocabulary often packs more useful text per token. For dense English prose, GPT-4o and Claude are close to equivalent.

3. Approximations are good enough for budgeting; exact counts matter only at hard limits. A 4-char-per-token heuristic is within ~10% of true counts for English. Our token counter uses this approximation to give you cross-model estimates without shipping 10MB of vocab files to every browser visit.
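The first and third points reduce to simple arithmetic. Here is a minimal sketch using the figures quoted above ($0.50/M vs $0.15/M, 20% more tokens, 4 characters per token); the function names are illustrative and not part of any library.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count; intended for English prose only."""
    return math.ceil(len(text) / chars_per_token)


def effective_price_ratio(price_a: float, price_b: float, token_inflation_a: float) -> float:
    """Per-task price of model A relative to model B.

    `token_inflation_a` is how many tokens A needs per token B needs on the
    same text (1.20 means 20% more).
    """
    return (price_a * token_inflation_a) / price_b


sentence = "The quick brown fox jumps over the lazy dog."
print(estimate_tokens(sentence))
print(round(effective_price_ratio(0.50, 0.15, 1.20), 2))
```

The inflation factor has to be measured on your own prompts. The 20% here is only the example figure from the text.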

The hidden cost: tool definitions

Most discussion of token counts focuses on user input. The bigger source of cost variance in production is tool definitions.

A typical MCP server defines 5-15 tools, each with a name, a description, and a JSON Schema for inputs. Across the popular MCP servers (GitHub, Postgres, Slack, Notion, Brave Search), tool definitions easily reach 4-6K tokens. Those tokens get injected into every API call before any real prompt content.

If you're running an agent that makes 100 calls per day with 5 MCP servers connected, that's 4K-6K tokens × 100 calls = 400K-600K tokens of tool definitions per day, all billed as input. On Claude Sonnet 4.6 at $3/M input, that's $1.20-1.80/day just for tool advertisement.
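The same arithmetic as a small helper, so you can plug in your own numbers. The function name is hypothetical, and the inputs are the figures from the paragraph above.

```python
def daily_tool_overhead_usd(tool_tokens: int, calls_per_day: int, usd_per_million_input: float) -> float:
    """Daily input cost of re-sending `tool_tokens` of tool definitions on every call."""
    return tool_tokens * calls_per_day * usd_per_million_input / 1_000_000


low = daily_tool_overhead_usd(4_000, 100, 3.0)
high = daily_tool_overhead_usd(6_000, 100, 3.0)
print(f"${low:.2f}-${high:.2f} per day")
```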

The fix is prompt caching. Tool definitions are stable across calls and sit at the top of the prompt, which makes them ideal cache candidates. On a cache hit, the cached prefix is billed at ~10% (Anthropic) or 50% (OpenAI) of the uncached input rate.
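A rough sketch of what those multipliers do to the daily bill for a stable prefix. It makes two simplifying assumptions that are not from the text: the first call pays the plain input rate (any cache-write surcharge is ignored), and every later call arrives while the cache entry is still alive. The function name is illustrative.

```python
def prefix_cost_usd(
    prefix_tokens: int,
    calls: int,
    usd_per_million_input: float,
    cache_read_multiplier: float = 1.0,
) -> float:
    """Cost of sending the same prefix on every call.

    The first call is billed at the full input rate; each later call is treated
    as a cache hit billed at `cache_read_multiplier` times that rate.
    A multiplier of 1.0 models no caching at all.
    """
    if calls <= 0:
        return 0.0
    full = prefix_tokens * usd_per_million_input / 1_000_000
    return full + (calls - 1) * full * cache_read_multiplier


# 5K tokens of tool definitions, 100 calls, $3/M input.
for label, multiplier in [("uncached", 1.0), ("~10% reads", 0.10), ("50% reads", 0.50)]:
    print(label, round(prefix_cost_usd(5_000, 100, 3.0, multiplier), 4))
```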

What to measure

For any production LLM feature, log per call:

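input_tokens (on Anthropic this counts only the uncached part of the prompt; on OpenAI the input count includes cached tokens)
output_tokens
cache_read_input_tokens (Anthropic) or cached_tokens (OpenAI) for the cached portion
cache_creation_input_tokens (Anthropic only) for cache-write events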

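Your cache hit rate is the cached share of all input tokens. On Anthropic, input_tokens counts only the uncached part of the prompt, so divide cache_read_input_tokens by the sum of input_tokens, cache_read_input_tokens, and cache_creation_input_tokens. On OpenAI, cached_tokens is already included in the input count, so divide by that count directly. Aim for 70%+ on workflows with stable system prompts or RAG contexts. Anything below 50% means there's variance in the prefix where there shouldn't be.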
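A minimal sketch of the hit-rate calculation over logged usage records. It assumes Anthropic-style fields in which `input_tokens` counts only the uncached part of the prompt, so the denominator sums all three input fields. The dict below is a made-up log entry, not real API output.

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of all input tokens that were served from the prompt cache.

    Expects Anthropic-style keys; missing keys are treated as zero.
    """
    cached = usage.get("cache_read_input_tokens", 0)
    total_input = (
        usage.get("input_tokens", 0)
        + cached
        + usage.get("cache_creation_input_tokens", 0)
    )
    return cached / total_input if total_input else 0.0


# Hypothetical logged call: 4,800 cached prefix tokens plus 200 fresh ones.
logged = {
    "input_tokens": 200,
    "output_tokens": 350,
    "cache_read_input_tokens": 4_800,
    "cache_creation_input_tokens": 0,
}
print(f"{cache_hit_rate(logged):.0%}")
```

For OpenAI-style records, where the cached count is already part of the input total, the denominator would be the input total alone.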

Tools

Our token counter takes any prompt and shows per-model token counts and per-call cost for OpenAI (GPT-4o family), Anthropic (Claude 4.x), Google (Gemini 2.5), Meta (Llama 3.3), and DeepSeek. The counts are approximations within ~5-10% of true tokenizer output; for exact counts at the edge of context windows, use the model provider's official tokenizer library.

For prompt engineering that maximises cache hit rate, the five-element prompt checklist is the structural pattern that puts stable content first — exactly what caching needs.