
Anthropic Never Released Their Tokenizer. Here's What We Found Testing the Alternatives
bpe-lite accuracy benchmark — report

Date: 2026-03-19
Model tested against: claude-haiku-4-5-20251001 via the Anthropic count_tokens API
Tokenizers compared: bpe-lite (modified Xenova), ai-tokenizer (claude encoding), raw Xenova (unmodified)

1. Background

bpe-lite is a zero-dependency JS tokenizer supporting OpenAI (cl100k / o200k), Anthropic (Xenova/claude-tokenizer, 65k BPE), and Gemini (Gemma3 SPM). Anthropic has not released the Claude 4 tokenizer, so the Anthropic provider is a reverse-engineered approximation sourced from Xenova/claude-tokenizer on Hugging Face, with hand-tuned modifications. This report documents the construction of a stratified accuracy benchmark and its results.

2. Benchmark corpus

Design: 120 samples across 12 categories (10 per category):

| Category | Focus |
| --- | --- |
| english-prose | sentences, paragraphs, mixed punctuation, dialogue |
| code-python | functions, classes, decorators, f-strings, async |
| code-js | arrow functions, classes, JSX, TypeScript, async/await |
| numbers | integers, floats,
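The report's scoring code is not shown above, so here is a minimal sketch of how per-category accuracy against count_tokens ground truth might be computed. The metric (mean absolute relative error plus an exact-match rate) and the function name `summarizeCategory` are assumptions, not the report's actual implementation:

```javascript
// Hedged sketch: summarize one benchmark category's accuracy.
// `estimated` is a local tokenizer's count; `actual` is the ground-truth
// count returned by the Anthropic count_tokens API for the same sample.
function summarizeCategory(samples) {
  const errors = samples.map(
    ({ estimated, actual }) => Math.abs(estimated - actual) / actual
  );
  const meanRelError = errors.reduce((a, b) => a + b, 0) / errors.length;
  const exactMatches = samples.filter(
    (s) => s.estimated === s.actual
  ).length;
  return {
    meanRelError,                              // average |est - actual| / actual
    exactMatchRate: exactMatches / samples.length, // fraction counted exactly
  };
}

// Demo with made-up numbers (not real benchmark data):
const demo = [
  { estimated: 100, actual: 100 },
  { estimated: 95, actual: 100 },
  { estimated: 110, actual: 100 },
];
console.log(summarizeCategory(demo));
```

Aggregating per category rather than over the whole corpus is what makes the benchmark stratified: a tokenizer can be near-exact on English prose while drifting badly on numbers or code.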


