
Anthropic Never Released Their Tokenizer. Here's What We Found Testing the Alternatives
bpe-lite accuracy benchmark — report

Date: 2026-03-19
Model tested against: claude-haiku-4-5-20251001 via the Anthropic count_tokens API
Tokenizers compared: bpe-lite (modified Xenova), ai-tokenizer (claude encoding), raw Xenova (unmodified)

1. Background

bpe-lite is a zero-dependency JS tokenizer supporting OpenAI (cl100k / o200k), Anthropic (Xenova/claude-tokenizer, 65k BPE), and Gemini (Gemma3 SPM). Anthropic has not released the Claude 4 tokenizer, so the Anthropic provider is a reverse-engineered approximation sourced from Xenova/claude-tokenizer on Hugging Face, with hand-tuned modifications. This report documents the construction of a stratified accuracy benchmark and its results.

2. Benchmark corpus

Design: 120 samples across 12 categories (10 per category):

| Category | Focus |
| --- | --- |
| english-prose | sentences, paragraphs, mixed punctuation, dialogue |
| code-python | functions, classes, decorators, f-strings, async |
| code-js | arrow functions, classes, JSX, TypeScript, async/await |
| numbers | integers, floats,
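The report's scoring code is not shown above, so here is a minimal sketch of how per-category accuracy against count_tokens ground truth might be computed. The metric (mean absolute relative error plus an exact-match rate) and the function name `summarizeCategory` are assumptions, not the report's actual implementation:

```javascript
// Hedged sketch: summarize one benchmark category's accuracy.
// `estimated` is a local tokenizer's count; `actual` is the ground-truth
// count returned by the Anthropic count_tokens API for the same sample.
function summarizeCategory(samples) {
  const errors = samples.map(
    ({ estimated, actual }) => Math.abs(estimated - actual) / actual
  );
  const meanRelError = errors.reduce((a, b) => a + b, 0) / errors.length;
  const exactMatches = samples.filter(
    (s) => s.estimated === s.actual
  ).length;
  return {
    meanRelError,                              // average |est - actual| / actual
    exactMatchRate: exactMatches / samples.length, // fraction counted exactly
  };
}

// Demo with made-up numbers (not real benchmark data):
const demo = [
  { estimated: 100, actual: 100 },
  { estimated: 95, actual: 100 },
  { estimated: 110, actual: 100 },
];
console.log(summarizeCategory(demo));
```

Aggregating per category rather than over the whole corpus is what makes the benchmark stratified: a tokenizer can be near-exact on English prose while drifting badly on numbers or code.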


