
WMB-100K: We built the first 100,000-turn benchmark for AI memory systems
Most AI memory benchmarks are surprisingly small. LOCOMO tests 600 turns. LongMemEval tests around 1,000. That's roughly one week of casual usage. But real AI companions, assistants, and memory systems don't get used for a week: they get used for months. Years. What happens to memory accuracy at that scale? Nobody had tested it. So we built WMB-100K.

What it is

WMB-100K is an open-source benchmark that tests AI memory systems at 100,000 turns, roughly a year of heavy usage. It measures one thing: can your memory system find the right information when it matters? Not LLM reasoning. Not response quality. Just memory.

What makes it different

Three things set WMB-100K apart from existing benchmarks:

- Scale: 100,000 turns across 10 life categories (daily life, relationships, health, career, finances, and more)
- Difficulty levels: 5 levels, from simple fact lookup to multi-hop reasoning, across 3,134 questions
- False memory probes: 430+ questions about things that were never mentioned
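To make the false-memory-probe idea concrete, here is a minimal scoring sketch. The question format, field names, and abstention rule below are illustrative assumptions for this post, not WMB-100K's actual schema: the key point is that for a probe about something never mentioned, the only correct behavior is to say nothing.

```python
# Hypothetical sketch of scoring a memory system against false-memory
# probes. The dict fields ("is_false_probe", "expected") and the
# abstain-with-None convention are assumptions, not the WMB-100K format.

def score_answer(question, answer):
    """Return 1.0 for a correct response, 0.0 otherwise.

    For a false-memory probe (a question about something never
    mentioned), the only correct response is to abstain (None).
    """
    if question["is_false_probe"]:
        return 1.0 if answer is None else 0.0
    return 1.0 if answer == question["expected"] else 0.0

questions = [
    {"is_false_probe": False, "expected": "Berlin"},
    {"is_false_probe": True, "expected": None},  # never mentioned
]
answers = ["Berlin", "blue"]  # second answer is a confident hallucination

accuracy = sum(score_answer(q, a) for q, a in zip(questions, answers)) / len(questions)
print(accuracy)  # → 0.5
```

A system that answers every question fluently can still score zero on the probe set; that is exactly the failure mode these questions are designed to surface.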
Continue reading on Dev.to




