
Proposal: A Real Benchmark for Long-Term AI Memory Systems
The Problem

Nearly every AI memory system is publishing scores on benchmarks that don't adequately measure what they claim to measure. We audited LoCoMo and found that 6.4% of the answer key is factually wrong (99 errors in 1,540 questions), that the LLM judge accepts 63% of intentionally wrong answers, and that 56% of per-category system comparisons are statistically indistinguishable from noise (a sketch of that significance check appears after the list below). LongMemEval-S uses ~115K tokens per question, which every frontier model can hold in context; it's a better context window test than a memory test. Meanwhile, each system uses its own ingestion, its own answer generation prompt, and sometimes its own judge configuration, then publishes scores in the same table as if they shared a common methodology. The Mem0/Zep benchmark dispute illustrates this perfectly: two companies tested the same systems and arrived at wildly different numbers.

Ten Design Principles

1. Corpus must exceed context windows. 1–2 million tokens of total context, large enough to require genuine retrieval (see the token-count sketch below).
