
Proposal: A Real Benchmark for Long-Term AI Memory Systems
The Problem

Nearly every AI memory system is publishing scores on benchmarks that don't adequately measure what they claim to measure. We audited LoCoMo and found that 6.4% of the answer key is factually wrong (99 errors in 1,540 questions), that the LLM judge accepts 63% of intentionally wrong answers, and that 56% of per-category system comparisons are statistically indistinguishable from noise (a sketch of that significance check appears after the list below). LongMemEval-S uses ~115K tokens per question, which every frontier model can hold in context; it's a better context window test than a memory test. Meanwhile, each system uses its own ingestion, its own answer generation prompt, and sometimes its own judge configuration, then publishes scores in the same table as if they shared a common methodology. The Mem0/Zep benchmark dispute illustrates this perfectly: two companies tested the same systems and arrived at wildly different numbers.

Ten Design Principles

1. Corpus must exceed context windows. 1–2 million tokens of total context, large enough to require genuine retrieval (see the token-count sketch below).
