FlareStart
We audited LoCoMo: 6.4% of the answer key is wrong and the judge accepts up to 63% of intentionally

via Dev.to • Penfield • 5h ago

Projects are still submitting new scores on LoCoMo as of March 2026. We audited it and found 6.4% of the answer key is wrong, and the LLM judge accepts up to 63% of intentionally wrong answers. LongMemEval-S is often raised as an alternative, but each question's corpus fits entirely in modern context windows, making it more of a context window test than a memory test. Here's what we found.

LoCoMo

LoCoMo (Maharana et al., ACL 2024) is one of the most widely cited long-term memory benchmarks. We conducted a systematic audit of the ground truth and identified 99 score-corrupting errors in 1,540 questions (6.4%). Error categories include hallucinated facts in the answer key, incorrect temporal reasoning, and speaker attribution errors.

Examples:

The answer key specifies "Ferrari 488 GTB," but the source conversation contains only "this beauty" and the image caption reads "a red sports car." The car model exists only in an internal query field (annotator search strings for stock photos) t
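
The headline error rate can be checked from the two counts reported in the excerpt (99 errors, 1,540 questions). The short Python sketch below does that arithmetic. The "score ceiling" line rests on an assumption that is not taken from the article: that a strict judge would mark a correct answer wrong whenever the answer key itself is wrong. The function and variable names are illustrative only.

```python
# Counts reported in the audit excerpt.
TOTAL_QUESTIONS = 1540
CORRUPTED_ANSWERS = 99


def error_rate(corrupted: int, total: int) -> float:
    """Return the fraction of answer-key entries that are wrong."""
    if total <= 0:
        raise ValueError("total must be positive")
    return corrupted / total


rate = error_rate(CORRUPTED_ANSWERS, TOTAL_QUESTIONS)

# Assumption (not from the article): a strict judge penalizes every
# correct answer whose answer-key entry is wrong, so a perfect system
# could score at most 1 - rate.
ceiling = 1.0 - rate

print(f"answer-key error rate: {rate:.1%}")
print(f"implied score ceiling under the assumption: {ceiling:.1%}")
```

Under that assumption, score differences between systems that are smaller than the error rate would be hard to interpret, since they could fall within the noise introduced by the answer key.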

Continue reading on Dev.to

