I evaluated 700+ AI responses across 5 quality axes — here's the complete dataset and what it reveals

via Dev.to · Арсений Перель

This is a follow-up to my previous post about TRI·TFM Lens. Here I'm sharing the full research data behind the framework. In September 2025, I published the initial EFMNB methodology on Zenodo. Six months and 700+ evaluated responses later, here's what the data actually shows.

Scale of the Research

| Experiment | Prompts | Repeats | Total Evals | Judge model |
| --- | --- | --- | --- | --- |
| Calibration v1-v2 (Logs v5-v8) | 40+ | varied | ~190 | Gemini Flash |
| Lexeme experiments (3 batches) | 30+ | 3 | ~90 | Gemini Flash |
| Domain generalization (P1) | 10 | 3 | 30 | Gemini Flash |
| M-axis validation v1 (P2) | 20 | 3 | 46* | Gemini Flash |
| M-axis revalidation v2 (P2) | 20 | 3 | 59* | Gemini Flash |
| M-axis fixed responses (P2v3) | 10 | 5 | 50 | Gemini Flash |
| M-axis extended output (P2v4) | 20 | 3 | 60 | Gemini Flash |
| Cross-model validation (P5) | 10 | 2 | 20 | Gemini Pro |
| Final 100-prompt validation | 100 | 1 | 100 | Gemini Flash |
| Sensitivity analysis (P3) | — | — | 76×4 configs recomputed | — |
| Total | | | ~700+ | |

* Runs with JSON parse failures are marked with an asterisk.

This isn't a cherry-picked demo. It's six months of iterative experimentation.
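As a quick sanity check on the table, here is a minimal Python sketch. It is not the author's pipeline; the variable names and structure are assumptions. It re-derives the expected evaluation count for the fixed-repeat runs as prompts × repeats and flags the asterisked rows, where JSON parse failures left fewer usable evaluations than planned.

```python
# Minimal sketch (assumed structure, not the author's code): re-derive the
# expected evaluation count for runs with a fixed number of repeats and
# compare it with the totals reported in the table above.

runs = [
    # (experiment, prompts, repeats, reported total evals)
    ("Domain generalization (P1)",     10, 3,  30),
    ("M-axis validation v1 (P2)",      20, 3,  46),  # * JSON parse failures
    ("M-axis revalidation v2 (P2)",    20, 3,  59),  # * JSON parse failures
    ("M-axis fixed responses (P2v3)",  10, 5,  50),
    ("M-axis extended output (P2v4)",  20, 3,  60),
    ("Cross-model validation (P5)",    10, 2,  20),  # judged by Gemini Pro
    ("Final 100-prompt validation",   100, 1, 100),
]

for name, prompts, repeats, reported in runs:
    expected = prompts * repeats   # planned number of judge calls
    lost = expected - reported     # > 0 means some judge outputs failed to parse
    status = f"{lost} lost to parse failures" if lost else "complete"
    print(f"{name}: {reported}/{expected} evals ({status})")
```

On the table's numbers this reports 14 and 1 lost evaluations for the two asterisked M-axis runs and "complete" for the rest, matching the footnote above.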
