I evaluated 700+ AI responses across 5 quality axes — here's the complete dataset and what it reveals

via Dev.to · Арсений Перель

This is a follow-up to my previous post about TRI·TFM Lens. Here I'm sharing the full research data behind the framework. In September 2025, I published the initial EFMNB methodology on Zenodo. Six months and 700+ evaluated responses later, here's what the data actually shows.

Scale of the Research

| Experiment | Prompts | Repeats | Total Evals | Judge model |
| --- | --- | --- | --- | --- |
| Calibration v1-v2 (Logs v5-v8) | 40+ | varied | ~190 | Gemini Flash |
| Lexeme experiments (3 batches) | 30+ | 3 | ~90 | Gemini Flash |
| Domain generalization (P1) | 10 | 3 | 30 | Gemini Flash |
| M-axis validation v1 (P2) | 20 | 3 | 46* | Gemini Flash |
| M-axis revalidation v2 (P2) | 20 | 3 | 59* | Gemini Flash |
| M-axis fixed responses (P2v3) | 10 | 5 | 50 | Gemini Flash |
| M-axis extended output (P2v4) | 20 | 3 | 60 | Gemini Flash |
| Cross-model validation (P5) | 10 | 2 | 20 | Gemini Pro |
| Final 100-prompt validation | 100 | 1 | 100 | Gemini Flash |
| Sensitivity analysis (P3) | — | — | 76×4 configs recomputed | — |
| Total | | | ~700+ | |

* Runs with JSON parse failures are marked with an asterisk.

This isn't a cherry-picked demo. It's six months of iterative experimentation.
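As a quick sanity check on the table, here is a minimal Python sketch. It is not the author's pipeline; the variable names and structure are assumptions. It re-derives the expected evaluation count for the fixed-repeat runs as prompts × repeats and flags the asterisked rows, where JSON parse failures left fewer usable evaluations than planned.

```python
# Minimal sketch (assumed structure, not the author's code): re-derive the
# expected evaluation count for runs with a fixed number of repeats and
# compare it with the totals reported in the table above.

runs = [
    # (experiment, prompts, repeats, reported total evals)
    ("Domain generalization (P1)",     10, 3,  30),
    ("M-axis validation v1 (P2)",      20, 3,  46),  # * JSON parse failures
    ("M-axis revalidation v2 (P2)",    20, 3,  59),  # * JSON parse failures
    ("M-axis fixed responses (P2v3)",  10, 5,  50),
    ("M-axis extended output (P2v4)",  20, 3,  60),
    ("Cross-model validation (P5)",    10, 2,  20),  # judged by Gemini Pro
    ("Final 100-prompt validation",   100, 1, 100),
]

for name, prompts, repeats, reported in runs:
    expected = prompts * repeats   # planned number of judge calls
    lost = expected - reported     # > 0 means some judge outputs failed to parse
    status = f"{lost} lost to parse failures" if lost else "complete"
    print(f"{name}: {reported}/{expected} evals ({status})")
```

On the table's numbers this reports 14 and 1 lost evaluations for the two asterisked M-axis runs and "complete" for the rest, matching the footnote above.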
