
Benchmarking LFM2.5-Thinking on GSM8k (early result)
I have a secret passion for LFM2.5-Thinking. It's a tiny 1.2B reasoning model, it's fast, and it's good. Really good. My tests are still in progress, so all I can do is share some early results. I use the public GSM8k dataset, but with my own benchmarking scripts.

What is the GSM8k benchmark? Grade School Math 8K is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems requiring multi-step reasoning and elementary arithmetic operations. (See the dataset and the paper.)

Some public benchmark results: the top-10 leaderboard in 2026 goes up to 97%. Take note of the massive context sizes. For comparison, this is what "state of the art" looked like in 2021: barely 35%.

Some early results:

Questions: 1319 (test)
Context sizes to test: [1000, 2000, 3000, 4000, 5000, 6000, 7000]
Endpoint: http://192.168.1.110:8000 / lfm2.5-thinking

=== max_tokens=1000 ===
[200/1319] acc=135/200 (67.5%) rate=3.9q/s
[400/1319] acc=251/400 (62.8%) rate=4.6q/s
[600/1319] acc=387/600 (64.5%) rate=5.0q/s
[8
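For the curious, grading GSM8k is mostly about comparing final numbers. The post doesn't show the actual benchmarking scripts, so here is a minimal, hypothetical sketch of the grading side: GSM8k reference solutions end with a `#### <answer>` line, and a common convention is to take the last number in the model's reply as its final answer. The function names and the sample strings are my own assumptions, not the author's code; the real script would additionally POST each question to the OpenAI-compatible endpoint and feed the reply into `is_correct`.

```python
# Hypothetical sketch of GSM8k grading (not the author's actual script).
# The real benchmark loop would send each question to the endpoint
# (e.g. http://192.168.1.110:8000) and grade the reply as below.
import re

def gold_answer(solution: str) -> str:
    """GSM8k reference solutions end with '#### <answer>'; extract it."""
    return solution.rsplit("####", 1)[-1].strip().replace(",", "")

def model_answer(reply: str):
    """Common heuristic: the last number in the reply is the final answer."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", reply.replace(",", ""))
    return nums[-1] if nums else None

def is_correct(reply: str, solution: str) -> bool:
    pred = model_answer(reply)
    return pred is not None and float(pred) == float(gold_answer(solution))

# Tiny self-check on a made-up sample (not a real GSM8k item):
solution = "Natalia sold 48 / 2 = 24 clips in May. 48 + 24 = 72.\n#### 72"
print(is_correct("... so the answer is 72.", solution))  # True
print(is_correct("... so the answer is 70.", solution))  # False
```

Accuracy at each `max_tokens` setting is then just the fraction of questions where `is_correct` returns `True`, which is what the `acc=135/200 (67.5%)` lines report.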



