I Tested 12 LLMs With Few-Shot Examples. The Results Were Not What I Expected.


via Dev.to, by Shuntaro Okuma

In a previous article, I tested 8 models across 4 tasks and reported on "few-shot collapse": cases where adding few-shot examples actually degrades LLM performance. This time, I expanded the experiment to 12 models (6 cloud + 6 local) and 5 tasks to see whether those findings hold at a larger scale. They do, and I found even more dramatic cases, including a model that dropped from 93% to 30% with more examples.

What I tested

I evaluated 12 models, 6 cloud APIs and 6 local models, across 5 tasks designed to mirror real business scenarios.

Cloud models: Claude Haiku 4.5, Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3 Flash, GPT-4o-mini, GPT-5.4-mini

Local models: Gemma 3 27B, LLaMA 4 Scout (17B active, MoE), GPT-OSS 120B, Qwen 3.5 (35B total / 3B active, MoE), Ministral 3 14B Reasoning, Phi-4 Reasoning Plus

Tasks:

Classification: categorize customer support inquiries into specific categories (exact match scoring)

Code Fix: identify and fix bugs in short Python functions

Route Optim
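To make the setup concrete, here is a minimal sketch of the kind of evaluation loop described above: prepend k few-shot examples to each classification prompt and score by exact match. The prompt format and the `call_model` function are assumptions for illustration, not the article's actual harness; `call_model` stands in for any cloud API or local model call.

```python
def build_prompt(examples, query):
    """Format k labeled few-shot examples, then the query to classify.

    `examples` is a list of (inquiry_text, category_label) pairs; with an
    empty list this degenerates to a zero-shot prompt.
    """
    shots = "\n\n".join(
        f"Inquiry: {text}\nCategory: {label}" for text, label in examples
    )
    query_block = f"Inquiry: {query}\nCategory:"
    return f"{shots}\n\n{query_block}" if shots else query_block


def exact_match_accuracy(call_model, examples, test_set):
    """Score a model on (query, label) pairs.

    A prediction counts only if it matches the gold label exactly after
    stripping whitespace -- the scoring rule named for the classification task.
    """
    hits = 0
    for query, label in test_set:
        prediction = call_model(build_prompt(examples, query)).strip()
        hits += prediction == label
    return hits / len(test_set)
```

Running this with the same test set at k = 0, 5, 10, ... is what surfaces few-shot collapse: accuracy that falls, rather than rises, as `examples` grows.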

