I Tested 12 LLMs With Few-Shot Examples. The Results Were Not What I Expected.


via Dev.to, by Shuntaro Okuma

In a previous article, I tested 8 models across 4 tasks and reported on "few-shot collapse": cases where adding few-shot examples actually degrades LLM performance. This time, I expanded the experiment to 12 models (6 cloud + 6 local) and 5 tasks to see whether those findings hold at a larger scale. They do, and I found even more dramatic cases, including a model that dropped from 93% to 30% with more examples.

What I tested

I evaluated 12 models, 6 cloud APIs and 6 local models, across 5 tasks designed to mirror real business scenarios.

Cloud models: Claude Haiku 4.5, Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3 Flash, GPT-4o-mini, GPT-5.4-mini

Local models: Gemma 3 27B, LLaMA 4 Scout (17B active, MoE), GPT-OSS 120B, Qwen 3.5 (35B total / 3B active, MoE), Ministral 3 14B Reasoning, Phi-4 Reasoning Plus

Tasks:

Classification: categorize customer support inquiries into specific categories (exact match scoring)

Code Fix: identify and fix bugs in short Python functions

Route Optim
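To make the setup concrete, here is a minimal sketch of the kind of evaluation loop described above: prepend k few-shot examples to each classification prompt and score by exact match. The prompt format and the `call_model` function are assumptions for illustration, not the article's actual harness; `call_model` stands in for any cloud API or local model call.

```python
def build_prompt(examples, query):
    """Format k labeled few-shot examples, then the query to classify.

    `examples` is a list of (inquiry_text, category_label) pairs; with an
    empty list this degenerates to a zero-shot prompt.
    """
    shots = "\n\n".join(
        f"Inquiry: {text}\nCategory: {label}" for text, label in examples
    )
    query_block = f"Inquiry: {query}\nCategory:"
    return f"{shots}\n\n{query_block}" if shots else query_block


def exact_match_accuracy(call_model, examples, test_set):
    """Score a model on (query, label) pairs.

    A prediction counts only if it matches the gold label exactly after
    stripping whitespace -- the scoring rule named for the classification task.
    """
    hits = 0
    for query, label in test_set:
        prediction = call_model(build_prompt(examples, query)).strip()
        hits += prediction == label
    return hits / len(test_set)
```

Running this with the same test set at k = 0, 5, 10, ... is what surfaces few-shot collapse: accuracy that falls, rather than rises, as `examples` grows.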

