
# When More Examples Make Your LLM Worse: Discovering Few-Shot Collapse
Here's something everyone agrees on about few-shot prompting: give the model more examples and it performs better. I believed that too. Then I measured it. So I built AdaptGauge, an open-source tool that measures how efficiently LLMs learn from few-shot examples.

## What I tested

I evaluated eight models across four tasks designed to mirror real business scenarios, at shot counts of 0, 1, 2, 4, and 8:

- **Classification**: Categorize customer support inquiries into one of 8 categories (billing, technical support, returns, etc.)
- **Code Fix**: Identify and fix bugs in short Python functions (off-by-one errors, missing edge cases)
- **Summarization**: Extract key points from Japanese news articles into bullet-point summaries
- **Route Optimization**: Calculate optimal delivery routes across multiple destinations with time windows and fuel costs

Models tested:

- **Cloud APIs**: Claude Haiku 4.5, Claude Opus 4.5, Gemini 2.5 Flash, Gemini 3 Flash, Gemini 3 Pro
- **Local models**: Gemma 3 27B, GPT-OSS 120B, Qwen3-VL 8B

For e
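To make the setup concrete, here is a minimal sketch of the kind of shot-count sweep described above. It is not AdaptGauge's actual code; the `complete` callable, the `Input:`/`Output:` prompt template, and the exact-match scoring are all assumptions standing in for whatever client and grading a real harness would use.

```python
# Sketch of a shot-count sweep, assuming a generic `complete(prompt) -> str`
# model call (hypothetical; swap in your actual API client).
from typing import Callable

SHOT_COUNTS = [0, 1, 2, 4, 8]

def build_prompt(instruction: str, examples: list[tuple[str, str]],
                 query: str, n_shots: int) -> str:
    """Prepend the first n_shots worked examples to the query."""
    parts = [instruction]
    for x, y in examples[:n_shots]:
        parts.append(f"Input: {x}\nOutput: {y}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

def sweep(complete: Callable[[str], str], instruction: str,
          examples: list[tuple[str, str]],
          test_set: list[tuple[str, str]]) -> dict[int, float]:
    """Exact-match accuracy at each shot count.

    Accuracy that *drops* as the shot count grows is the
    "few-shot collapse" pattern the article is measuring.
    """
    results: dict[int, float] = {}
    for n in SHOT_COUNTS:
        correct = sum(
            complete(build_prompt(instruction, examples, q, n)).strip() == gold
            for q, gold in test_set
        )
        results[n] = correct / len(test_set)
    return results
```

Keeping the example pool and test set fixed while varying only `n_shots` isolates the effect of example count from example choice, which is the comparison the 0/1/2/4/8 design needs.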




