
# LLMs Don't Grade Essays Like Humans — But Here's What They're Actually Good At (API Tutorial)
## arXiv Bombshell: LLMs Fail at Essay Grading

On March 24, 2026, researchers published a paper making waves in academic and developer circles: "LLMs Do Not Grade Essays Like Humans". The study evaluated GPT and Llama family models on automated essay scoring (AES) in out-of-the-box settings: no fine-tuning, no task-specific prompting. The finding: agreement between LLM scores and human scores remains relatively weak. LLMs tend to assign higher scores to short or underdeveloped essays, while penalizing longer essays with minor grammatical errors. The models follow internally coherent patterns, but those patterns don't align with how human raters actually think.

## What This Means for Developers

This doesn't mean LLMs are useless for education or writing tools. It means developers need to use them for the right tasks.

What LLMs ARE reliable for in writing contexts:

- **Essay generation and variation**: creating draft content, generating multiple versions, producing training data at scale
- Writin
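As a minimal sketch of the "generation and variation" use case above, the helper below builds the chat-style message list for asking an LLM to rewrite a draft essay several ways. The function name, prompt wording, and `n_variations` parameter are illustrative assumptions, not from the paper or any specific SDK; the resulting list is the shape accepted by common chat-completion APIs, and the network call itself is deliberately left out.

```python
# Hypothetical helper for the "essay generation and variation" use case.
# Names and prompt wording are illustrative, not taken from the article.

def build_variation_messages(essay: str, n_variations: int = 3) -> list[dict]:
    """Build a chat-style message list requesting n rewrites of a draft essay."""
    system = (
        "You are a writing assistant. Produce stylistic variations of the "
        "user's essay without changing its meaning."
    )
    user = (
        f"Rewrite the following essay {n_variations} times, "
        f"separating each version with '---':\n\n{essay}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

if __name__ == "__main__":
    # The returned list would be passed as the `messages` argument of a
    # chat-completions call; here we only inspect its structure.
    messages = build_variation_messages("The quick brown fox.", n_variations=2)
    print(messages[0]["role"])  # → system
```

Note the division of labor this implies: the LLM generates and varies text, while any *scoring* of the results stays with human raters or a purpose-built model, in line with the study's findings.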



