
# LLMs Can't Grade Essays Like Humans — But Here's What AI Does Better (With Free API)
## The Research Is In: LLMs Struggle at Essay Grading

A new paper published on arXiv on March 24, 2026 drops a bombshell for anyone building AI-powered education tools: "LLMs Do Not Grade Essays Like Humans". Researchers evaluated GPT and Llama family models against human graders in out-of-the-box settings, with no fine-tuning and no task-specific training. The verdict? Agreement between LLM scores and human scores remains "relatively weak."

Specifically, LLMs tend to over-score short or underdeveloped essays and under-score longer essays with minor grammatical errors. They follow coherent internal patterns (essays they praise tend to score higher), but those patterns diverge significantly from how human raters think.

This is a wake-up call. But it's also a clarifying moment: it tells us exactly where AI should and shouldn't be deployed.

## What LLMs Are Actually Bad At

- **Subjective evaluation**: Grading requires nuanced human judgment that LLMs can't reliably replicate
- **Rubric-based scoring**: LLMs



