
The Story of Making AI Indistinguishable from Humans: Implementing a Turing Test with LLM Judges
## Starting from HL 4.1

The first prototype of the human persona scored 4.1 out of 10 on "human-likeness," far below the threshold that separates "AI-like" from "human-like." And yes, I built it and scored it myself. It took five versions to raise this to HL 7.7. This article is the story of that journey: what I tried, what didn't work, and what worked dramatically.

## Evaluation Method: LLM Judge

I had Claude Sonnet act as an "expert in distinguishing humans from AI" and score the outputs:

```python
JUDGE_PROMPT = """
You are an expert in distinguishing humans from AI.
Evaluate the following message and respond with JSON only:
{
  "human_likeness_score": 1-10,
  "style_variation_rate": 0.0-1.0,
  "timing_naturalness": 1-10,
  "reason_human_likeness": "Reason in one sentence",
  "improvement_suggestion": "Improvement suggestion in one sentence"
}
"""
```

Three metrics:

| Metric | Meaning | Target Value |
| --- | --- | --- |
| HL (human_likeness_score) | How non-AI-like it is | 7.5 or higher |
| SV (style_variation_rate) | Not too homogeneous | |
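The judge prompt asks for JSON only, but models sometimes wrap JSON in extra text, so the reply needs a little defensive parsing before the scores can be checked against the targets. Here is a minimal sketch of that step; the field names come from the prompt above, while the `parse_judge_reply` / `passes` helpers and the sample reply are hypothetical (the actual API call to the judge model is omitted):

```python
import json

# Target threshold for HL (human_likeness_score), per the metrics table.
TARGET_HL = 7.5

def parse_judge_reply(raw: str) -> dict:
    """Extract the JSON object from the judge's reply.

    Models occasionally wrap the JSON in prose, so slice from the first
    '{' to the last '}' before parsing.
    """
    start, end = raw.find("{"), raw.rfind("}")
    return json.loads(raw[start:end + 1])

def passes(scores: dict) -> bool:
    """True when the message clears the human-likeness target."""
    return scores["human_likeness_score"] >= TARGET_HL

# Hypothetical judge reply for a late-version persona (HL 7.7).
reply = (
    '{"human_likeness_score": 7.7, "style_variation_rate": 0.4, '
    '"timing_naturalness": 8, "reason_human_likeness": "...", '
    '"improvement_suggestion": "..."}'
)
scores = parse_judge_reply(reply)
print(passes(scores))  # → True: HL 7.7 clears the 7.5 bar
```

Running the same check against the first prototype's HL of 4.1 would return `False`, which is exactly the gap the rest of this article is about closing.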




