
# I Ran the Same LLM Prompt Every Day for 30 Days: Here's What Changed
I ran the exact same prompt every day for 30 days on GPT-4o, measuring output similarity, cost, and latency. Here is what I found.

## The Experiment

- Same prompt: "Summarize this: [random 500-word article]"
- Same model: GPT-4o
- Same 10 articles per day
- 30 days of data

I measured:

- Cosine similarity between outputs (was the meaning the same?)
- Token count (was the length the same?)
- Latency (was the speed the same?)
- Cost (was it consistent?)

## The Results

### Output Similarity

The outputs were semantically similar but not identical. On a scale of 0 (completely different) to 1 (identical), daily similarity averaged 0.87. On two days, similarity dropped below 0.75, and both times the model had clearly changed its summarization style: shorter summaries, more direct language.

### Token Count

- Average output tokens: 78
- Standard deviation: 12
- Range: 52-124 tokens

Day 12 had unusually short outputs (52-61 tokens). Day 23 had unusually long outputs (98-124 tokens).

### Latency

Average latency: 2.3 seconds
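
For anyone who wants to reproduce the setup, here is a minimal sketch of the kind of measurement loop involved. It assumes the official `openai` Python SDK; the function name `run_daily_batch` and the shape of the records are my own labels, not taken from the original harness.

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def run_daily_batch(articles: list[str]) -> list[dict]:
    """Run the fixed summarization prompt over each of the day's articles,
    recording the output text, completion token count, and wall-clock latency."""
    records = []
    for article in articles:
        start = time.perf_counter()
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Summarize this: {article}"}],
        )
        latency = time.perf_counter() - start
        records.append({
            "output": resp.choices[0].message.content,
            "output_tokens": resp.usage.completion_tokens,
            "latency_s": latency,
        })
    return records

# Usage: call run_daily_batch(articles) once per day with the same 10
# ~500-word articles, and persist the records for later comparison.
```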
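
Scoring day-over-day similarity can be done by embedding each day's summaries and comparing them article-by-article. This is a sketch under one assumption: outputs are embedded with OpenAI's `text-embedding-3-small` model and compared with plain cosine similarity, which may differ from the exact method used here.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of summaries into one vector per text."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def daily_similarity(day_a_outputs: list[str], day_b_outputs: list[str]) -> float:
    """Mean cosine similarity between two days' summaries of the same
    10 articles, paired article-by-article."""
    vecs_a, vecs_b = embed(day_a_outputs), embed(day_b_outputs)
    return float(np.mean([cosine_similarity(a, b) for a, b in zip(vecs_a, vecs_b)]))
```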
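
The token-count figures above (mean, standard deviation, range) fall out of a few lines of standard-library Python over the recorded `output_tokens` values:

```python
import statistics


def summarize_tokens(token_counts: list[int]) -> dict:
    """Summary stats matching the token-count metrics reported above."""
    return {
        "mean": statistics.mean(token_counts),
        "stdev": statistics.stdev(token_counts),
        "range": (min(token_counts), max(token_counts)),
    }
```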

