
# I Ran the Same LLM Prompt Every Day for 30 Days: Here's What Changed
I ran the exact same prompt every day for 30 days on GPT-4o, measuring output similarity, cost, and latency. Here is what I found.

## The Experiment

- Same prompt: "Summarize this: [random 500-word article]"
- Same model: GPT-4o
- Same 10 articles per day
- 30 days of data

I measured:

- Cosine similarity between outputs (was the meaning the same?)
- Token count (was the length the same?)
- Latency (was the speed the same?)
- Cost (was it consistent?)

## The Results

### Output Similarity

The outputs were semantically similar but not identical. On a scale of 0 (completely different) to 1 (identical), daily similarity averaged 0.87. On two days, similarity dropped below 0.75, and both times the model had clearly changed its summarization style: shorter summaries, more direct language.

### Token Count

- Average output tokens: 78
- Standard deviation: 12
- Range: 52-124 tokens

Day 12 had unusually short outputs (52-61 tokens). Day 23 had unusually long outputs (98-124 tokens).

### Latency

Average latency: 2.3 seconds
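
For anyone who wants to reproduce the setup, here is a minimal sketch of the kind of measurement loop involved. It assumes the official `openai` Python SDK; the function name `run_daily_batch` and the shape of the records are my own labels, not taken from the original harness.

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def run_daily_batch(articles: list[str]) -> list[dict]:
    """Run the fixed summarization prompt over each of the day's articles,
    recording the output text, completion token count, and wall-clock latency."""
    records = []
    for article in articles:
        start = time.perf_counter()
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Summarize this: {article}"}],
        )
        latency = time.perf_counter() - start
        records.append({
            "output": resp.choices[0].message.content,
            "output_tokens": resp.usage.completion_tokens,
            "latency_s": latency,
        })
    return records

# Usage: call run_daily_batch(articles) once per day with the same 10
# ~500-word articles, and persist the records for later comparison.
```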
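
Scoring day-over-day similarity can be done by embedding each day's summaries and comparing them article-by-article. This is a sketch under one assumption: outputs are embedded with OpenAI's `text-embedding-3-small` model and compared with plain cosine similarity, which may differ from the exact method used here.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of summaries into one vector per text."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def daily_similarity(day_a_outputs: list[str], day_b_outputs: list[str]) -> float:
    """Mean cosine similarity between two days' summaries of the same
    10 articles, paired article-by-article."""
    vecs_a, vecs_b = embed(day_a_outputs), embed(day_b_outputs)
    return float(np.mean([cosine_similarity(a, b) for a, b in zip(vecs_a, vecs_b)]))
```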
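
The token-count figures above (mean, standard deviation, range) fall out of a few lines of standard-library Python over the recorded `output_tokens` values:

```python
import statistics


def summarize_tokens(token_counts: list[int]) -> dict:
    """Summary stats matching the token-count metrics reported above."""
    return {
        "mean": statistics.mean(token_counts),
        "stdev": statistics.stdev(token_counts),
        "range": (min(token_counts), max(token_counts)),
    }
```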

