
When Synthetic Data Lies: A Hidden Correlation Problem I Didn’t Expect
While working on a small analytics setup using ClickHouse and Superset, I generated some synthetic data to test queries and dashboards. Initially, everything looked fine. The distributions seemed reasonable, and the dashboards behaved as expected.

But as I increased the dataset size, a few patterns started to look off. Revenue seemed to concentrate in a single country. In some cases, certain countries had no purchases at all. At first, it looked like a simple distribution issue. But the patterns were too consistent to ignore.

Checking the Usual Suspects

The first assumption was that something was wrong with the queries or aggregations. So I checked:

- query logic
- filters
- materialized views
- dashboard configurations

Everything seemed correct. Which pointed to a different possibility: The issue wasn’t in how the data was queried - it was in how the data was generated.

Looking at the Data More Closely

Instead of relying on dashboards, I went back to the raw data. A simple aggregation made th
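
The kind of raw-data check described above, grouping rows by country and looking at revenue share and purchase counts, might look roughly like the following. This is only a minimal sketch in plain Python, not the aggregation used in the original setup: the row shape, the field names (`country`, `event`, `amount`), and the sample values are all hypothetical.

```python
from collections import defaultdict


def revenue_by_country(rows):
    """Sum revenue and count purchases per country from raw event rows."""
    totals = defaultdict(lambda: {"revenue": 0.0, "purchases": 0})
    for row in rows:
        # Looking up the country registers it even if it never purchases.
        stats = totals[row["country"]]
        if row["event"] == "purchase":
            stats["revenue"] += row["amount"]
            stats["purchases"] += 1
    return dict(totals)


def summarize(rows):
    """Return each country's share of total revenue and its purchase count."""
    totals = revenue_by_country(rows)
    grand_total = sum(s["revenue"] for s in totals.values())
    report = {}
    for country, s in totals.items():
        share = s["revenue"] / grand_total if grand_total else 0.0
        report[country] = {"share": share, "purchases": s["purchases"]}
    return report


# Hypothetical sample rows, standing in for the generated dataset.
rows = [
    {"country": "US", "event": "purchase", "amount": 30.0},
    {"country": "US", "event": "purchase", "amount": 50.0},
    {"country": "DE", "event": "view", "amount": 0.0},
    {"country": "FR", "event": "purchase", "amount": 20.0},
]

report = summarize(rows)
for country, stats in sorted(report.items()):
    flag = "  <- no purchases" if stats["purchases"] == 0 else ""
    print(f"{country}: share={stats['share']:.2f} purchases={stats['purchases']}{flag}")
```

A country whose share stays near 1.0, or whose purchase count stays at zero as the dataset grows, is the sort of pattern that suggests a problem in generation rather than in the queries.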
Continue reading on Dev.to


