
# Stop Hand-Tuning ETL Batch Sizes. Use PID Control Instead.

You've done this before. You need to batch-process a large dataset. You pick a chunk size, maybe 1000, maybe 10000, run a quick test, it looks fine, and you ship it. Three weeks later, your pipeline is crawling at 15% CPU while you're paying for 8 cores. Or it's randomly OOM-crashing on Tuesday nights when the dataset is slightly wider than usual.

This is the static batch size problem, and it's more expensive than most teams realize.

## What's actually happening

When you hard-code a batch size, you're making a bet: "This number will be optimal on every run, on every machine, under every memory condition, forever." That's never true. The optimal chunk size is a function of:

- Current available memory
- How heavy the transformation is for this batch
- How many other jobs are competing for resources
- Row width variation in the dataset

No static number wins across all these dimensions. You need continuous adaptation.

## Enter PID control

PID (Proportional-Integral-Derivative) control is a feedback technique that continuously adjusts a control output to drive a measured value toward a target setpoint.



