
# How to Size a Spark Cluster. And How Not To.
Interviewer: *You need to process 1 TB of data in Spark. How do you size the cluster?*

Most answers start with division:

1 TB → choose 128 MB partitions → calculate ~8,000 partitions → map to cores → decide number of nodes

It is clean. It is logical. It is also incomplete. Because cluster size is not derived from data size. It is derived from workload behavior.

Here is how I actually answer this question in production.

## Step 1: Clarify Which "1 TB" We're Talking About

When someone says "1 TB," there are multiple meanings hiding inside that number. Before sizing anything, I separate at least five different sizes.

### 1. Stored Size on Disk

1 TB of compressed Parquet in object storage tells me very little. Columnar formats like Parquet are compressed and encoded. That size reflects storage efficiency, not runtime footprint.

### 2. Effective Scan Size After Pruning

Are we scanning the entire dataset? Or is the query using:

- Partition pruning, which skips directory partitions based on filters
- Predicate pushdown, which skips row groups based on column statistics
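The "division" answer above can be sketched in a few lines of arithmetic. The 128 MB figure matches Spark's default `spark.sql.files.maxPartitionBytes`; the cores-per-executor and wave counts below are illustrative assumptions, not recommendations.

```python
# A sketch of the naive sizing-by-division answer. No Spark required;
# this is the back-of-the-envelope math the article describes.
TB = 1024 ** 4
MB = 1024 ** 2

data_size = 1 * TB           # the "1 TB" the interviewer quotes
partition_size = 128 * MB    # Spark's default maxPartitionBytes

partitions = data_size // partition_size   # the "~8,000 partitions" step

cores_per_executor = 5       # illustrative assumption
waves = 4                    # assumed task waves we are willing to run per core
cores_needed = partitions // waves
executors = -(-cores_needed // cores_per_executor)  # ceiling division

print(partitions)  # 8192
print(executors)   # 410
```

The point of the article is exactly that this arithmetic, while internally consistent, is answering the wrong question: it sizes for bytes, not for workload behavior.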
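The gap between stored size and runtime footprint can also be made concrete. The compression ratio and pruning fraction below are loud assumptions for illustration (real ratios vary widely by schema and data), but they show why pruning and decompression can cancel each other out.

```python
# Why "1 TB on disk" says little by itself: pruning shrinks the scan,
# decompression inflates it. A 5x compression ratio and a query that
# survives pruning down to 20% of partitions are assumed, not measured.
TB = 1024 ** 4

stored_size = 1 * TB
compression_ratio = 5.0   # assumed Parquet compression ratio
scan_fraction = 0.2       # assumed fraction left after partition pruning

effective_scan = stored_size * scan_fraction       # bytes actually read
in_memory = effective_scan * compression_ratio     # rough decompressed footprint

print(in_memory / TB)  # ≈ 1.0: the pruned scan still decompresses to ~1 TB
```

Under these assumptions, a query that touches only a fifth of a 1 TB dataset can still materialize roughly a terabyte once the columnar encoding is undone, which is why the stored size alone cannot drive the sizing decision.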
Continue reading on Dev.to




