
Spark ETL Framework: ETL Patterns Guide — Spark ETL Framework
A practical guide to building reliable, scalable data pipelines with the medallion architecture pattern.

By Datanest Digital

Medallion Architecture

The medallion (multi-hop) architecture organises data into three layers:

| Layer | Purpose | Data Quality | Schema |
| --- | --- | --- | --- |
| Bronze | Raw ingestion: land data as-is | Unvalidated | Inferred |
| Silver | Cleaned, conformed, deduplicated | Validated | Enforced |
| Gold | Business-level aggregates | Trusted | Optimised |

Why three layers?

- Auditability: Bronze retains the original data for replay or debugging.
- Decoupling: Consumers read from Gold, so ingestion changes don't break dashboards.
- Quality escalation: Each layer adds more trust, enforced by quality gates between layers.

Idempotency

Every pipeline step should be safe to re-run without producing duplicates or corrupted state.

Strategies

| Strategy | When to use |
| --- | --- |
| MERGE (upsert) | SCD Type 1: overwrite on natural key |
| SCD Type 2 merge | Need full history of changes |
| Overwrite partition | Gold aggregates partitioned by da
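The MERGE (upsert) strategy listed above can be illustrated without a cluster. What follows is a minimal sketch, not the framework's own code: it uses plain Python dicts as a stand-in for a Spark or Delta table, and the `upsert` helper, the `customer_id` key, and the sample rows are all hypothetical names chosen for illustration. It shows the property that matters for idempotency: applying the same batch twice leaves the target in the same state as applying it once.

```python
from typing import Dict, Iterable, Mapping


def upsert(target: Dict[object, dict], batch: Iterable[Mapping[str, object]], key: str) -> Dict[object, dict]:
    """SCD Type 1 style upsert (hypothetical helper, dict-based stand-in for MERGE).

    Rows whose natural key already exists are overwritten; new keys are inserted.
    """
    for row in batch:
        # Matched key: overwrite. Unmatched key: insert.
        target[row[key]] = dict(row)
    return target


# Hypothetical target table and incoming batch, keyed on customer_id.
customers = {"c1": {"customer_id": "c1", "city": "Leeds"}}
batch = [
    {"customer_id": "c1", "city": "York"},  # update to an existing key
    {"customer_id": "c2", "city": "Bath"},  # new key
]

once = upsert(dict(customers), batch, key="customer_id")
twice = upsert(dict(once), batch, key="customer_id")  # simulate re-running the step

# Re-running with the same batch changes nothing: no duplicates, same values.
print(once == twice)
```

An append-only write would have produced four rows after the second run. Keying the write on the natural key is what makes the step safe to retry.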



