
Hardcore ETL: Taming 5GB+ of Apple Health XML Data with DuckDB and dbt
So you decided to export your Apple Health data. You expected a neat CSV or a friendly JSON, but instead you were greeted by a massive, bloated 5GB+ XML file that makes Excel cry and VS Code freeze. In this guide, we build a high-performance ETL pipeline to transform that chaotic XML into a structured personal data warehouse. We'll use the "Modern Data Stack for local machines": DuckDB for lightning-fast processing, dbt for modeling, and Apache Parquet for efficient storage. By the end, you'll be doing data engineering on your own heartbeat, steps, and sleep patterns like a pro.

The Architecture: From Raw XML to Structured SQL

Before we dive into the code, let's look at the data flow. We need to move from a hierarchical, redundant XML format to a columnar, analytical one.

graph TD
  A[Apple Health Export.xml] -->|Python Streaming Parser| B(Apache Parquet)
  B -->|DuckDB External Table| C[dbt Seed/Stage]
  C -->|SQL Transformation| D[dbt Marts: Daily Metrics]
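The first arrow in that diagram is the critical one: the export is far too large for `ET.parse()`, so the parser has to stream. Here is a minimal sketch of the streaming approach using the standard library's `xml.etree.ElementTree.iterparse`; the sample XML, the `stream_records` helper, and the record attributes shown are illustrative stand-ins (Apple Health's real export uses `<Record>` elements with attributes like `type`, `startDate`, and `value`, but your file will contain many more fields and record types).

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Tiny stand-in for the multi-GB export.xml; real exports have the same
# flat <Record .../> structure repeated millions of times.
SAMPLE = b"""<?xml version="1.0"?>
<HealthData>
  <Record type="HKQuantityTypeIdentifierStepCount" sourceName="iPhone"
          startDate="2024-01-01 08:00:00 +0000" endDate="2024-01-01 08:10:00 +0000"
          value="412"/>
  <Record type="HKQuantityTypeIdentifierHeartRate" sourceName="Watch"
          startDate="2024-01-01 08:05:00 +0000" endDate="2024-01-01 08:05:00 +0000"
          value="72"/>
</HealthData>"""

def stream_records(source):
    """Yield each <Record>'s attributes without loading the whole tree into memory."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Record":
            yield dict(elem.attrib)
            elem.clear()  # free the element right away -- this keeps memory flat on 5GB+ files

records = list(stream_records(BytesIO(SAMPLE)))
print(len(records), records[0]["type"])
```

In the real pipeline you would not collect everything into a list: batch the yielded dicts (say, 100k at a time) into a `pyarrow.Table` and append each batch to a Parquet file, which DuckDB can then query directly with `read_parquet(...)`. The `elem.clear()` call is what makes this work at scale; without it, `iterparse` still builds the full tree behind your back.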




