
PECOS Data Extraction Pipeline - DevOps Documentation
Overview

The PECOS Data Extraction Pipeline is an enterprise-grade ETL workflow that extracts, transforms, and loads healthcare provider data from CMS PECOS datasets. The pipeline processes four datasets (Clinicians, Practices, Canonical Providers, Detail Tables) in parallel using PySpark and AWS Glue, orchestrated by AWS Step Functions.

Repository: https://github.com/durrello/PECOS-Data-Extraction-Pipeline

Architecture

AWS Architecture (Primary)

    ┌─────────────────────────────────────┐
    │         AWS Step Functions          │
    │     State Machine Orchestrator      │
    └───────────────┬─────────────────────┘
                    │
             [ValidateInput]                 ← Pass State
                    │
    ┌───────────────▼─────────────────────┐
    │           ExtractParallel           │  ← Parallel State
    │ ┌──────┐ ┌──────┐ ┌──────┐ ┌────┐   │
    │ │Clin. │ │Prac. │ │Canon.│ │Det.│   │  ← Task States
    │ │ ETL  │ │ ETL  │ │ ETL  │ │ETL │   │
    │ └──────┘ └──────┘ └──────┘ └────┘   │
    │       ↺ 3 retries per branch        │
    └───────────────┬─────────────────────┘
                    │
             [AllSucceeded?]                 ← Choice State
                /         \
     [NotifySuccess]   [NotifyFailure]
            │                 │
            [
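To make the state flow concrete, here is a minimal sketch of how the visible part of this state machine might be written in Amazon States Language, built as a Python dictionary. It is not taken from the repository. The Glue job names (`pecos-<dataset>-etl`), the retry interval and backoff, the per-branch `status` marker, and the placeholder bodies of `NotifySuccess` and `NotifyFailure` are all assumptions made for illustration. The Choice state is named `AllSucceeded` here, without the question mark used as a label in the diagram.

```python
import json

# The four datasets named in the overview.
DATASETS = ["clinicians", "practices", "canonical_providers", "detail_tables"]


def etl_branch(dataset: str) -> dict:
    """Build one branch of the Parallel state for a single dataset."""
    run, ok, failed = f"{dataset}_etl", f"{dataset}_succeeded", f"{dataset}_failed"
    return {
        "StartAt": run,
        "States": {
            run: {
                "Type": "Task",
                # Step Functions service integration that starts a Glue job
                # and waits for it to finish.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                # Hypothetical job name; the real names are not in the text.
                "Parameters": {"JobName": f"pecos-{dataset}-etl"},
                # "3 retries per branch"; interval and backoff are assumed.
                "Retry": [{
                    "ErrorEquals": ["States.ALL"],
                    "IntervalSeconds": 30,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }],
                # Assumed: record the failure instead of aborting the whole
                # Parallel state, so the Choice state has something to check.
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": failed}],
                "Next": ok,
            },
            ok: {"Type": "Pass",
                 "Result": {"dataset": dataset, "status": "SUCCEEDED"},
                 "End": True},
            failed: {"Type": "Pass",
                     "Result": {"dataset": dataset, "status": "FAILED"},
                     "End": True},
        },
    }


definition = {
    "StartAt": "ValidateInput",
    "States": {
        "ValidateInput": {"Type": "Pass", "Next": "ExtractParallel"},
        "ExtractParallel": {
            "Type": "Parallel",
            "Branches": [etl_branch(d) for d in DATASETS],
            "Next": "AllSucceeded",
        },
        # A Parallel state outputs an array with one element per branch.
        "AllSucceeded": {
            "Type": "Choice",
            "Choices": [{
                "And": [
                    {"Variable": f"$[{i}].status", "StringEquals": "SUCCEEDED"}
                    for i in range(len(DATASETS))
                ],
                "Next": "NotifySuccess",
            }],
            "Default": "NotifyFailure",
        },
        # Placeholders: the source diagram is cut off after these states.
        "NotifySuccess": {"Type": "Pass", "End": True},
        "NotifyFailure": {"Type": "Pass", "End": True},
    },
}

if __name__ == "__main__":
    print(json.dumps(definition, indent=2))
```

State names must be unique across the whole state machine, including inside branches, which is why each branch prefixes its states with the dataset name. Catching errors inside each branch is one design option among several; without a `Catch`, a single failed branch would fail the entire Parallel state and the Choice state would never run.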



