
From Dirty CSV to Golden Records: A Python Walkthrough
Download a government CSV, load it into pandas, and you'll find "MEMORIAL HOSPITAL" listed twelve times across six states. Run drop_duplicates() and it finds zero exact copies. Try deduplicating on facility name alone and it merges hospitals that are genuinely different. Data cleaning and deduplication in Python require more than one-liners: they require a coordinated pipeline that profiles, cleans, and matches records in sequence.

This post walks through that full journey on 5,426 real CMS hospital records. We'll run three approaches (zero-config, explicit tuning, and LLM-assisted) and compare what each one catches, what it misses, and why. By the end, you'll have a repeatable pipeline for any dirty CSV.

The Dataset

The CMS Hospital General Information file is a public dataset from data.cms.gov listing every Medicare-certified hospital in the United States. We downloaded the April 2026 snapshot.

```python
import polars as pl

df = pl.read_csv("hospitals.csv")
print(df.shape)  # (5426, 38)
```

5,426 rows. 38 columns.
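To see concretely why exact deduplication comes up empty while name-only deduplication over-merges, here is a minimal pandas sketch. The rows and column names are hypothetical, invented for illustration, not taken from the CMS file:

```python
import pandas as pd

# Hypothetical rows: the first two are the same Texas hospital with
# case/whitespace noise; the third is a different hospital in Ohio.
df = pd.DataFrame({
    "facility_name": ["MEMORIAL HOSPITAL", "Memorial Hospital ", "MEMORIAL HOSPITAL"],
    "state": ["TX", "TX", "OH"],
})

# Exact deduplication finds nothing: no two rows are byte-identical.
print(len(df.drop_duplicates()))  # 3

# Deduplicating on the raw name alone would wrongly merge TX and OH.
# Normalizing the name and keeping state in the key separates them.
df["name_norm"] = df["facility_name"].str.strip().str.upper()
deduped = df.drop_duplicates(subset=["name_norm", "state"])
print(len(deduped))  # 2
```

The general pattern, normalize first, then deduplicate on a composite key, is the seed of the fuller pipeline the post builds out.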

