
# I Deduplicated 100K Records in 12 Seconds With One Command
My CSV had duplicates. A lot of them. "John Smith" and "Jon Smith" were the same person. So were "john.smith@gmail.com" and "jsmith@gmail.com". And "(555) 012-3456" and "5550123456". I didn't want to write 60 lines of Python to find them. So I built a tool that does it in one command.

```shell
pip install goldenmatch
goldenmatch dedupe customers.csv
```

That's it. No config file. No training data. No manual labeling. GoldenMatch reads your CSV, figures out which columns are names, emails, phones, and addresses, picks the best matching algorithm for each, and clusters the duplicates. On 100,000 records, it finishes in 12.78 seconds.

## What Just Happened?

When you run `goldenmatch dedupe`, here's the pipeline:

```
Read File → Auto-Detect Columns → Pick Scorers → Block → Score → Cluster → Golden Records
```

Auto-detection looks at column names and data patterns. A column called "email" with values containing `@` gets routed to exact + Levenshtein matching. A column called "name" gets Jaro-Winkler + token-sort matching.
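The auto-detection step can be sketched roughly like this. This is my own minimal illustration, not GoldenMatch's actual code: the `detect_column_type` helper, its regexes, and the `SCORERS` table are all hypothetical stand-ins for whatever heuristics the tool really uses.

```python
import re

# Hypothetical heuristics: check the column name first,
# then fall back to patterns in the sampled values.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_RE = re.compile(r"^[\d\s()+.-]{7,}$")

def detect_column_type(name: str, samples: list[str]) -> str:
    name = name.lower()
    if "email" in name or "mail" in name:
        return "email"
    if "phone" in name or "tel" in name:
        return "phone"
    if "name" in name:
        return "name"
    # Header was uninformative: look at the data itself.
    cleaned = [s.strip() for s in samples if s.strip()]
    if cleaned and all(EMAIL_RE.match(s) for s in cleaned):
        return "email"
    if cleaned and all(PHONE_RE.match(s) for s in cleaned):
        return "phone"
    return "text"

# Each detected type maps to a scorer combination,
# mirroring the routing described above.
SCORERS = {
    "email": ["exact", "levenshtein"],
    "name": ["jaro_winkler", "token_sort"],
    "phone": ["digits_only_exact"],
    "text": ["levenshtein"],
}
```

So a column literally named "Email" is routed immediately, while an anonymous column full of `@`-containing values like "contact" still lands on the email scorers via the pattern fallback.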
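The score-and-cluster end of the pipeline can be sketched with the standard library alone. Here `difflib.SequenceMatcher` is a stand-in for the per-column scorers (Jaro-Winkler, Levenshtein, etc.), and a small union-find merges matching pairs into duplicate clusters; again, this is my own illustration under those assumptions, not the library's implementation, and it skips the blocking step that keeps 100K records tractable.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    # Stand-in scorer; the real tool would pick a scorer
    # per detected column type instead.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster(records: list[str], threshold: float = 0.85) -> list[list[str]]:
    # Union-find over record indices: each matching pair
    # merges two clusters.
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    # Group records by their cluster root.
    groups: dict[int, list[str]] = {}
    for i, rec in enumerate(records):
        groups.setdefault(find(i), []).append(rec)
    return list(groups.values())

names = ["John Smith", "Jon Smith", "Alice Jones"]
print(cluster(names))
# → [['John Smith', 'Jon Smith'], ['Alice Jones']]
```

The all-pairs loop is O(n²), which is exactly why the real pipeline blocks records first: you only score pairs that share a blocking key, which is how 100K records can finish in seconds.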



