I Wrote 82 Regex Replacements to Parse 6,933 Time Format Variations From a Government Dataset
Note: This article is also available in Japanese.

The Setup

Japan's Ministry of Health publishes a list of ~10,000 pharmacies that dispense emergency contraception. I built a search tool for it.

The dataset has an hours field. Business hours. How bad could it be?

Mon-Fri:9:00-18:00,Sat:9:00-13:00

Split on ",", split on ":", parse the range. One regex. Done.

First version coverage: 89.4%. Over 10% of entries failed to parse. Here's why.

The Horror: Free-Text Entry With No Schema

There's no format specification. Each pharmacy across 47 prefectures types whatever they want. Here are real entries that all mean "Monday to Friday, 9:00 to 18:00":

月-金:9:00-18:00 ← clean
月~金:9:00~18:00 ← full-width everything
⽉-⾦:9:00-18:00 ← ...what?
月曜日~金曜日 9時~18時 ← kanji time notation
(月火水木金)9:00-18:00 ← parenthesis grouping
平日:9:00-18:00 ← "weekdays" in Japanese
月から金は9時から18時 ← literal prose

All the same meaning. My job: funnel all of these into a single canonical form.

The function that does this calls .repl
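To make the funnel idea concrete, here is a minimal sketch in Python. The language is an assumption, since the excerpt shows neither the article's code nor its language. The names `RULES`, `normalize`, and `parse_hours` are hypothetical, and the handful of rules below only covers the seven sample entries above; it is not the article's actual list of 82 replacements. It applies Unicode NFKC normalization first (which folds full-width characters and Kangxi radical look-alikes such as U+2F49 into their ordinary forms), then runs an ordered chain of regex replacements.

```python
import re
import unicodedata

# Hypothetical rules for illustration only; order matters.
RULES = [
    (r"[~〜]", "-"),                 # tilde / wave dash -> hyphen
    (r"から", "-"),                  # "from ... to" in prose -> hyphen
    (r"曜日?", ""),                  # 月曜日 or 月曜 -> 月
    (r"平日", "月-金"),              # "weekdays" -> Mon-Fri
    (r"\(月火水木金\)", "月-金"),    # parenthesised list of weekdays
    (r"(\d{1,2})時(?!\d)", r"\1:00"),  # 9時 -> 9:00 (whole hours only)
    # Separator between the day part and the first time: a space, は, or nothing.
    (r"(?<=[月火水木金土日])[\sは]*(?=\d)", ":"),
]


def normalize(raw):
    """Funnel one free-text hours entry toward the canonical form."""
    # NFKC folds full-width digits/punctuation and Kangxi radicals.
    text = unicodedata.normalize("NFKC", raw)
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text


def parse_hours(canonical):
    """Naive parser for 'days:open-close,days:open-close'."""
    result = {}
    for entry in canonical.split(","):
        # Split on the first colon only; the times contain colons too.
        days, _, hours = entry.partition(":")
        open_time, _, close_time = hours.partition("-")
        result[days] = (open_time, close_time)
    return result


samples = [
    "月-金:9:00-18:00",
    "月~金:9:00~18:00",             # full-width tilde and colon
    "\u2f49-\u2fa6:9:00-18:00",      # Kangxi radicals that look like 月 and 金
    "月曜日~金曜日 9時~18時",
    "(月火水木金)9:00-18:00",
    "平日:9:00-18:00",
    "月から金は9時から18時",
]
for sample in samples:
    print(normalize(sample))
```

Under these assumptions, every sample normalizes to the same string, which the naive split-based parser can then handle. Real data would need many more rules (minutes written with 分, closed days, multiple ranges per day), which is presumably how the count reaches 82.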
Continue reading on Dev.to Webdev




