
I Built a Dataset Version Control Tool — and Accidentally Reimplemented Git's Core
A few months ago I started a side project mostly to get hands-on experience with three things I hadn't used seriously before: Docker, SQLite, and Python CLI tools. The plan was to build something small for educational purposes, but the project turned into DataTracker, a local version control system for data files.

This article is about the architecture, specifically the part that surprised me most: once I started designing how to actually store versioned files, I kept arriving at the same solutions git already uses. Not because I copied them, but because they're probably the best answers to the problem.

The Problem

The use case is simple. You have a CSV, a set of images, or any data file really. You run some processing, and the file changes. Later you want to know what it looked like before, compare the two versions, and so on. You want this without manually copying files into data_v1/, data_v2/, data_final/, data_final_REAL/.

Git solves this for source code. It does not solve it well for bi
Continue reading on Dev.to Python




