
How to Build Workflows That Never Lose Progress
The Half-Deployed Model

Imagine you're running an ML platform. A weekly cron job fires at 3 AM to retrain a customer's model. The pipeline has five steps:

1. Generate training data from BigQuery
2. Train the model on a Kubernetes cluster
3. Push the model artifact to a registry
4. Create a scoring configuration in the scoring service database
5. Authorize the model for the customer's traffic

Steps 1 through 3 take about two hours and cost real money: compute time, BigQuery slots, container images.

At 5:02 AM, step 3 completes. The model is trained and pushed. Step 4 calls the scoring service to create the config. The scoring service is in the middle of a routine database migration. Connection refused.

Now you have a problem. The model is sitting in the artifact registry, trained and ready. But it can't serve traffic because there's no scoring config. The pipeline marks the whole run as "FAILED."

What happens next depends on how you built the system.

If you start over: The 6 AM retry re-runs from ste
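To make the failure scenario above concrete, here is a minimal sketch of one way a pipeline could record each finished step and skip it on retry. It is an illustration under stated assumptions, not necessarily the approach the full article takes: the step names, the JSON checkpoint file, and the helpers `run_pipeline`, `make_steps`, and `demo` are all hypothetical, and the steps are stand-ins that do no real work.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical step names mirroring the five pipeline steps described above.
STEP_NAMES = [
    "generate_training_data",
    "train_model",
    "push_artifact",
    "create_scoring_config",
    "authorize_model",
]


def load_completed(checkpoint: Path) -> list:
    """Return the names of steps recorded as finished, or [] on a first run."""
    if checkpoint.exists():
        return json.loads(checkpoint.read_text())
    return []


def run_pipeline(steps, checkpoint: Path) -> list:
    """Run steps in order, skipping any already recorded in the checkpoint.

    Returns the names of the steps actually executed during this call.
    """
    completed = load_completed(checkpoint)
    executed = []
    for name, step in steps:
        if name in completed:
            continue  # finished in an earlier run; do not redo the work
        step()  # an exception here leaves earlier checkpoints intact
        completed.append(name)
        checkpoint.write_text(json.dumps(completed))
        executed.append(name)
    return executed


def make_steps(scoring_service_up: bool):
    """Build stand-in steps; only step 4 depends on the scoring service."""
    def noop():
        pass

    def create_scoring_config():
        if not scoring_service_up:
            raise ConnectionError("connection refused")

    return [
        (name, create_scoring_config if name == "create_scoring_config" else noop)
        for name in STEP_NAMES
    ]


def demo():
    """Simulate the failed run followed by a retry; return both results."""
    with tempfile.TemporaryDirectory() as tmp:
        checkpoint = Path(tmp) / "run_state.json"
        try:
            run_pipeline(make_steps(scoring_service_up=False), checkpoint)
        except ConnectionError:
            pass  # step 4 failed; steps 1 through 3 remain checkpointed
        after_failure = load_completed(checkpoint)
        retried = run_pipeline(make_steps(scoring_service_up=True), checkpoint)
    return after_failure, retried
```

Calling `demo()` returns the three steps checkpointed before the failure, followed by the two steps the retry actually ran. In a real system the checkpoint would live in durable storage, be scoped to a single run, and each step would need to be safe to re-run, since a crash between finishing a step and recording it repeats that step.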
Continue reading on Dev.to DevOps




