I Ripped Out Docker Compose from Our ML Platform and Put Everything on EKS. Here's What Actually Happened.

via Dev.to, by Ramchandra Reddy

I'll be honest: I resisted this for longer than I should have. Our ML pipeline on Docker Compose was working. Not perfectly, but it was working. I knew where everything lived. I could debug it. The data science team understood it. And every time someone suggested moving to Kubernetes, I'd think "that's a lot of complexity for a problem we don't have yet."

Then we had the problem. Three data scientists started running concurrent training jobs. One job consumed all GPU memory and the other two silently failed with zero useful error messages. Our serving container kept getting OOMKilled under load, and nobody knew why because there was no proper metrics collection. We had a Friday afternoon incident where a model that had been in production for 4 months started returning garbage predictions. It turned out the feature distribution had shifted weeks earlier and we had no monitoring to catch it. We only found out when the product team noticed the fraud catch rate had dropped. That was the mome

Continue reading on Dev.to
