
What Broke When We Pushed WebSockets From 100k to 1M Users
A post-mortem on OOM kills, GC pauses, and the slow consumers that ate our RAM.

We thought we had a leak. Turns out, we just didn’t know how to turn off the tap.

At 100k users, everything looked perfect. The dashboards were green, latency was flat, and we felt like geniuses. At 1M users, the exact same architecture started killing nodes like clockwork.

We were building a live commentary platform for a massive sports event. The premise was simple: ingest scores, push them to the browser. We tested it. We load-tested it. We thought we were ready.

Then the finals started. The user count ticked past 300k, and latency jittered. By 600k, the alerts weren’t just pinging; they were screaming. By 800k, our nodes turned into zombies — connected, technically “alive,” but totally unresponsive — before being abruptly shot in the head by the Linux Out-of-Memory (OOM) killer.
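The failure mode described here, slow consumers silently accumulating unbounded outbound buffers until the OOM killer intervenes, has a standard mitigation: cap each connection's send queue and evict clients that can't keep up, rather than buffering on their behalf. A minimal sketch of that idea, with all names (`MAX_QUEUE`, `Client`, `Broadcaster`) hypothetical and no real WebSocket library involved:

```python
import asyncio

MAX_QUEUE = 8  # hypothetical per-client cap on pending messages


class Client:
    """Stand-in for one WebSocket connection."""

    def __init__(self, name: str):
        self.name = name
        # Bounded queue: memory per client is now a constant, not a function
        # of how slowly the client reads.
        self.queue: asyncio.Queue = asyncio.Queue(maxsize=MAX_QUEUE)
        self.alive = True


class Broadcaster:
    """Fan out messages without ever blocking on a slow reader."""

    def __init__(self):
        self.clients: list[Client] = []

    def publish(self, msg) -> None:
        for c in self.clients:
            if not c.alive:
                continue
            try:
                # Never await here: the hot path must not stall on one client.
                c.queue.put_nowait(msg)
            except asyncio.QueueFull:
                # Slow consumer: disconnect it instead of buffering forever.
                c.alive = False
        self.clients = [c for c in self.clients if c.alive]
```

The key design choice is that `publish` is non-blocking and the queue bound turns "RAM grows with the slowest reader" into "the slowest reader gets dropped," which is the tap this post-mortem says the team didn't know how to turn off.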
Continue reading on Dev.to




