
Ordered Retries in Kafka: The Bugs You'll Find in Production
In Part 1, we introduced a "middle ground" for Kafka retries: when a message fails, lock its key, send the message to a retry topic, and let the main partition continue processing other keys. If you need a recap, the Confluent blog covers the pattern. Simple on a whiteboard. In production, things break in ways the diagrams don't show.

This article covers implementation gaps that tutorials skip. The code examples are from kafka-resilience — simplified for clarity, but the edge cases are real.

The Architecture Recap

The goal: when a message fails, block only that key. Other keys keep flowing. To make this work across multiple consumer instances, we need three topics:

Main Topic — Your business events. The consumer checks if a key is locked before processing.

Retry Topic — Messages land here either because they failed or because a predecessor with the same key failed. A separate consumer processes them with backoff. When a message succeeds here, it releases the lock.

Lock Topic (compacted)
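The routing logic above can be sketched in a few lines. This is a minimal in-memory illustration, not the kafka-resilience API: all names (`KeyLockRouter`, `handle_main`, `send_to_retry`) are hypothetical, and the real implementation coordinates locks across instances via the lock topic rather than a local set.

```python
# Sketch of per-key lock routing, assuming a single consumer instance.
# process(key, value) is the business handler and raises on failure;
# send_to_retry stands in for producing to the retry topic.

class KeyLockRouter:
    """Divert messages for locked keys to the retry topic, preserving per-key order."""

    def __init__(self, process, send_to_retry):
        self._process = process
        self._send_to_retry = send_to_retry
        self._locked = set()  # keys currently blocked by an earlier failure

    def handle_main(self, key, value):
        if key in self._locked:
            # A predecessor with this key failed: send this message to the
            # retry topic unprocessed so per-key ordering survives.
            self._send_to_retry(key, value)
            return "diverted"
        try:
            self._process(key, value)
            return "processed"
        except Exception:
            # Lock the key and divert the failed message itself.
            self._locked.add(key)
            self._send_to_retry(key, value)
            return "failed"

    def handle_retry_success(self, key):
        # The retry consumer succeeded for this key: release the lock.
        self._locked.discard(key)
```

Note that other keys keep flowing while one key is locked; only the failed key's successors are diverted, which is the whole point of the pattern.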
Continue reading on Dev.to


