Kubernetes v1.34: Pod Replacement Policy for Jobs Goes GA
In Kubernetes v1.34, the Pod replacement policy feature has reached general availability (GA). This blog post describes the Pod replacement policy feature and how to use it in your Jobs.

About Pod Replacement Policy

By default, the Job controller immediately recreates Pods as soon as they fail or begin terminating (when they have a deletion timestamp). As a result, while some Pods are terminating, the total number of running Pods for a Job can temporarily exceed the specified parallelism. For Indexed Jobs, this can even mean multiple Pods running for the same index at the same time.

This behavior works fine for many workloads, but it can cause problems in certain cases. For example, popular machine learning frameworks like TensorFlow and JAX expect exactly one Pod per worker index. If two Pods run at the same time, you might encounter errors such as:

/job:worker/task:4: Duplicate task registration with task_name=/job:worker/replica:0/task:4

Additionally, starting replacement Pods befor
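
To make the default behavior described above more concrete, here is a minimal Python sketch of the replacement decision for an Indexed Job. It is a simplified model, not the Job controller's actual code: the `Pod` class and both helper functions are hypothetical names introduced for illustration. The two policy strings mirror the values of the Job spec's `podReplacementPolicy` field, `TerminatingOrFailed` (the default) and `Failed`.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    index: int                  # completion index of an Indexed Job
    phase: str                  # "Running", "Failed", or "Succeeded"
    terminating: bool = False   # True once a deletion timestamp is set


def indexes_to_replace(pods, policy="TerminatingOrFailed"):
    """Return the indexes this simplified model would recreate right now."""
    result = set()
    for pod in pods:
        if pod.phase == "Failed":
            # A fully failed Pod is replaced under either policy.
            result.add(pod.index)
        elif (pod.terminating and pod.phase == "Running"
              and policy == "TerminatingOrFailed"):
            # Default policy: replace as soon as termination begins.
            result.add(pod.index)
    return sorted(result)


def overlapping_indexes(pods, policy="TerminatingOrFailed"):
    """Indexes where a replacement would start while the old Pod still runs."""
    still_running = {p.index for p in pods
                     if p.phase == "Running" and p.terminating}
    return sorted(still_running & set(indexes_to_replace(pods, policy)))


pods = [
    Pod(index=3, phase="Running"),
    Pod(index=4, phase="Running", terminating=True),
    Pod(index=5, phase="Failed"),
]

print(indexes_to_replace(pods))                    # [4, 5]
print(overlapping_indexes(pods))                   # [4]
print(indexes_to_replace(pods, policy="Failed"))   # [5]
print(overlapping_indexes(pods, policy="Failed"))  # []
```

In this model, index 4 is the one at risk of a duplicate worker under the default policy, because its old Pod is still running when the replacement would be created. With the `Failed` policy, the model waits until that Pod has actually reached the failed state.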
Continue reading on Kubernetes Blog



