Kubernetes v1.34: Pods Report DRA Resource Health

via Kubernetes Blog

The rise of AI/ML and other high-performance workloads has made specialized hardware like GPUs, TPUs, and FPGAs a critical component of many Kubernetes clusters. However, as discussed in a previous blog post about navigating failures in Pods with devices, when this hardware fails, it can be difficult to diagnose, leading to significant downtime. With the release of Kubernetes v1.34, we are excited to announce a new alpha feature that brings much-needed visibility into the health of these devices.

This work extends the functionality of KEP-4680, which first introduced a mechanism for reporting the health of devices managed by Device Plugins. Now, this capability is being extended to Dynamic Resource Allocation (DRA). Controlled by the ResourceHealthStatus feature gate, this enhancement allows DRA drivers to report device health directly into a Pod's .status field, providing crucial insights for operators and developers.

Why expose device health in Pod status?

For stateful application
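As a rough illustration of how an operator might consume device health from a Pod's .status, the sketch below scans a Pod object (a plain dictionary, for example parsed from `kubectl get pod -o json`) for devices that are not reported as healthy. The field names `allocatedResourcesStatus`, `resources`, `resourceID`, and `health` are assumed from the KEP-4680 API and are not spelled out in the excerpt above. The sample Pod, container name, claim name, and device ID are hypothetical.

```python
from typing import Any


def unhealthy_devices(pod: dict[str, Any]) -> list[tuple[str, str, str, str]]:
    """Return (container, resource name, resource ID, health) for each
    device whose reported health is anything other than "Healthy".

    Assumes the KEP-4680 layout:
    status.containerStatuses[].allocatedResourcesStatus[].resources[].
    """
    findings = []
    for container in pod.get("status", {}).get("containerStatuses", []):
        for resource in container.get("allocatedResourcesStatus", []):
            for device in resource.get("resources", []):
                # A missing health value is treated as "Unknown" here.
                health = device.get("health", "Unknown")
                if health != "Healthy":
                    findings.append((
                        container.get("name", ""),
                        resource.get("name", ""),
                        device.get("resourceID", ""),
                        health,
                    ))
    return findings


# Hypothetical Pod status; all names and IDs are made up for illustration.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "trainer",
                "allocatedResourcesStatus": [
                    {
                        "name": "claim:gpu-claim",
                        "resources": [
                            {"resourceID": "example.com/gpu-0", "health": "Healthy"},
                            {"resourceID": "example.com/gpu-1", "health": "Unhealthy"},
                        ],
                    }
                ],
            }
        ]
    }
}

for finding in unhealthy_devices(sample_pod):
    print(finding)
```

Because the feature is alpha and gated behind ResourceHealthStatus, the status entries may be absent entirely; the sketch treats missing fields as empty rather than raising an error.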
