
# Post-Mortem: Diagnosing and Fixing a Ceph Quorum Breakdown (Time Drift) in a Hybrid Proxmox VE Cluster
## Key Takeaways & Conclusions

- **Root Cause:** Clock drift between the physical Proxmox host and the nested/external virtual machines exceeded Ceph's strict 0.05 s tolerance.
- **Impact:** The time mismatch trapped the local Ceph Monitor (MON) in a continuous election loop (`electing`/`probing`), corrupting its local RocksDB database and triggering a cascading failure of the dependent OSD.0 daemon.
- **Resolution:** Enforced rigorous time synchronization via Chrony and the QEMU Guest Agent, then performed a destructive purge and rebuild of the corrupted MON daemon using the surviving quorum.
- **Verification:** Chaos-engineering tests (`rados bench`) confirmed that with a `min_size = 2` replication policy, the cluster survives a hard node power-off without dropping client I/O.

Short command sketches for each of these steps follow the architecture overview below.

## 1. The Architecture and The Incident

Building a Ceph cluster typically requires dedicated bare-metal servers. My home lab topology, however, is a hybrid 3-node setup designed to simulate a full cluster on limited hardware: a physical Proxmox VE host, with nested/external virtual machines serving as the remaining nodes.
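To make the takeaways concrete, the sketches below retrace the diagnose/fix/verify cycle. First, the drift itself: Ceph raises a `MON_CLOCK_SKEW` health warning once inter-monitor skew exceeds `mon_clock_drift_allowed` (0.05 s by default). A minimal check using standard Ceph and Chrony CLI tools (nothing here is specific to my nodes):

```bash
# Skew that Ceph itself measures between the monitors
ceph time-sync-status

# The configured tolerance (defaults to 0.05 seconds)
ceph config get mon mon_clock_drift_allowed

# Names the skewed monitor once the threshold is breached
ceph health detail

# Compare against what Chrony reports locally on each node
chronyc tracking
chronyc sources -v
```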
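Once the skew pushed the local monitor out of quorum, its stuck state was visible through the admin socket and the journal. A sketch of the checks I mean, where the monitor ID `pve1` is an illustrative placeholder for the affected node:

```bash
# Run on the affected node: a healthy monitor reports "leader" or
# "peon"; a stuck one cycles through "probing"/"electing"
ceph daemon mon.pve1 mon_status | grep '"state"'

# The monitor's journal shows the election loop and, in this
# incident, errors from its local RocksDB store
journalctl -u ceph-mon@pve1 --since "2 hours ago" | tail -n 50

# The dependent OSD fails in cascade once its monitor is unusable
systemctl status ceph-osd@0
journalctl -u ceph-osd@0 --since "2 hours ago" | tail -n 50
```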
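The fix had two halves: enforce time synchronization everywhere, then rebuild the broken monitor from the surviving quorum. A sketch assuming Proxmox VE's `pveceph` tooling and Chrony; the monitor ID and paths are illustrative, and the destructive steps are only safe because the other two monitors still hold quorum:

```bash
# 1) Enforce time sync on every node (physical and virtual).
#    "makestep 1.0 3" lets Chrony step large offsets immediately
#    instead of slewing them slowly.
echo "makestep 1.0 3" >> /etc/chrony/chrony.conf
systemctl restart chrony   # unit is "chronyd" on RHEL-family systems
chronyc makestep            # force an immediate correction now

# Inside each VM, run the QEMU Guest Agent so host and guest clocks
# stay coordinated across pause/resume and migration events.
systemctl enable --now qemu-guest-agent

# 2) Destructively purge and recreate the corrupted monitor.
pveceph mon destroy pve1
rm -rf /var/lib/ceph/mon/ceph-pve1   # clear any leftover corrupted RocksDB store
pveceph mon create
```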
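Finally, verification. With `min_size = 2` (and the usual `size = 3`, an assumption on my part), a pool keeps accepting writes as long as two replicas remain, so a hard power-off of one node should not interrupt client I/O. A minimal chaos test along these lines, with `testpool` as a placeholder pool name:

```bash
# Confirm the replication policy on the test pool
ceph osd pool get testpool size
ceph osd pool get testpool min_size

# Generate sustained client I/O: 60 s of 4 MiB object writes
rados bench -p testpool 60 write --no-cleanup &

# While the benchmark runs, hard power-off one node and watch the
# cluster degrade but keep serving I/O
ceph -w

# Remove the benchmark objects afterwards
rados -p testpool cleanup
```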




