Feature Request
Resolution: Unresolved
Product / Portfolio Work
1. Proposed title of this feature request
Modernize Machine Config Daemon health monitoring using Startup/Readiness/Liveness probes
2. What is the nature and description of the request?
The Machine Config Daemon (MCD) currently experiences race conditions during cluster upgrades and MachineConfigPool (MCP) updates (specifically during rpm-ostree operations and kargs application).
Investigation into OCPBUGS-63699 and OCPBUGS-73824 reveals that the legacy liveness probe implementation is too aggressive and lacks awareness of the MCD’s internal state. When the MCD is performing long-running disk operations (e.g., updating /etc/kubernetes/manifests/ or executing a reboot pivot), the following occurs:
- The MCD process is busy or blocked by rpm-ostree execution.
- The liveness probe fails, causing the Kubelet to kill and restart the MCD container mid-operation.
- Resulting Inconsistency: The new MCD instance starts up, detects that the on-disk file state does not match the "current" rendered config (because the previous instance was killed before completion), and reports a "content mismatch" error, moving the node to a Degraded state.
Currently, the probe serves no traffic-routing purpose (the MCD exposes no service API). It also fails its secondary purpose of restarting an unhealthy process: it gives a false sense of health, yet it can still time out during critical operations.
Technical Issues Identified:
- False Positives: The current probe returns {{200 OK}} indefinitely after the first successful syncNode run. It does not monitor if informers or listers have hung in subsequent cycles.
- Race Conditions & Inconsistency: During rpm-ostree rebases or kernel-argument updates that exceed 210 seconds, the Kubelet kills the MCD. Because the MCD has already popped the update event from its queue, the restart leads to an incomplete on-disk state, "content mismatch" errors, and a Degraded node state.
- Implementation: The probe was originally introduced to "fix" hung reflectors, but the implementation only checks the first sync. It provides zero protection against further failures.
The request is to:
- Immediate Term: Remove the existing liveness probe until a state-aware version is implemented to prevent mid-update container kills.
- Long Term: Implement a proper probe (or probes) to handle the node sync, considering the stateful nature of some operations:
  - Monitor the health of Kubernetes informers/watchers.
  - Suppress "Unhealthy" signals during active updates. If an rpm-ostree operation is in progress, the probe must either return success or be ignored to prevent the Kubelet from interrupting a "point-of-no-return" file write.
  - Ensure the MCD lifecycle is decoupled from Kubelet restarts during critical "point-of-no-return" file writes.
3. Why does the customer need this? (List the business requirements here)
- Upgrade Reliability: Prevents clusters from becoming "stuck" in a Degraded state during standard maintenance windows (observed in 4.17 → 4.18 paths).
- Elimination of Manual Remediation: Currently, customers must manually stop kubelet.service, delete currentconfig, or copy decoded files back to nodes to resolve the mismatch. In clusters with 30+ nodes and 4-hour maintenance windows, this manual overhead is unacceptable.
- Stability for Configuration Changes: Customers applying standard Day 2 changes (like switching cgroup v1 to v2) are seeing MCD restart loops, which risks node instability and unexpected reboots.
4. List any affected packages or components.
- machine-config-operator: (Manifests for the MCD DaemonSet and probe definitions).
- machine-config-daemon: (Internal daemon.go logic and HTTP health handlers).
- openshift-hyperv-node / coreos: (Interaction with rpm-ostree during kargs/osUpdates).
- is related to: OCPBUGS-73824 Machine Config Daemon liveness probe failure when new MCP is deployed causing inconsistent state. (Closed)