OCPBUGS-73824

Machine Config Daemon liveness probe failure when a new MCP is deployed, causing an inconsistent state.


      Description of problem:

      After updating to 4.17.41, the customer switched the cgroup version using a method similar to the one described in the documentation section "8.6. Configuring the Linux cgroup version on your nodes".
      During the kargs update, we noticed that the Machine Config Daemon (MCD) was restarted, with two different behaviours:

      1. MCD Race Condition

      MCD Race condition
      [vlours@supportshell-1 ~]$ grep -E "machine-config-daemon\[|kubelet.service|Container machine-config-daemon failed|reboot" 04319737/0050-sosreport-caasntfaac02s2-622cw-master-2-04319737-2026-01-14-iiuyktd.tar.xz/sosreport-caasntfaac02s2-622cw-master-2-04319737-2026-01-14-iiuyktd/sos_commands/logs/journalctl_--no-pager_--boot_-1
      
      [...]
      
      Jan 14 12:13:20 caasntfaac02s2-622cw-master-2 root[843234]: machine-config-daemon[842644]: "drain is already completed on this node"
      Jan 14 12:13:35 caasntfaac02s2-622cw-master-2 root[845208]: machine-config-daemon[845092]: "Starting to manage node: caasntfaac02s2-622cw-master-2"
      Jan 14 12:13:35 caasntfaac02s2-622cw-master-2 root[845245]: machine-config-daemon[842644]: "Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=0 --delete=systemd.legacy_systemd_cgroup_controller=1 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1=\"all\" --append=psi=0]"
      Jan 14 12:13:45 caasntfaac02s2-622cw-master-2 root[845621]: machine-config-daemon[845092]: "Validated on-disk state"
      Jan 14 12:13:45 caasntfaac02s2-622cw-master-2 root[845622]: machine-config-daemon[845092]: "Starting update from rendered-master-0bca4a35d33eec735db42704ad626ba8 to rendered-master-6718cba8befccaff7d9a4f40df62766b: &{osUpdate:false kargs:true fips:false passwd:false files:false units:false kernelType:false extensions:false}"
      Jan 14 12:13:45 caasntfaac02s2-622cw-master-2 root[845623]: machine-config-daemon[845092]: "drain is already completed on this node"
      Jan 14 12:14:00 caasntfaac02s2-622cw-master-2 root[847645]: machine-config-daemon[847507]: "Starting to manage node: caasntfaac02s2-622cw-master-2"
      Jan 14 12:14:00 caasntfaac02s2-622cw-master-2 root[847678]: machine-config-daemon[845092]: "Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=0 --delete=systemd.legacy_systemd_cgroup_controller=1 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1=\"all\" --append=psi=0]"
      Jan 14 12:14:02 caasntfaac02s2-622cw-master-2 systemd[1]: kubelet.service: Deactivated successfully.
      Jan 14 12:14:02 caasntfaac02s2-622cw-master-2 systemd[1]: kubelet.service: Consumed 6h 21min 26.172s CPU time.
      Jan 14 12:14:06 caasntfaac02s2-622cw-master-2 root[847923]: machine-config-daemon[847507]: "Validated on-disk state"
      Jan 14 12:14:06 caasntfaac02s2-622cw-master-2 root[847924]: machine-config-daemon[847507]: "Starting update from rendered-master-0bca4a35d33eec735db42704ad626ba8 to rendered-master-6718cba8befccaff7d9a4f40df62766b: &{osUpdate:false kargs:true fips:false passwd:false files:false units:false kernelType:false extensions:false}"
      Jan 14 12:14:06 caasntfaac02s2-622cw-master-2 root[847925]: machine-config-daemon[847507]: "drain is already completed on this node"
      Jan 14 12:14:21 caasntfaac02s2-622cw-master-2 root[849744]: machine-config-daemon[847507]: "Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=0 --delete=systemd.legacy_systemd_cgroup_controller=1 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1=\"all\" --append=psi=0]"
      Jan 14 12:14:24 caasntfaac02s2-622cw-master-2 root[849824]: machine-config-daemon[847507]: "Performing post config change action: Reboot for config rendered-master-6718cba8befccaff7d9a4f40df62766b"
      Jan 14 12:14:24 caasntfaac02s2-622cw-master-2 root[849825]: machine-config-daemon[847507]: "Rebooting node"
      Jan 14 12:14:24 caasntfaac02s2-622cw-master-2 root[849826]: machine-config-daemon[847507]: "initiating reboot: Node will reboot into config rendered-master-6718cba8befccaff7d9a4f40df62766b"
      Jan 14 12:14:24 caasntfaac02s2-622cw-master-2 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-master-6718cba8befccaff7d9a4f40df62766b.
      Jan 14 12:14:24 caasntfaac02s2-622cw-master-2 root[849829]: machine-config-daemon[847507]: "reboot successful"
      Jan 14 12:14:24 caasntfaac02s2-622cw-master-2 systemd-logind[1164]: The system will reboot now!
      Jan 14 12:14:24 caasntfaac02s2-622cw-master-2 systemd-logind[1164]: System is rebooting.
      Jan 14 12:14:24 caasntfaac02s2-622cw-master-2 systemd[1]: machine-config-daemon-reboot.service: Deactivated successfully.
      Jan 14 12:14:24 caasntfaac02s2-622cw-master-2 systemd[1]: Stopped machine-config-daemon: Node will reboot into config rendered-master-6718cba8befccaff7d9a4f40df62766b.
      

      In this case, a new MCD container was starting on the node while the old one was still applying the rpm-ostree changes (see MCD process IDs *845092* and *847507* as examples).
      We had to stop kubelet.service for the MCD to keep running and finalize the update.
      In this scenario, the command triggering the issue was Running rpm-ostree [kargs ...
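
      The race above can be spotted directly in the journal by counting the distinct MCD process IDs that log interleaved update messages. A minimal diagnostic sketch, using abbreviated sample lines from the excerpt above (the file path and sample contents are illustrative):

```shell
# Write an abbreviated journal excerpt (sample lines from the report).
cat <<'EOF' > /tmp/mcd-journal.txt
Jan 14 12:13:35 node root[845208]: machine-config-daemon[845092]: "Starting to manage node"
Jan 14 12:13:35 node root[845245]: machine-config-daemon[842644]: "Running rpm-ostree [kargs ...]"
Jan 14 12:14:00 node root[847645]: machine-config-daemon[847507]: "Starting to manage node"
Jan 14 12:14:00 node root[847678]: machine-config-daemon[845092]: "Running rpm-ostree [kargs ...]"
EOF

# More than one distinct PID logging during the same update window
# indicates that a new MCD container started while the old one was
# still applying changes.
grep -o 'machine-config-daemon\[[0-9]*\]' /tmp/mcd-journal.txt | sort -u
```

      Against a live node, the same grep can be run over the full `journalctl --no-pager --boot` output, as in the sosreport commands above.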

      2. MCD Liveness failure

      MCD Liveness failure
      [vlours@supportshell-1 ~]$ grep -E "Container machine-config-daemon failed liveness probe, will be restarted" 04319737/0070-before-fix-sosreport-caasntfaac02s2-622cw-ms-worker-t9bx8-04319737-2026-01-15-elaltrv.tar.xz/sosreport-caasntfaac02s2-622cw-ms-worker-t9bx8-04319737-2026-01-15-elaltrv/sos_commands/logs/journalctl_--no-pager_--boot
      Jan 15 12:57:13 caasntfaac02s2-622cw-ms-worker-t9bx8 kubenswrapper[3304]: I0115 12:57:13.135447    3304 kuberuntime_manager.go:1025] "Message for Container of pod" containerName="machine-config-daemon" containerStatusID={"Type":"cri-o","ID":"d3a987f37967658bddced56d59a24c822c5fa39b0f932af47e5f37d90cb34174"} pod="openshift-machine-config-operator/machine-config-daemon-xsvb8" containerMessage="Container machine-config-daemon failed liveness probe, will be restarted"
      Jan 15 13:00:13 caasntfaac02s2-622cw-ms-worker-t9bx8 kubenswrapper[3304]: I0115 13:00:13.134650    3304 kuberuntime_manager.go:1025] "Message for Container of pod" containerName="machine-config-daemon" containerStatusID={"Type":"cri-o","ID":"131014e4defa319e7d09b2245ecfe4aac725c8b692681132148475d44296b3fe"} pod="openshift-machine-config-operator/machine-config-daemon-xsvb8" containerMessage="Container machine-config-daemon failed liveness probe, will be restarted"
      Jan 15 13:03:13 caasntfaac02s2-622cw-ms-worker-t9bx8 kubenswrapper[3304]: I0115 13:03:13.134099    3304 kuberuntime_manager.go:1025] "Message for Container of pod" containerName="machine-config-daemon" containerStatusID={"Type":"cri-o","ID":"d27ec6bc698e8a7a606e2303d1ef679e85a3a687f3d961465338340e1fbc58e5"} pod="openshift-machine-config-operator/machine-config-daemon-xsvb8" containerMessage="Container machine-config-daemon failed liveness probe, will be restarted"
      
      Remediation: stopping kubelet
      [vlours@supportshell-1 ~]$ grep -E "reboot|Container machine-config-daemon failed liveness probe|kubelet.service" 04319737/0080-after-fix-sosreport-caasntfaac02s2-622cw-ms-worker-t9bx8-04319737-2026-01-15-uqeafcx.tar.xz/sosreport-caasntfaac02s2-622cw-ms-worker-t9bx8-04319737-2026-01-15-uqeafcx/sos_commands/logs/journalctl_--no-pager_--boot_-1
      
      [...]
      
      Jan 15 13:29:43 caasntfaac02s2-622cw-ms-worker-t9bx8 kubenswrapper[3304]: I0115 13:29:43.135986    3304 kuberuntime_manager.go:1025] "Message for Container of pod" containerName="machine-config-daemon" containerStatusID={"Type":"cri-o","ID":"fcc35f7569c67dfb3bb22b105183b408952ea7237591ba02a0024774a320f173"} pod="openshift-machine-config-operator/machine-config-daemon-xsvb8" containerMessage="Container machine-config-daemon failed liveness probe, will be restarted"
      Jan 15 13:35:12 caasntfaac02s2-622cw-ms-worker-t9bx8 systemd[1]: kubelet.service: Deactivated successfully.
      Jan 15 13:35:12 caasntfaac02s2-622cw-ms-worker-t9bx8 systemd[1]: kubelet.service: Consumed 1h 28min 57.089s CPU time.
      Jan 15 13:37:46 caasntfaac02s2-622cw-ms-worker-t9bx8 root[3222868]: machine-config-daemon[3218343]: "initiating reboot: Node will reboot into config rendered-worker-586f3d0ccf00ca3fe223a1d378905089"
      Jan 15 13:37:46 caasntfaac02s2-622cw-ms-worker-t9bx8 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-worker-586f3d0ccf00ca3fe223a1d378905089.
      Jan 15 13:37:46 caasntfaac02s2-622cw-ms-worker-t9bx8 root[3222871]: machine-config-daemon[3218343]: "reboot successful"
      Jan 15 13:37:46 caasntfaac02s2-622cw-ms-worker-t9bx8 systemd-logind[1324]: The system will reboot now!
      Jan 15 13:37:46 caasntfaac02s2-622cw-ms-worker-t9bx8 systemd-logind[1324]: System is rebooting.
      Jan 15 13:37:46 caasntfaac02s2-622cw-ms-worker-t9bx8 systemd[1]: machine-config-daemon-reboot.service: Deactivated successfully.
      Jan 15 13:37:46 caasntfaac02s2-622cw-ms-worker-t9bx8 systemd[1]: Stopped machine-config-daemon: Node will reboot into config rendered-worker-586f3d0ccf00ca3fe223a1d378905089.
      

      In this scenario, the issue was related to the image in the file */etc/kubernetes/manifests/coredns.yaml*, which had already been updated to the 4.18 image while the rest of the configuration was still on 4.17.

      2026-01-15T13:02:18.141925704+11:00 E0115 13:02:18.141897 3144983 writer.go:231] Marking Degraded due to: "unexpected on-disk state validating against rendered-worker-[....]: content mismatch for file \"/etc/kubernetes/manifests/coredns.yaml\""
      2026-01-15T13:03:13.140601142+11:00 I0115 13:03:13.140483 3144983 daemon.go:1400] Shutting down MachineConfigDaemon
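
      The "content mismatch" above can be checked by hand: rendered MachineConfigs store file contents as URL-encoded data: URLs, so the expected coredns.yaml can be decoded and diffed against the on-disk file. A minimal sketch of the decoding step, using a tiny illustrative payload in place of the real `oc get mc rendered-worker-... -o json` output:

```shell
# Illustrative data: URL payload, standing in for the
# .spec.config.storage.files[].contents.source field of the
# rendered MachineConfig (real payloads may also be base64-encoded).
source='data:,apiVersion%3A%20v1%0Akind%3A%20Pod%0A'

# Strip the "data:," prefix and URL-decode the remainder into the
# expected file contents.
printf '%s' "${source#data:,}" \
  | python3 -c 'import sys,urllib.parse; sys.stdout.write(urllib.parse.unquote(sys.stdin.read()))' \
  > /tmp/coredns.expected.yaml

cat /tmp/coredns.expected.yaml
# On the node, the final check would be:
#   diff /tmp/coredns.expected.yaml /etc/kubernetes/manifests/coredns.yaml
```

      Any diff output would match the mismatch the MCD reported when marking the node Degraded.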
      

      Version-Release number of selected component (if applicable):

      The first scenario occurred after the customer had successfully completed their update from 4.16 to 4.17.41, while they were transitioning the cgroup from v1 to v2.
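
      For reference, the documented way to trigger that cgroup transition (section 8.6) is to set cgroupMode on the cluster-scoped Node config object; the MCO then rolls out the systemd.unified_cgroup_hierarchy kargs seen in the journal above. A sketch of the object, assuming the standard nodes.config.openshift.io API:

```yaml
apiVersion: config.openshift.io/v1
kind: Node
metadata:
  name: cluster
spec:
  cgroupMode: "v2"   # "v1" switches back; either change makes the MCO reboot nodes with new kargs
```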

      The second scenario occurred on ALL compute nodes when updating from 4.17.41 to 4.18.26.

      How reproducible:

      We tested the 4.17.41 case by applying the cgroup kargs, but we have not been able to reproduce the issue on a standard AWS cluster.
      The customer has multiple Operators installed, and this is a vSphere IPI installation.

      Steps to Reproduce:

      1. Unknown

      Actual results:

      The MCD runs without any issue until the MCO triggers the new MCP deployment on the node. From there, the container keeps restarting.
      For the MCP deployment to continue on the node, kubelet.service has to be stopped, which strongly suggests that the liveness probe is involved in both cases.

      Expected results:

      Smooth MCP deployment.

      Additional info:

      The KCS article 7136427 has been created to describe the issue.

        team-mco (Team MCO)
        rhn-support-vlours (Vincent Lours)
        Sergio Regidor de la Rosa
        Votes: 1
        Watchers: 8