OCPBUGS-73824

Machine Config Daemon liveness probe failure when a new MCP is deployed, causing an inconsistent state.


      Description of problem:

      After updating to 4.17.41, the customer switched the cgroup version using a method similar to the one described in the documentation section "8.6. Configuring the Linux cgroup version on your nodes".
      During the kargs update, we noticed that the Machine Config Daemon (MCD) was restarted, with two different behaviours:

      1. MCD Race Condition

      MCD Race condition
      [vlours@supportshell-1 ~]$ grep -E "machine-config-daemon\[|kubelet.service|Container machine-config-daemon failed|reboot" 04319737/0050-sosreport-caasntfaac02s2-622cw-master-2-04319737-2026-01-14-iiuyktd.tar.xz/sosreport-caasntfaac02s2-622cw-master-2-04319737-2026-01-14-iiuyktd/sos_commands/logs/journalctl_--no-pager_--boot_-1
      
      [...]
      
      Jan 14 12:13:20 caasntfaac02s2-622cw-master-2 root[843234]: machine-config-daemon[842644]: "drain is already completed on this node"
      Jan 14 12:13:35 caasntfaac02s2-622cw-master-2 root[845208]: machine-config-daemon[845092]: "Starting to manage node: caasntfaac02s2-622cw-master-2"
      Jan 14 12:13:35 caasntfaac02s2-622cw-master-2 root[845245]: machine-config-daemon[842644]: "Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=0 --delete=systemd.legacy_systemd_cgroup_controller=1 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1=\"all\" --append=psi=0]"
      Jan 14 12:13:45 caasntfaac02s2-622cw-master-2 root[845621]: machine-config-daemon[845092]: "Validated on-disk state"
      Jan 14 12:13:45 caasntfaac02s2-622cw-master-2 root[845622]: machine-config-daemon[845092]: "Starting update from rendered-master-0bca4a35d33eec735db42704ad626ba8 to rendered-master-6718cba8befccaff7d9a4f40df62766b: &{osUpdate:false kargs:true fips:false passwd:false files:false units:false kernelType:false extensions:false}"
      Jan 14 12:13:45 caasntfaac02s2-622cw-master-2 root[845623]: machine-config-daemon[845092]: "drain is already completed on this node"
      Jan 14 12:14:00 caasntfaac02s2-622cw-master-2 root[847645]: machine-config-daemon[847507]: "Starting to manage node: caasntfaac02s2-622cw-master-2"
      Jan 14 12:14:00 caasntfaac02s2-622cw-master-2 root[847678]: machine-config-daemon[845092]: "Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=0 --delete=systemd.legacy_systemd_cgroup_controller=1 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1=\"all\" --append=psi=0]"
      Jan 14 12:14:02 caasntfaac02s2-622cw-master-2 systemd[1]: kubelet.service: Deactivated successfully.
      Jan 14 12:14:02 caasntfaac02s2-622cw-master-2 systemd[1]: kubelet.service: Consumed 6h 21min 26.172s CPU time.
      Jan 14 12:14:06 caasntfaac02s2-622cw-master-2 root[847923]: machine-config-daemon[847507]: "Validated on-disk state"
      Jan 14 12:14:06 caasntfaac02s2-622cw-master-2 root[847924]: machine-config-daemon[847507]: "Starting update from rendered-master-0bca4a35d33eec735db42704ad626ba8 to rendered-master-6718cba8befccaff7d9a4f40df62766b: &{osUpdate:false kargs:true fips:false passwd:false files:false units:false kernelType:false extensions:false}"
      Jan 14 12:14:06 caasntfaac02s2-622cw-master-2 root[847925]: machine-config-daemon[847507]: "drain is already completed on this node"
      Jan 14 12:14:21 caasntfaac02s2-622cw-master-2 root[849744]: machine-config-daemon[847507]: "Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=0 --delete=systemd.legacy_systemd_cgroup_controller=1 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1=\"all\" --append=psi=0]"
      Jan 14 12:14:24 caasntfaac02s2-622cw-master-2 root[849824]: machine-config-daemon[847507]: "Performing post config change action: Reboot for config rendered-master-6718cba8befccaff7d9a4f40df62766b"
      Jan 14 12:14:24 caasntfaac02s2-622cw-master-2 root[849825]: machine-config-daemon[847507]: "Rebooting node"
      Jan 14 12:14:24 caasntfaac02s2-622cw-master-2 root[849826]: machine-config-daemon[847507]: "initiating reboot: Node will reboot into config rendered-master-6718cba8befccaff7d9a4f40df62766b"
      Jan 14 12:14:24 caasntfaac02s2-622cw-master-2 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-master-6718cba8befccaff7d9a4f40df62766b.
      Jan 14 12:14:24 caasntfaac02s2-622cw-master-2 root[849829]: machine-config-daemon[847507]: "reboot successful"
      Jan 14 12:14:24 caasntfaac02s2-622cw-master-2 systemd-logind[1164]: The system will reboot now!
      Jan 14 12:14:24 caasntfaac02s2-622cw-master-2 systemd-logind[1164]: System is rebooting.
      Jan 14 12:14:24 caasntfaac02s2-622cw-master-2 systemd[1]: machine-config-daemon-reboot.service: Deactivated successfully.
      Jan 14 12:14:24 caasntfaac02s2-622cw-master-2 systemd[1]: Stopped machine-config-daemon: Node will reboot into config rendered-master-6718cba8befccaff7d9a4f40df62766b.
      

      In this case, a new MCD container was starting on the node while the old one was still applying the rpm-ostree changes (see MCD process IDs *845092* and *847507* as examples).
      We had to stop kubelet.service for the MCD to keep running and finalize the update.
      In this scenario, the command triggering the issue was Running rpm-ostree [kargs ...
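
      The race above can be spotted directly in the journal by counting the distinct MCD process IDs that log interleaved update messages. A minimal diagnostic sketch, using abbreviated sample lines from the excerpt above (the file path and sample contents are illustrative):

```shell
# Write an abbreviated journal excerpt (sample lines from the report).
cat <<'EOF' > /tmp/mcd-journal.txt
Jan 14 12:13:35 node root[845208]: machine-config-daemon[845092]: "Starting to manage node"
Jan 14 12:13:35 node root[845245]: machine-config-daemon[842644]: "Running rpm-ostree [kargs ...]"
Jan 14 12:14:00 node root[847645]: machine-config-daemon[847507]: "Starting to manage node"
Jan 14 12:14:00 node root[847678]: machine-config-daemon[845092]: "Running rpm-ostree [kargs ...]"
EOF

# More than one distinct PID logging during the same update window
# indicates that a new MCD container started while the old one was
# still applying changes.
grep -o 'machine-config-daemon\[[0-9]*\]' /tmp/mcd-journal.txt | sort -u
```

      Against a live node, the same grep can be run over the full `journalctl --no-pager --boot` output, as in the sosreport commands above.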

      2. MCD Liveness failure

      MCD Liveness failure
      [vlours@supportshell-1 ~]$ grep -E "Container machine-config-daemon failed liveness probe, will be restarted" 04319737/0070-before-fix-sosreport-caasntfaac02s2-622cw-ms-worker-t9bx8-04319737-2026-01-15-elaltrv.tar.xz/sosreport-caasntfaac02s2-622cw-ms-worker-t9bx8-04319737-2026-01-15-elaltrv/sos_commands/logs/journalctl_--no-pager_--boot
      Jan 15 12:57:13 caasntfaac02s2-622cw-ms-worker-t9bx8 kubenswrapper[3304]: I0115 12:57:13.135447    3304 kuberuntime_manager.go:1025] "Message for Container of pod" containerName="machine-config-daemon" containerStatusID={"Type":"cri-o","ID":"d3a987f37967658bddced56d59a24c822c5fa39b0f932af47e5f37d90cb34174"} pod="openshift-machine-config-operator/machine-config-daemon-xsvb8" containerMessage="Container machine-config-daemon failed liveness probe, will be restarted"
      Jan 15 13:00:13 caasntfaac02s2-622cw-ms-worker-t9bx8 kubenswrapper[3304]: I0115 13:00:13.134650    3304 kuberuntime_manager.go:1025] "Message for Container of pod" containerName="machine-config-daemon" containerStatusID={"Type":"cri-o","ID":"131014e4defa319e7d09b2245ecfe4aac725c8b692681132148475d44296b3fe"} pod="openshift-machine-config-operator/machine-config-daemon-xsvb8" containerMessage="Container machine-config-daemon failed liveness probe, will be restarted"
      Jan 15 13:03:13 caasntfaac02s2-622cw-ms-worker-t9bx8 kubenswrapper[3304]: I0115 13:03:13.134099    3304 kuberuntime_manager.go:1025] "Message for Container of pod" containerName="machine-config-daemon" containerStatusID={"Type":"cri-o","ID":"d27ec6bc698e8a7a606e2303d1ef679e85a3a687f3d961465338340e1fbc58e5"} pod="openshift-machine-config-operator/machine-config-daemon-xsvb8" containerMessage="Container machine-config-daemon failed liveness probe, will be restarted"
      
      Remediation: stopping kubelet
      [vlours@supportshell-1 ~]$ grep -E "reboot|Container machine-config-daemon failed liveness probe|kubelet.service" 04319737/0080-after-fix-sosreport-caasntfaac02s2-622cw-ms-worker-t9bx8-04319737-2026-01-15-uqeafcx.tar.xz/sosreport-caasntfaac02s2-622cw-ms-worker-t9bx8-04319737-2026-01-15-uqeafcx/sos_commands/logs/journalctl_--no-pager_--boot_-1
      
      [...]
      
      Jan 15 13:29:43 caasntfaac02s2-622cw-ms-worker-t9bx8 kubenswrapper[3304]: I0115 13:29:43.135986    3304 kuberuntime_manager.go:1025] "Message for Container of pod" containerName="machine-config-daemon" containerStatusID={"Type":"cri-o","ID":"fcc35f7569c67dfb3bb22b105183b408952ea7237591ba02a0024774a320f173"} pod="openshift-machine-config-operator/machine-config-daemon-xsvb8" containerMessage="Container machine-config-daemon failed liveness probe, will be restarted"
      Jan 15 13:35:12 caasntfaac02s2-622cw-ms-worker-t9bx8 systemd[1]: kubelet.service: Deactivated successfully.
      Jan 15 13:35:12 caasntfaac02s2-622cw-ms-worker-t9bx8 systemd[1]: kubelet.service: Consumed 1h 28min 57.089s CPU time.
      Jan 15 13:37:46 caasntfaac02s2-622cw-ms-worker-t9bx8 root[3222868]: machine-config-daemon[3218343]: "initiating reboot: Node will reboot into config rendered-worker-586f3d0ccf00ca3fe223a1d378905089"
      Jan 15 13:37:46 caasntfaac02s2-622cw-ms-worker-t9bx8 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-worker-586f3d0ccf00ca3fe223a1d378905089.
      Jan 15 13:37:46 caasntfaac02s2-622cw-ms-worker-t9bx8 root[3222871]: machine-config-daemon[3218343]: "reboot successful"
      Jan 15 13:37:46 caasntfaac02s2-622cw-ms-worker-t9bx8 systemd-logind[1324]: The system will reboot now!
      Jan 15 13:37:46 caasntfaac02s2-622cw-ms-worker-t9bx8 systemd-logind[1324]: System is rebooting.
      Jan 15 13:37:46 caasntfaac02s2-622cw-ms-worker-t9bx8 systemd[1]: machine-config-daemon-reboot.service: Deactivated successfully.
      Jan 15 13:37:46 caasntfaac02s2-622cw-ms-worker-t9bx8 systemd[1]: Stopped machine-config-daemon: Node will reboot into config rendered-worker-586f3d0ccf00ca3fe223a1d378905089.
      

      In this scenario, the issue was related to the image in the file */etc/kubernetes/manifests/coredns.yaml*, which had already been updated to the 4.18 image while the rest of the configuration was still on 4.17.

      2026-01-15T13:02:18.141925704+11:00 E0115 13:02:18.141897 3144983 writer.go:231] Marking Degraded due to: "unexpected on-disk state validating against rendered-worker-[....]: content mismatch for file \"/etc/kubernetes/manifests/coredns.yaml\""
      2026-01-15T13:03:13.140601142+11:00 I0115 13:03:13.140483 3144983 daemon.go:1400] Shutting down MachineConfigDaemon
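
      The "content mismatch" above can be checked by hand: rendered MachineConfigs store file contents as URL-encoded data: URLs, so the expected coredns.yaml can be decoded and diffed against the on-disk file. A minimal sketch of the decoding step, using a tiny illustrative payload in place of the real `oc get mc rendered-worker-... -o json` output:

```shell
# Illustrative data: URL payload, standing in for the
# .spec.config.storage.files[].contents.source field of the
# rendered MachineConfig (real payloads may also be base64-encoded).
source='data:,apiVersion%3A%20v1%0Akind%3A%20Pod%0A'

# Strip the "data:," prefix and URL-decode the remainder into the
# expected file contents.
printf '%s' "${source#data:,}" \
  | python3 -c 'import sys,urllib.parse; sys.stdout.write(urllib.parse.unquote(sys.stdin.read()))' \
  > /tmp/coredns.expected.yaml

cat /tmp/coredns.expected.yaml
# On the node, the final check would be:
#   diff /tmp/coredns.expected.yaml /etc/kubernetes/manifests/coredns.yaml
```

      Any diff output would match the mismatch the MCD reported when marking the node Degraded.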
      

      Version-Release number of selected component (if applicable):

      The first scenario occurred after the customer had successfully completed their update from 4.16 to 4.17.41, while they were transitioning the cgroup from v1 to v2.
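
      For reference, the documented way to trigger that cgroup transition (section 8.6) is to set cgroupMode on the cluster-scoped Node config object; the MCO then rolls out the systemd.unified_cgroup_hierarchy kargs seen in the journal above. A sketch of the object, assuming the standard nodes.config.openshift.io API:

```yaml
apiVersion: config.openshift.io/v1
kind: Node
metadata:
  name: cluster
spec:
  cgroupMode: "v2"   # "v1" switches back; either change makes the MCO reboot nodes with new kargs
```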

      The second scenario occurred on ALL compute nodes when updating from 4.17.41 to 4.18.26.

      How reproducible:

      We tested the 4.17.41 case by applying the cgroup kargs, but we have not been able to reproduce the issue on a standard AWS cluster.
      The customer has multiple Operators installed, and this is a vSphere IPI installation.

      Steps to Reproduce:

      1. Unknown

      Actual results:

      The MCD runs without any issue until the MCO triggers the new MCP deployment on the node. From there, the container keeps restarting.
      For the MCP deployment to continue on the node, kubelet.service has to be stopped, which strongly suggests that the liveness probe is involved in both cases.

      Expected results:

      Smooth MCP deployment.

      Additional info:

      The KCS article 7136427 has been created to describe the issue.

        team-mco (Team MCO)
        rhn-support-vlours (Vincent Lours)
        Sergio Regidor de la Rosa
        Votes: 1
        Watchers: 8