Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version: None
Affects Version: 4.16.z
TL;DR: I suspect that a combination of resource constraints and thousands of systemd services can trigger an underlying race in the interplay between rpm-ostree, the Machine Config Operator, and the actual node reboot.
Here's how an rpm-ostree update works, according to my understanding:
Note: Red Hat viewers, please see https://drive.google.com/file/d/1zYirNlsbFsBRNRZ8MVLugdsTsMV6yCQR/view?usp=sharing for the original ASCII diagram, which Jira's horrible UI disfigured ...
1) rpm-ostree kargs starts the rpm-ostree daemon (rpm-ostree start-daemon) and instructs it via dbus.
2) The daemon writes /run/ostree/staged-deployment.
3) ostree-finalize-staged.path monitors that file; when it appears, it triggers ostree-finalize-staged.service.
4) ostree-finalize-staged.service depends on, starts and runs ostree-finalize-staged-hold.service (starting --> started).
5) Once the dependency is met, ostree-finalize-staged.service finishes immediately: its start is a noop, and it remains active (exited).
6) On service stopping (e.g. at reboot), "Stopping ostree-finalize-staged.service" runs /usr/bin/ostree admin finalize-staged, which reads /run/ostree/staged-deployment and applies the staged changes.
7) Only after ostree-finalize-staged.service is stopped does ostree-finalize-staged-hold.service stop (stopping --> stopped).
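For reference, these trigger and dependency relationships can be inspected directly on a node. A quick sketch using standard systemctl commands (unit names as above; the comments describe what I'd expect to see):

systemctl cat ostree-finalize-staged.path        # the path watch on /run/ostree/staged-deployment
systemctl cat ostree-finalize-staged.service     # the ExecStop= that runs 'ostree admin finalize-staged'
systemctl list-dependencies ostree-finalize-staged.service   # pulls in the -hold service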
In the journal, on a healthy run from a lab of mine, this plays out as:
Feb 19 15:04:28 ocp-on-osp.workload.bos2.lab systemd[1]: Starting Hold /boot Open for OSTree Finalize Staged Deployment... # <--- comes from ostree-finalize-staged-hold.service, pulled in via dep relationship with ostree-finalize-staged.service
Feb 19 15:04:28 ocp-on-osp.workload.bos2.lab systemd[1]: Started Hold /boot Open for OSTree Finalize Staged Deployment. # <--- remains running now
Feb 19 15:04:28 ocp-on-osp.workload.bos2.lab systemd[1]: Finished OSTree Finalize Staged Deployment. # <--- is a noop, remains active (exited) - comes from ostree-finalize-staged.service
(...)
Feb 19 15:06:50 ocp-on-osp.workload.bos2.lab systemd[1]: Stopping OSTree Finalize Staged Deployment... # <--- stop logic starting now, meaning ostree admin finalize-staged
Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: ostree-finalize-staged.service: Deactivated successfully.
Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: Stopped OSTree Finalize Staged Deployment.
Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: Stopping Hold /boot Open for OSTree Finalize Staged Deployment... # <--- pulled in via dep, starts before, stops after
Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: Stopped Hold /boot Open for OSTree Finalize Staged Deployment. # <--- same
In the attached support case, this logic fails with the following log messages:
Jan 08 00:28:44 worker0 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-worker-39e251cb69d9f430a38dd14b0d3bae3c.
Jan 08 00:28:44 worker0 systemd[1]: Starting Hold /boot Open for OSTree Finalize Staged Deployment...
Jan 08 00:28:44 worker0 root[15374]: machine-config-daemon[10214]: reboot successful
Jan 08 00:28:44 worker0 systemd-logind[1796]: The system will reboot now!
Jan 08 00:28:44 worker0 ovs-vswitchd[1947]: ovs|00075|connmgr|INFO|br-int<->unix#2: 2317 flow_mods in the 7 s starting 10 s ago (2111 adds, 206 deletes)
Jan 08 00:28:44 worker0 kernel: ice 0000:6c:00.0 ens14f0: Setting MAC 36:4a:79:c2:61:e7 on VF 34. VF driver will be reinitialized
Jan 08 00:28:44 worker0 kernel: iavf 0000:6c:05.2: Reset indication received from the PF
Jan 08 00:28:44 worker0 kernel: iavf 0000:6c:05.2: Scheduling reset task
Jan 08 00:28:44 worker0 kernel: iavf 0000:6c:05.2: Removing device
Jan 08 00:28:44 worker0 systemd[1]: Started Hold /boot Open for OSTree Finalize Staged Deployment.
Jan 08 00:28:44 worker0 systemd-logind[1796]: System is rebooting.
Jan 08 00:28:44 worker0 systemd[1]: Requested transaction contradicts existing jobs: Transaction for ostree-finalize-staged.service/start is destructive (ostree-finalize-staged-hold.service has 'stop' job queued, but 'start' is included in transaction).
Jan 08 00:28:44 worker0 systemd[1]: ostree-finalize-staged.path: Failed to queue unit startup job: Transaction for ostree-finalize-staged.service/start is destructive (ostree-finalize-staged-hold.service has 'stop' job queued, but 'start' is included in transaction).
Jan 08 00:28:44 worker0 systemd[1]: ostree-finalize-staged.path: Failed with result 'resources'. # <------------- here
(...)
Jan 08 00:29:01 worker0 systemd[1]: Stopping Hold /boot Open for OSTree Finalize Staged Deployment...
(...)
Jan 08 00:29:01 worker0 systemd[1]: Stopped Hold /boot Open for OSTree Finalize Staged Deployment.
(...)
The message `Requested transaction contradicts existing jobs: Transaction for ostree-finalize-staged.service/start is destructive (ostree-finalize-staged-hold.service has 'stop' job queued, but 'start' is included in transaction).` means that ostree-finalize-staged requires ostree-finalize-staged-hold to be running, but the reboot was triggered right after ostree-finalize-staged-hold completed its start-up and before ostree-finalize-staged resumed its own start sequence. At that point, the dependency relationship can never be fulfilled, and ostree-finalize-staged can never start. However, rpm-ostree applies all staged changes in the ExecStop of ostree-finalize-staged, which is why the changes are never applied.
Here's the sequence of events when it goes wrong:
0) ostree-finalize-staged.path sees that /run/ostree/staged-deployment is created and it wants to start ostree-finalize-staged.service
1) ostree-finalize-staged.service wants to run, but first requires ostree-finalize-staged-hold.service to start
2) ostree-finalize-staged-hold.service starts, as requested by ostree-finalize-staged.service
3) reboot triggered
4) ostree-finalize-staged-hold.service started
5) ----> ASSUMPTION: reboot queues stop job for ostree-finalize-staged-hold.service
6) ostree-finalize-staged.service wants to run and requires ostree-finalize-staged-hold.service to be started, but that unit now has a stop job queued
----> ostree-finalize-staged.service can never start
7) ostree-finalize-staged.path: Failed with result 'resources' (because what it wanted to start, ostree-finalize-staged.service, can't start)
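On an affected node, this failure mode can be confirmed after the fact. A hedged sketch using standard systemctl/journalctl invocations (unit names from above; -b -1 assumes the failure happened in the previous boot):

# The path unit's result; in the failure case it reports 'resources'
systemctl status ostree-finalize-staged.path
# The ordering of start/stop events across the involved units in the previous boot
journalctl -b -1 -u ostree-finalize-staged.path -u ostree-finalize-staged.service -u ostree-finalize-staged-hold.service
# During the shutdown window itself, the queued jobs can be listed
systemctl list-jobs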
Regardless of why the systemd hold service is delayed enough for the reboot to interfere with it ...
A race can occur here because rpm-ostree kargs (or any rpm-ostree update operation) blocks only until the rpm-ostree daemon has written /run/ostree/staged-deployment (the changes are staged in this file); rpm-ostree then returns.
Everything else happens asynchronously: ostree-finalize-staged.path detects that the file was created and starts ostree-finalize-staged.service, whose ExecStop runs /usr/bin/ostree admin finalize-staged.
Only this last command actually applies the staged changes.
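This asynchrony is easy to observe by hand on a test node. A hedged sketch (test.arg=1 is a made-up kernel argument for illustration):

# Stage a kernel argument change; this returns as soon as the deployment is staged
rpm-ostree kargs --append=test.arg=1
# The daemon wrote the marker file, but nothing has been applied yet
ls -l /run/ostree/staged-deployment
# The finalize service should now be active (exited); the real work is in its ExecStop
systemctl is-active ostree-finalize-staged.service
systemctl cat ostree-finalize-staged.service | grep ExecStop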
The MCO daemon actually blocks only on rpm-ostree kargs; it does not wait for ostree-finalize-staged.service to be active.
This means the MCO reboots the node immediately after the rpm-ostree kargs changes are staged, without waiting for the unit that actually applies the changes to be active.
Here lies the potential for a race condition which - I believe - the customer is hitting due to a very specific node configuration.
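To illustrate the ordering that would close this window, here is a minimal shell sketch. This is not the MCO's actual code (a real fix would live in the MCO daemon), and the polling interval and 60-second budget are made up:

# Stage the change; returns once /run/ostree/staged-deployment is written
rpm-ostree kargs --append=abc=def
# Before rebooting, wait until the finalize service has actually been triggered and
# is active (exited), so that its ExecStop will run during shutdown
for i in $(seq 1 60); do
    systemctl is-active --quiet ostree-finalize-staged.service && break
    sleep 1
done
systemctl reboot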
While I cannot reproduce the exact error message from the customer's journal, it's super easy to reproduce a failure by exploiting this async mechanism.
Spawn a cluster with cluster bot, e.g. `launch 4.16.30 aws,no-spot`, then run the following test:
#!/bin/bash
TIMEOUT=900

echo "#####################################"
echo "Creating MCP for a single worker node"
echo "#####################################"
first_node=$(oc get nodes -l node-role.kubernetes.io/worker= -o name | head -1)
oc label "${first_node}" node-role.kubernetes.io/worker-test=
cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-test
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-test]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-test: ""
EOF

echo "#####################################"
echo "   Creating reproducer conditions   "
echo "#####################################"
f=$(mktemp)
cat <<EOF>$f
[Service]
ExecStartPre=/bin/bash -c "echo 'exec start pre'; /bin/sleep 15; echo 'exec start pre end'"
EOF
cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker-test
  name: 99-worker-test-smoketest-prep
spec:
  baseOSExtensionsContainerImage: ""
  config:
    ignition:
      config:
        replace:
          verification: {}
      proxy: {}
      security:
        tls: {}
      timeouts: {}
      version: 3.2.0
    passwd: {}
    storage:
      files:
        - contents:
            source: data:text/plain;charset=utf-8;base64,$(cat $f | base64 -w0)
            verification: {}
          group: {}
          mode: 600
          path: /etc/systemd/system/ostree-finalize-staged-hold.service.d/override.conf
EOF
echo "Sleeping for a bit ..."
sleep 60
echo "Waiting for MCP to be updated"
oc wait --for=condition=Updated=true mcp/worker-test --timeout=${TIMEOUT}s

echo "#####################################"
echo "      Updating kernel arguments      "
echo "#####################################"
cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker-test
  name: 99-worker-test-kernelarg-smoketest
spec:
  kernelArguments:
    - abc=def
EOF
echo "Sleeping for a bit ..."
sleep 60
echo "Waiting for MCP to be updated"
oc wait --for=condition=Updated=true mcp/worker-test --timeout=${TIMEOUT}s &
oc wait --for=condition=Degraded=true mcp/worker-test --timeout=${TIMEOUT}s &
wait -n

echo "#####################################"
echo "             End of test             "
echo "#####################################"
oc get mcp
While the above test delays ostree-finalize-staged-hold.service by 15 seconds and does not reproduce the exact customer error message, it still reproduces an issue where the rpm-ostree staged changes are not applied.
The 15-second sleep is of course extreme, but it demonstrates that any delay in starting the -hold service can lead to a race condition that makes the process fail.
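For reference, this is the drop-in that the reproducer installs (the decoded base64 payload from the MachineConfig above):

# /etc/systemd/system/ostree-finalize-staged-hold.service.d/override.conf
[Service]
ExecStartPre=/bin/bash -c "echo 'exec start pre'; /bin/sleep 15; echo 'exec start pre end'"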
With the above test, the MCP will go into degraded state with:
[akaris@workstation 04029338 (master)]$ oc get mcp worker-test -o yaml | grep abc -C4
    type: Updating
  - lastTransitionTime: "2025-02-21T12:13:03Z"
    message: 'Node ip-10-0-107-203.us-west-2.compute.internal is reporting: "unexpected on-disk state validating against rendered-worker-test-bf73a6e5a7892ac4151522bbf11a8f72: missing expected kernel arguments: [abc=def]"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
  - lastTransitionTime: "2025-02-21T12:13:03Z"
In the journal:
Feb 21 12:09:11 ip-10-0-107-203 root[8530]: machine-config-daemon[2561]: Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1="all" --delete=psi=0 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1="all" --append=psi=0 --append=abc=def]
Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: client(id:machine-config-operator dbus:1.254 unit:crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope uid:0) added; new total=1
Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Loaded sysroot
Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Locked sysroot
Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Initiated txn KernelArgs for client(id:machine-config-operator dbus:1.254 unit:crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope uid:0): /org/projectatomic/rpmostree1/rhcos
Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Process [pid: 8531 uid: 0 unit: crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope] connected to transaction progress
Feb 21 12:09:11 ip-10-0-107-203 kubenswrapper[2259]: I0221 12:09:11.564171    2259 patch_prober.go:28] interesting pod/loki-promtail-z7ll9 container/promtail namespace/openshift-e2e-loki: Readiness probe status=failure output="Get \"http://10.129.2.3:3101/ready\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" start-of-body=
Feb 21 12:09:11 ip-10-0-107-203 kubenswrapper[2259]: I0221 12:09:11.564237    2259 prober.go:107] "Probe failed" probeType="Readiness" pod="openshift-e2e-loki/loki-promtail-z7ll9" podUID="c31172d3-6a1e-457a-8e66-c49db072c977" containerName="promtail" probeResult="failure" output="Get \"http://10.129.2.3:3101/ready\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Feb 21 12:09:16 ip-10-0-107-203 systemd[1]: Starting Hold /boot Open for OSTree Finalize Staged Deployment...
Feb 21 12:09:16 ip-10-0-107-203 bash[8598]: exec start pre
Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Created new deployment /ostree/deploy/rhcos/deploy/7c3910540178cc5e35b9c819a09cb58831b264384e9ff7a071a73a379dc32718.4
Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: sanitycheck(/usr/bin/true) successful
Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Txn KernelArgs on /org/projectatomic/rpmostree1/rhcos successful
Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Unlocked sysroot
Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Process [pid: 8531 uid: 0 unit: crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope] disconnected from transaction progress
Feb 21 12:09:17 ip-10-0-107-203 rpm-ostree[6705]: client(id:machine-config-operator dbus:1.254 unit:crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope uid:0) vanished; remaining=0
Feb 21 12:09:17 ip-10-0-107-203 rpm-ostree[6705]: In idle state; will auto-exit in 63 seconds
Feb 21 12:09:17 ip-10-0-107-203 root[8608]: machine-config-daemon[2561]: Rebooting node
Feb 21 12:09:17 ip-10-0-107-203 root[8609]: machine-config-daemon[2561]: initiating reboot: Node will reboot into config rendered-worker-test-bf73a6e5a7892ac4151522bbf11a8f72
Feb 21 12:09:17 ip-10-0-107-203 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-worker-test-bf73a6e5a7892ac4151522bbf11a8f72.
Feb 21 12:09:17 ip-10-0-107-203 root[8612]: machine-config-daemon[2561]: reboot successful
Feb 21 12:09:17 ip-10-0-107-203 systemd-logind[913]: The system will reboot now!
Feb 21 12:09:17 ip-10-0-107-203 systemd-logind[913]: System is rebooting.
And here are the MCO daemon logs:
2025-02-21T12:09:11.464509464+00:00 stderr F I0221 12:09:11.464485    2561 update.go:2641] Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1="all" --delete=psi=0 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1="all" --append=psi=0 --append=abc=def]
2025-02-21T12:09:11.466273927+00:00 stderr F I0221 12:09:11.466231    2561 update.go:2626] Running: rpm-ostree kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1="all" --delete=psi=0 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1="all" --append=psi=0 --append=abc=def
2025-02-21T12:09:16.367223780+00:00 stdout F Staging deployment...done
2025-02-21T12:09:17.184102954+00:00 stdout F Changes queued for next boot. Run "systemctl reboot" to start a reboot
So we can see here that the MCO daemon reboots the node even though rpm-ostree isn't ready yet, because the service that actually applies the changes is not yet active.
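After such a failed run, the missing kernel argument is easy to verify on the node (abc=def is the argument from the reproducer above):

# The staged karg never made it onto the kernel command line
grep -o 'abc=def' /proc/cmdline || echo "abc=def missing from /proc/cmdline"
# rpm-ostree's view of the current kernel arguments agrees
rpm-ostree kargs | grep abc= || echo "abc=def not applied"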