-
Bug
-
Resolution: Done-Errata
-
Major
-
None
-
4.16.z
-
Critical
-
None
-
MCO Sprint 268
-
1
-
False
-
-
Previously, when the MCO rebooted the node too quickly after staging an update, the update failed. With this release, the MCO waits for the staging operation to finish before rebooting the system, allowing the update to complete.
-
Bug Fix
-
Done
-
This is a clone of issue OCPBUGS-53111. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-51150. The following is the description of the original issue:
—
TL;DR: I suspect that a combination of resource constraints and thousands of systemd services can trigger an underlying race in the interplay between rpm-ostree, the Machine Config Operator (MCO), and the actual node reboot.
Here's the graph of how an rpm-ostree update works according to my understanding (read bottom to top):
Note: Red Hat viewers, please see https://drive.google.com/file/d/1zYirNlsbFsBRNRZ8MVLugdsTsMV6yCQR/view?usp=sharing for an ASCII version that wasn't disfigured by Jira's horrible UI ...
ostree-finalize-staged-hold.service stopped
  ^
  |
ostree-finalize-staged-hold.service stopping
  ^
  |
Stopped ostree-finalize-staged.service
  ^
  |
Stopping ostree-finalize-staged.service --(runs)--> /usr/bin/ostree admin finalize-staged --(reads)--> /run/ostree/staged-deployment
  ^
  | (on service stopping, e.g. reboot)
  |
Finished ostree-finalize-staged.service (is a noop) [active (exited)] <--(dependency met)-- ostree-finalize-staged-hold.service started
  ^                                                                                           ^
  |                                                                                           |
ostree-finalize-staged.service --(depends on, starts and runs)--> ostree-finalize-staged-hold.service starting
  ^
  | (triggers)
  |
ostree-finalize-staged.path
  ^
  | (monitors)
  |
rpm-ostree start-daemon --(writes)--> /run/ostree/staged-deployment
  ^
  | (starts and instructs via dbus)
  |
rpm-ostree kargs
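This wiring is easy to verify on a live node. A minimal inspection sketch (assuming access via `oc debug node/<node>` followed by `chroot /host`):

# Show the unit that carries the actual finalize logic in its ExecStop=
systemctl cat ostree-finalize-staged.service
# Show the path unit that watches /run/ostree/staged-deployment
systemctl cat ostree-finalize-staged.path
# Show the dependency on the -hold service
systemctl list-dependencies ostree-finalize-staged.service
# Only present while a deployment is staged
ls -l /run/ostree/staged-deployment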
In the journal, on a healthy run from a lab of mine, this plays out as:
Feb 19 15:04:28 ocp-on-osp.workload.bos2.lab systemd[1]: Starting Hold /boot Open for OSTree Finalize Staged Deployment... # <--- comes from ostree-finalize-staged-hold.service, pulled in via dep relationship with ostree-finalize-staged.service
Feb 19 15:04:28 ocp-on-osp.workload.bos2.lab systemd[1]: Started Hold /boot Open for OSTree Finalize Staged Deployment. # <--- remains running now
Feb 19 15:04:28 ocp-on-osp.workload.bos2.lab systemd[1]: Finished OSTree Finalize Staged Deployment. # <--- is a noop, remains active (exited) - comes from ostree-finalize-staged.service
(...)
Feb 19 15:06:50 ocp-on-osp.workload.bos2.lab systemd[1]: Stopping OSTree Finalize Staged Deployment... # <--- stop logic starting now, meaning ostree admin finalize-staged runs
Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: ostree-finalize-staged.service: Deactivated successfully.
Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: Stopped OSTree Finalize Staged Deployment.
Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: Stopping Hold /boot Open for OSTree Finalize Staged Deployment... # <--- pulled in via dep, starts before, stops after
Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: Stopped Hold /boot Open for OSTree Finalize Staged Deployment. # <--- same
In the attached support case, this logic fails with the following log messages:
Jan 08 00:28:44 worker0 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-worker-39e251cb69d9f430a38dd14b0d3bae3c.
Jan 08 00:28:44 worker0 systemd[1]: Starting Hold /boot Open for OSTree Finalize Staged Deployment...
Jan 08 00:28:44 worker0 root[15374]: machine-config-daemon[10214]: reboot successful
Jan 08 00:28:44 worker0 systemd-logind[1796]: The system will reboot now!
Jan 08 00:28:44 worker0 ovs-vswitchd[1947]: ovs|00075|connmgr|INFO|br-int<->unix#2: 2317 flow_mods in the 7 s starting 10 s ago (2111 adds, 206 deletes)
Jan 08 00:28:44 worker0 kernel: ice 0000:6c:00.0 ens14f0: Setting MAC 36:4a:79:c2:61:e7 on VF 34. VF driver will be reinitialized
Jan 08 00:28:44 worker0 kernel: iavf 0000:6c:05.2: Reset indication received from the PF
Jan 08 00:28:44 worker0 kernel: iavf 0000:6c:05.2: Scheduling reset task
Jan 08 00:28:44 worker0 kernel: iavf 0000:6c:05.2: Removing device
Jan 08 00:28:44 worker0 systemd[1]: Started Hold /boot Open for OSTree Finalize Staged Deployment.
Jan 08 00:28:44 worker0 systemd-logind[1796]: System is rebooting.
Jan 08 00:28:44 worker0 systemd[1]: Requested transaction contradicts existing jobs: Transaction for ostree-finalize-staged.service/start is destructive (ostree-finalize-staged-hold.service has 'stop' job queued, but 'start' is included in transaction).
Jan 08 00:28:44 worker0 systemd[1]: ostree-finalize-staged.path: Failed to queue unit startup job: Transaction for ostree-finalize-staged.service/start is destructive (ostree-finalize-staged-hold.service has 'stop' job queued, but 'start' is included in transaction).
Jan 08 00:28:44 worker0 systemd[1]: ostree-finalize-staged.path: Failed with result 'resources'. # <------------- here
(...)
Jan 08 00:29:01 worker0 systemd[1]: Stopping Hold /boot Open for OSTree Finalize Staged Deployment...
(...)
Jan 08 00:29:01 worker0 systemd[1]: Stopped Hold /boot Open for OSTree Finalize Staged Deployment.
(...)
The message `Requested transaction contradicts existing jobs: Transaction for ostree-finalize-staged.service/start is destructive (ostree-finalize-staged-hold.service has 'stop' job queued, but 'start' is included in transaction).` means that ostree-finalize-staged requires ostree-finalize-staged-hold to be running, but a reboot was triggered right after ostree-finalize-staged-hold completed its startup and before ostree-finalize-staged resumed its own start sequence. At that point, the dependency can never be fulfilled, so ostree-finalize-staged can never start. Because rpm-ostree applies all staged changes in the ExecStop of ostree-finalize-staged, the changes are never applied.
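On an affected node, the failure is visible in the journal of the boot during which the reboot was issued. A sketch for finding it (the `-b -1` offset assumes you are looking right after the failed reboot):

# Journal of the previous boot, limited to the involved units
journalctl -b -1 -u ostree-finalize-staged.path -u ostree-finalize-staged.service -u ostree-finalize-staged-hold.service
# Or grep for the characteristic error messages
journalctl -b -1 | grep -E "transaction contradicts existing jobs|Failed to queue unit startup job|Failed with result 'resources'"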
Here's the sequence of events when things go wrong (see the inspection sketch after the list):
0) ostree-finalize-staged.path sees that /run/ostree/staged-deployment was created and wants to start ostree-finalize-staged.service
1) ostree-finalize-staged.service wants to run, but requires ostree-finalize-staged-hold.service to start first
2) ostree-finalize-staged-hold.service starts, as requested by ostree-finalize-staged.service
3) reboot is triggered
4) ostree-finalize-staged-hold.service finishes starting
5) ----> ASSUMPTION: the reboot queues a stop job for ostree-finalize-staged-hold.service
6) ostree-finalize-staged.service wants to run, but requires ostree-finalize-staged-hold.service to be started, and that unit is now stopping; ostree-finalize-staged.service can never start
7) ostree-finalize-staged.path fails with result 'resources' (because the unit it wanted to start, ostree-finalize-staged.service, cannot start)
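If you can watch the node during shutdown (serial console, or SSH while it is still reachable), the queued jobs from steps 5 and 6 can be inspected directly; a sketch:

# Show the job queue; during the race, ostree-finalize-staged-hold.service has a 'stop' job queued
systemctl list-jobs
# Inspect the resulting unit states
systemctl status ostree-finalize-staged.path ostree-finalize-staged.service ostree-finalize-staged-hold.service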
Regardless of why the systemd hold service is delayed long enough for the reboot to affect it: a race can occur here because `rpm-ostree kargs` (or any other rpm-ostree update operation) blocks only until the rpm-ostree daemon has written /run/ostree/staged-deployment (the file in which the changes are staged), and then returns.
Everything else happens asynchronously: ostree-finalize-staged.path detects that the file was created and starts ostree-finalize-staged.service, whose ExecStop runs /usr/bin/ostree admin finalize-staged.
Only this last command actually applies the staged changes.
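This asynchronous hand-off can be observed by hand on a test node; a sketch (`foo=bar` is just an example kernel argument):

# Returns as soon as the deployment is staged, i.e. once /run/ostree/staged-deployment is written
rpm-ostree kargs --append=foo=bar
# The new deployment shows up as staged, but is not applied yet
rpm-ostree status
# Reports "active"; the unit is active (exited) and the real work only happens in its ExecStop
systemctl is-active ostree-finalize-staged.service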
The MCO daemon blocks only on `rpm-ostree kargs`; it does not wait for ostree-finalize-staged.service to be active.
This means the MCO reboots the node immediately after the kernel-argument changes have been staged, without waiting for the service that actually applies them to be active.
Here lies the potential for a race condition which, I believe, the customer is hitting due to a very specific node configuration.
While I cannot reproduce the exact error message from the customer's journal, it is easy to reproduce a failure by exploiting this asynchronous mechanism.
Spawn a cluster with cluster bot, e.g. `launch 4.16.30 aws,no-spot`, then run the following test:
#!/bin/bash
TIMEOUT=900

echo "#####################################"
echo "Creating MCP for a single worker node"
echo "#####################################"
first_node=$(oc get nodes -l node-role.kubernetes.io/worker= -o name | head -1)
oc label "${first_node}" node-role.kubernetes.io/worker-test=
cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-test
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-test]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-test: ""
EOF

echo "#####################################"
echo "   Creating reproducer conditions    "
echo "#####################################"
f=$(mktemp)
cat <<EOF > $f
[Service]
ExecStartPre=/bin/bash -c "echo 'exec start pre'; /bin/sleep 15; echo 'exec start pre end'"
EOF
cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker-test
  name: 99-worker-test-smoketest-prep
spec:
  baseOSExtensionsContainerImage: ""
  config:
    ignition:
      config:
        replace:
          verification: {}
      proxy: {}
      security:
        tls: {}
      timeouts: {}
      version: 3.2.0
    passwd: {}
    storage:
      files:
        - contents:
            source: data:text/plain;charset=utf-8;base64,$(cat $f | base64 -w0)
            verification: {}
          group: {}
          mode: 600
          path: /etc/systemd/system/ostree-finalize-staged-hold.service.d/override.conf
EOF
echo "Sleeping for a bit ..."
sleep 60
echo "Waiting for MCP to be updated"
oc wait --for=condition=Updated=true mcp/worker-test --timeout=${TIMEOUT}s

echo "#####################################"
echo "     Updating kernel arguments       "
echo "#####################################"
cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker-test
  name: 99-worker-test-kernelarg-smoketest
spec:
  kernelArguments:
    - abc=def
EOF
echo "Sleeping for a bit ..."
sleep 60
echo "Waiting for MCP to be updated"
oc wait --for=condition=Updated=true mcp/worker-test --timeout=${TIMEOUT}s &
oc wait --for=condition=Degraded=true mcp/worker-test --timeout=${TIMEOUT}s &
wait -n

echo "#####################################"
echo "            End of test              "
echo "#####################################"
oc get mcp
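Once the pool degrades, the lost kernel argument can also be confirmed directly on the node; a quick sketch (replace <node> with the labeled worker):

# No output: abc=def never made it into the running kernel
oc debug node/<node> -- chroot /host grep -o 'abc=def' /proc/cmdline
# The staged deployment was discarded on reboot
oc debug node/<node> -- chroot /host rpm-ostree status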
While the above test delays ostree-finalize-staged-hold.service by 15 seconds and does not reproduce the customer's exact error message, it still reproduces an issue where the rpm-ostree staged changes are not applied.
The 15-second sleep is of course extreme, but it demonstrates that any delay in starting the -hold service can lead to a race condition that makes the process fail.
With the above test, the MCP goes into a degraded state with:
[akaris@workstation 04029338 (master)]$ oc get mcp worker-test -o yaml | grep abc -C4
    type: Updating
  - lastTransitionTime: "2025-02-21T12:13:03Z"
    message: 'Node ip-10-0-107-203.us-west-2.compute.internal is reporting: "unexpected on-disk state validating against rendered-worker-test-bf73a6e5a7892ac4151522bbf11a8f72: missing expected kernel arguments: [abc=def]"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
  - lastTransitionTime: "2025-02-21T12:13:03Z"
In the journal:
Feb 21 12:09:11 ip-10-0-107-203 root[8530]: machine-config-daemon[2561]: Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1="all" --delete=psi=0 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1="all" --append=psi=0 --append=abc=def]
Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: client(id:machine-config-operator dbus:1.254 unit:crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope uid:0) added; new total=1
Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Loaded sysroot
Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Locked sysroot
Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Initiated txn KernelArgs for client(id:machine-config-operator dbus:1.254 unit:crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope uid:0): /org/projectatomic/rpmostree1/rhcos
Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Process [pid: 8531 uid: 0 unit: crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope] connected to transaction progress
Feb 21 12:09:11 ip-10-0-107-203 kubenswrapper[2259]: I0221 12:09:11.564171 2259 patch_prober.go:28] interesting pod/loki-promtail-z7ll9 container/promtail namespace/openshift-e2e-loki: Readiness probe status=failure output="Get \"http://10.129.2.3:3101/ready\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" start-of-body=
Feb 21 12:09:11 ip-10-0-107-203 kubenswrapper[2259]: I0221 12:09:11.564237 2259 prober.go:107] "Probe failed" probeType="Readiness" pod="openshift-e2e-loki/loki-promtail-z7ll9" podUID="c31172d3-6a1e-457a-8e66-c49db072c977" containerName="promtail" probeResult="failure" output="Get \"http://10.129.2.3:3101/ready\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Feb 21 12:09:16 ip-10-0-107-203 systemd[1]: Starting Hold /boot Open for OSTree Finalize Staged Deployment...
Feb 21 12:09:16 ip-10-0-107-203 bash[8598]: exec start pre
Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Created new deployment /ostree/deploy/rhcos/deploy/7c3910540178cc5e35b9c819a09cb58831b264384e9ff7a071a73a379dc32718.4
Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: sanitycheck(/usr/bin/true) successful
Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Txn KernelArgs on /org/projectatomic/rpmostree1/rhcos successful
Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Unlocked sysroot
Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Process [pid: 8531 uid: 0 unit: crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope] disconnected from transaction progress
Feb 21 12:09:17 ip-10-0-107-203 rpm-ostree[6705]: client(id:machine-config-operator dbus:1.254 unit:crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope uid:0) vanished; remaining=0
Feb 21 12:09:17 ip-10-0-107-203 rpm-ostree[6705]: In idle state; will auto-exit in 63 seconds
Feb 21 12:09:17 ip-10-0-107-203 root[8608]: machine-config-daemon[2561]: Rebooting node
Feb 21 12:09:17 ip-10-0-107-203 root[8609]: machine-config-daemon[2561]: initiating reboot: Node will reboot into config rendered-worker-test-bf73a6e5a7892ac4151522bbf11a8f72
Feb 21 12:09:17 ip-10-0-107-203 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-worker-test-bf73a6e5a7892ac4151522bbf11a8f72.
Feb 21 12:09:17 ip-10-0-107-203 root[8612]: machine-config-daemon[2561]: reboot successful
Feb 21 12:09:17 ip-10-0-107-203 systemd-logind[913]: The system will reboot now!
Feb 21 12:09:17 ip-10-0-107-203 systemd-logind[913]: System is rebooting.
And here are the MCO daemon logs:
2025-02-21T12:09:11.464509464+00:00 stderr F I0221 12:09:11.464485 2561 update.go:2641] Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1="all" --delete=psi=0 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1="all" --append=psi=0 --append=abc=def]
2025-02-21T12:09:11.466273927+00:00 stderr F I0221 12:09:11.466231 2561 update.go:2626] Running: rpm-ostree kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1="all" --delete=psi=0 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1="all" --append=psi=0 --append=abc=def
2025-02-21T12:09:16.367223780+00:00 stdout F Staging deployment...done
2025-02-21T12:09:17.184102954+00:00 stdout F Changes queued for next boot. Run "systemctl reboot" to start a reboot
Here we can see that the MCO daemon reboots the node even though rpm-ostree is not ready yet: the service that actually applies the staged changes is not yet active.
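Conceptually, the fix described in the release note amounts to waiting for the finalize service before issuing the reboot. A shell sketch of that idea only (the actual fix lives in the MCO's Go code, not in a script like this):

# Wait up to 60s for ostree-finalize-staged.service to become active before rebooting
for _ in $(seq 1 60); do
    [ "$(systemctl show -p ActiveState --value ostree-finalize-staged.service)" = "active" ] && break
    sleep 1
done
systemctl reboot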
- blocks
-
OCPBUGS-53313 Race condition in rpm-ostree update logic
-
- Closed
-
- clones
-
OCPBUGS-53111 Race condition in rpm-ostree update logic
-
- Closed
-
- is blocked by
-
OCPBUGS-53111 Race condition in rpm-ostree update logic
-
- Closed
-
- is cloned by
-
OCPBUGS-53313 Race condition in rpm-ostree update logic
-
- Closed
-
- links to
-
RHSA-2025:3059 OpenShift Container Platform 4.17.22 bug fix and security update