OpenShift Bugs / OCPBUGS-51150

Race condition in rpm-ostree update logic


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version: 4.16.z
    • Severity: Critical
    • Sprint: MCO Sprint 268

      TL;DR: I suspect that a combination of resource constraints and thousands of systemd services can trigger an underlying race between rpm-ostree, the Machine Config Operator (MCO), and the moment at which we actually reboot nodes.

      Here's the graph of how an rpm-ostree update works according to my understanding (read bottom to top):

      Note: Red Hat viewers, please see https://drive.google.com/file/d/1zYirNlsbFsBRNRZ8MVLugdsTsMV6yCQR/view?usp=sharing for an ASCII version that wasn't disfigured by Jira's horrible UI ...

                                   ostree-finalize-staged-hold.service stopped
                                                ^
                                                |
                                        ostree-finalize-staged-hold.service stopping
                                                ^
                                                |
                                      Stopped ostree-finalize-staged.service <------------------------
                                                                                                     |
                                                                                                     | 
                                      Stopping ostree-finalize-staged.service  --(runs)--> /usr/bin/ostree admin finalize-staged -----------------------------(reads)----------------------------|
                                                ^                                                                                                                                                |
                                                |                                                                                                                                                |
                                      (on service stopping, e.g. reboot)                                                                                                                         |
                                                |                                                                                                                                                |
                                   Finished ostree-finalize-staged.service(is a noop)[active (exited)]   <-----------------(dependency met)-------------------                                   |
                                                                                                                                                             |                                   |
                                                                                                                                                             |                                   |
                                ostree-finalize-staged.service --(depends on, starts and runs)--> ostree-finalize-staged-hold.service starting --> ostree-finalize-staged-hold.service started   |
                                                ^                                                                                                                                                |
                                                |                                                                                                                                                |
                                             (triggers)                                                                                                                                          |                                                                                                                        |
                                                |                                                                                                                                                |
                                     ostree-finalize-staged.path                                                                                                                                 |
                                                |                                                                                                                                                |
                                            (monitors)                                                                                                                                           |
                                                |                                                                                                                                                |
                                                v                                                                                                                                                |
      rpm-ostree start-daemon --(writes)--> /run/ostree/staged-deployment <----------------------------------------------------------------------------------------------------------------------|
           ^
           |
       (starts and instructs via dbus)
           |
      rpm-ostree kargs
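
      To double-check this wiring on a live node, the unit definitions and the trigger path can be inspected directly. A minimal sketch, assuming shell access to an RHCOS node (for example via oc debug node/<node>):

      # ostree-finalize-staged.service does its real work in ExecStop and pulls in
      # the -hold service via a dependency; dump both unit files and the key properties.
      systemctl cat ostree-finalize-staged.service ostree-finalize-staged-hold.service
      systemctl show -p ExecStop,Wants,Requires,After ostree-finalize-staged.service

      # The path unit that watches for the staged-deployment marker written by the daemon.
      systemctl cat ostree-finalize-staged.path
      ls -l /run/ostree/staged-deployment 2>/dev/null || echo "nothing staged yet"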
      

      In the journal, on a healthy run from a lab of mine, this plays out as:

      Feb 19 15:04:28 ocp-on-osp.workload.bos2.lab systemd[1]: Starting Hold /boot Open for OSTree Finalize Staged Deployment... # <--- comes from ostree-finalize-staged-hold.service, pulled in via dep relationship with ostree-finalize-staged.service
      Feb 19 15:04:28 ocp-on-osp.workload.bos2.lab systemd[1]: Started Hold /boot Open for OSTree Finalize Staged Deployment.    # <--- remains running now
      Feb 19 15:04:28 ocp-on-osp.workload.bos2.lab systemd[1]: Finished OSTree Finalize Staged Deployment.                       # <--- is a noop, remains active (exited) - comes from ostree-finalize-staged.service
      (...)
      Feb 19 15:06:50 ocp-on-osp.workload.bos2.lab systemd[1]: Stopping OSTree Finalize Staged Deployment...                     # <--- stop logic starts now, i.e. ostree admin finalize-staged runs
      Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: ostree-finalize-staged.service: Deactivated successfully.
      Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: Stopped OSTree Finalize Staged Deployment.
      Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: Stopping Hold /boot Open for OSTree Finalize Staged Deployment... # <--- pulled in via dep, starts before, stops after
      Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: Stopped Hold /boot Open for OSTree Finalize Staged Deployment.    # <--- same
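
      To watch this ordering live while an update rolls out, the involved units can simply be followed in the journal (a convenience command, not part of the analysis):

      journalctl -f -u ostree-finalize-staged.path -u ostree-finalize-staged.service -u ostree-finalize-staged-hold.service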
      

      In the attached support case, this logic fails with the following log messages:

      Jan 08 00:28:44 worker0 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-worker-39e251cb69d9f430a38dd14b0d3bae3c.
      Jan 08 00:28:44 worker0 systemd[1]: Starting Hold /boot Open for OSTree Finalize Staged Deployment...
      Jan 08 00:28:44 worker0 root[15374]: machine-config-daemon[10214]: reboot successful
      Jan 08 00:28:44 worker0 systemd-logind[1796]: The system will reboot now!
      Jan 08 00:28:44 worker0 ovs-vswitchd[1947]: ovs|00075|connmgr|INFO|br-int<->unix#2: 2317 flow_mods in the 7 s starting 10 s ago (2111 adds, 206 deletes)
      Jan 08 00:28:44 worker0 kernel: ice 0000:6c:00.0 ens14f0: Setting MAC 36:4a:79:c2:61:e7 on VF 34. VF driver will be reinitialized
      Jan 08 00:28:44 worker0 kernel: iavf 0000:6c:05.2: Reset indication received from the PF
      Jan 08 00:28:44 worker0 kernel: iavf 0000:6c:05.2: Scheduling reset task
      Jan 08 00:28:44 worker0 kernel: iavf 0000:6c:05.2: Removing device
      Jan 08 00:28:44 worker0 systemd[1]: Started Hold /boot Open for OSTree Finalize Staged Deployment.
      Jan 08 00:28:44 worker0 systemd-logind[1796]: System is rebooting.
      Jan 08 00:28:44 worker0 systemd[1]: Requested transaction contradicts existing jobs: Transaction for ostree-finalize-staged.service/start is destructive (ostree-finalize-staged-hold.service has 'stop' job queued, but 'start' is included in transaction).
      Jan 08 00:28:44 worker0 systemd[1]: ostree-finalize-staged.path: Failed to queue unit startup job: Transaction for ostree-finalize-staged.service/start is destructive (ostree-finalize-staged-hold.service has 'stop' job queued, but 'start' is included in transaction).
      Jan 08 00:28:44 worker0 systemd[1]: ostree-finalize-staged.path: Failed with result 'resources'.   # <------------- here
      (...)
      Jan 08 00:29:01 worker0 systemd[1]: Stopping Hold /boot Open for OSTree Finalize Staged Deployment...
      (...)
      Jan 08 00:29:01 worker0 systemd[1]: Stopped Hold /boot Open for OSTree Finalize Staged Deployment.
      (...)
      

      The message `Requested transaction contradicts existing jobs: Transaction for ostree-finalize-staged.service/start is destructive (ostree-finalize-staged-hold.service has 'stop' job queued, but 'start' is included in transaction).` means that ostree-finalize-staged.service requires ostree-finalize-staged-hold.service to be running, but the reboot was triggered right around the time ostree-finalize-staged-hold.service completed its startup, and before ostree-finalize-staged.service resumed its own start sequence. At that point, the dependency relationship can never be fulfilled, and ostree-finalize-staged.service can never start. And because rpm-ostree applies all changes in the ExecStop of ostree-finalize-staged.service, those changes are never applied.

      Here's the sequence of events when it goes wrong:
      0) ostree-finalize-staged.path sees that /run/ostree/staged-deployment was created and wants to start ostree-finalize-staged.service
      1) ostree-finalize-staged.service wants to run, but first requires ostree-finalize-staged-hold.service to start
      2) ostree-finalize-staged-hold.service starts, as requested by ostree-finalize-staged.service
      3) reboot is triggered
      4) ostree-finalize-staged-hold.service finishes starting
      5) ----> ASSUMPTION: the reboot queues a stop job for ostree-finalize-staged-hold.service
      6) ostree-finalize-staged.service wants to run and requires ostree-finalize-staged-hold.service to be started, but that unit now has a 'stop' job queued
         => ostree-finalize-staged.service can never start
      7) ostree-finalize-staged.path fails with result 'resources' (because the unit it wanted to start, ostree-finalize-staged.service, can't start)
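
      To check whether a given node hit this failure mode, the previous boot's journal can be searched for the destructive-transaction message, and the expected kernel argument can be compared against the running kernel (a quick sketch; abc=def is simply the test argument used by the reproducer further below):

      # Look for the failed path unit / destructive transaction in the previous boot.
      journalctl -b -1 | grep -E "ostree-finalize-staged.path: Failed|Requested transaction contradicts existing jobs"

      # Compare what rpm-ostree has configured with what the kernel actually booted with.
      rpm-ostree kargs
      grep -o 'abc=def' /proc/cmdline || echo "expected kernel argument missing from the running kernel"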

      Regardless of why the systemd hold service is delayed so much that the reboot affects it ...

      A race can occur here because rpm-ostree kargs (or any rpm-ostree update operation) blocks only until the rpm-ostree daemon has written /run/ostree/staged-deployment (the changes are staged in this file);
      rpm-ostree then returns.
      Everything else happens asynchronously: ostree-finalize-staged.path detects that the file was created and starts ostree-finalize-staged.service, whose ExecStop runs /usr/bin/ostree admin finalize-staged.
      That last command is what actually applies the staged changes.
      The MCO daemon blocks only on rpm-ostree kargs; it does not wait for ostree-finalize-staged.service to be active.
      That means the MCO reboots the node immediately after the kernel argument changes are staged, without waiting for the service that will actually apply them to become active.
      Here lies the potential for a race condition which - I believe - the customer is hitting due to a very specific node configuration.
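
      One possible direction for hardening this (a sketch only, based on my understanding of the flow above, not the MCO's actual code): before issuing the reboot, wait until ostree-finalize-staged.service is active whenever a deployment was staged, since that unit's ExecStop is what applies the staged changes on shutdown:

      # Hypothetical guard before the MCD triggers the reboot.
      if [ -e /run/ostree/staged-deployment ]; then
          # Give the path unit time to start ostree-finalize-staged.service;
          # its ExecStop runs "ostree admin finalize-staged" during shutdown.
          for _ in $(seq 1 60); do
              [ "$(systemctl is-active ostree-finalize-staged.service)" = "active" ] && break
              sleep 1
          done
      fi
      systemctl reboot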

      While I cannot reproduce the exact error message from the customer's journal, it's very easy to reproduce a failure by exploiting this asynchronous mechanism.

      Spawn a cluster with cluster bot, e.g. `launch 4.16.30 aws,no-spot`, then run the following test:

      #!/bin/bash
      
      TIMEOUT=900
      
      echo "#####################################"
      echo "Creating MCP for a single worker node"
      echo "#####################################"
      
      first_node=$(oc get nodes -l node-role.kubernetes.io/worker= -o name | head -1)
      oc label "${first_node}" node-role.kubernetes.io/worker-test=
      
      cat <<EOF | oc apply -f -
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfigPool
      metadata:
        name: worker-test
      spec:
        machineConfigSelector:
          matchExpressions:
            - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-test]}
        nodeSelector:
          matchLabels:
            node-role.kubernetes.io/worker-test: ""
      EOF
      
      echo "#####################################"
      echo "   Creating reproducer conditions    "
      echo "#####################################"
      
      f=$(mktemp)
      cat <<EOF>$f
      [Service]
      ExecStartPre=/bin/bash -c "echo 'exec start pre'; /bin/sleep 15; echo 'exec start pre end'"
      EOF
      
      cat <<EOF | oc apply -f -
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        labels:
          machineconfiguration.openshift.io/role: worker-test
        name: 99-worker-test-smoketest-prep
      spec:
        baseOSExtensionsContainerImage: ""
        config:
          ignition:
            config:
              replace:
                verification: {}
            proxy: {}
            security:
              tls: {}
            timeouts: {}
            version: 3.2.0
          passwd: {}
          storage:
            files:
            - contents:
                source: data:text/plain;charset=utf-8;base64,$(cat $f | base64 -w0)
                verification: {}
              group: {}
              mode: 600
              path: /etc/systemd/system/ostree-finalize-staged-hold.service.d/override.conf
      EOF
      
      echo "Sleeping for a bit ..."
      sleep 60
      echo "Waiting for MCP to be updated"
      oc wait --for=condition=Updated=true mcp/worker-test --timeout=${TIMEOUT}s
      
      echo "#####################################"
      echo "      Updating kernel arguments      "
      echo "#####################################"
      cat <<EOF | oc apply -f -
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        labels:
          machineconfiguration.openshift.io/role: worker-test
        name: 99-worker-test-kernelarg-smoketest
      spec:
        kernelArguments:
          - abc=def
      EOF
      
      echo "Sleeping for a bit ..."
      sleep 60
      echo "Waiting for MCP to be updated"
      oc wait --for=condition=Updated=true mcp/worker-test --timeout=${TIMEOUT}s &
      oc wait --for=condition=Degraded=true mcp/worker-test --timeout=${TIMEOUT}s &
      wait -n
      
      echo "#####################################"
      echo "             End of test             "
      echo "#####################################"
      oc get mcp
      

      Whereas the above test delays ostree-finalize-staged-hold.service by 15 seconds, and it does not reproduce the exact customer error message ...
      It still manages to reproduce an issue where the rpm-ostree staged changes are not applied. The 15 second sleep is of course extreme.
      But it demonstrates that any delay in starting the -hold service can lead to a race condition which makes the whole process fail.

      With the above test, the MCP will go into degraded state with:

      [akaris@workstation 04029338 (master)]$ oc get mcp worker-test -o yaml | grep abc -C4
          type: Updating
        - lastTransitionTime: "2025-02-21T12:13:03Z"
          message: 'Node ip-10-0-107-203.us-west-2.compute.internal is reporting: "unexpected
            on-disk state validating against rendered-worker-test-bf73a6e5a7892ac4151522bbf11a8f72:
            missing expected kernel arguments: [abc=def]"'
          reason: 1 nodes are reporting degraded status on sync
          status: "True"
          type: NodeDegraded
        - lastTransitionTime: "2025-02-21T12:13:03Z"
      

      In the journal:

      Feb 21 12:09:11 ip-10-0-107-203 root[8530]: machine-config-daemon[2561]: Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1="all" --delete=psi=0 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1="all" --append=psi=0 --append=abc=def]
      Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: client(id:machine-config-operator dbus:1.254 unit:crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope uid:0) added; new total=1
      Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Loaded sysroot
      Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Locked sysroot
      Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Initiated txn KernelArgs for client(id:machine-config-operator dbus:1.254 unit:crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope uid:0): /org/projectatomic/rpmostree1/rhcos
      Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Process [pid: 8531 uid: 0 unit: crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope] connected to transaction progress
      Feb 21 12:09:11 ip-10-0-107-203 kubenswrapper[2259]: I0221 12:09:11.564171    2259 patch_prober.go:28] interesting pod/loki-promtail-z7ll9 container/promtail namespace/openshift-e2e-loki: Readiness probe status=failure output="Get \"http://10.129.2.3:3101/ready\": context deadline exceeded (Client.Timeout exceeded wh
      ile awaiting headers)" start-of-body=
      Feb 21 12:09:11 ip-10-0-107-203 kubenswrapper[2259]: I0221 12:09:11.564237    2259 prober.go:107] "Probe failed" probeType="Readiness" pod="openshift-e2e-loki/loki-promtail-z7ll9" podUID="c31172d3-6a1e-457a-8e66-c49db072c977" containerName="promtail" probeResult="failure" output="Get \"http://10.129.2.3:3101/ready\":
       context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
      Feb 21 12:09:16 ip-10-0-107-203 systemd[1]: Starting Hold /boot Open for OSTree Finalize Staged Deployment...
      Feb 21 12:09:16 ip-10-0-107-203 bash[8598]: exec start pre
      Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Created new deployment /ostree/deploy/rhcos/deploy/7c3910540178cc5e35b9c819a09cb58831b264384e9ff7a071a73a379dc32718.4
      Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: sanitycheck(/usr/bin/true) successful
      Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Txn KernelArgs on /org/projectatomic/rpmostree1/rhcos successful
      Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Unlocked sysroot
      Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Process [pid: 8531 uid: 0 unit: crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope] disconnected from transaction progress
      Feb 21 12:09:17 ip-10-0-107-203 rpm-ostree[6705]: client(id:machine-config-operator dbus:1.254 unit:crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope uid:0) vanished; remaining=0
      Feb 21 12:09:17 ip-10-0-107-203 rpm-ostree[6705]: In idle state; will auto-exit in 63 seconds
      Feb 21 12:09:17 ip-10-0-107-203 root[8608]: machine-config-daemon[2561]: Rebooting node
      Feb 21 12:09:17 ip-10-0-107-203 root[8609]: machine-config-daemon[2561]: initiating reboot: Node will reboot into config rendered-worker-test-bf73a6e5a7892ac4151522bbf11a8f72
      Feb 21 12:09:17 ip-10-0-107-203 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-worker-test-bf73a6e5a7892ac4151522bbf11a8f72.
      Feb 21 12:09:17 ip-10-0-107-203 root[8612]: machine-config-daemon[2561]: reboot successful
      Feb 21 12:09:17 ip-10-0-107-203 systemd-logind[913]: The system will reboot now!
      Feb 21 12:09:17 ip-10-0-107-203 systemd-logind[913]: System is rebooting.
      

      And here are the MCO daemon logs:

      2025-02-21T12:09:11.464509464+00:00 stderr F I0221 12:09:11.464485    2561 update.go:2641] Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1="all" --delete=psi=0 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1="all" --append=psi=0 --append=abc=def]
      2025-02-21T12:09:11.466273927+00:00 stderr F I0221 12:09:11.466231    2561 update.go:2626] Running: rpm-ostree kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1="all" --delete=psi=0 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1="all" --append=psi=0 --append=abc=def
      2025-02-21T12:09:16.367223780+00:00 stdout F Staging deployment...done
      2025-02-21T12:09:17.184102954+00:00 stdout F Changes queued for next boot. Run "systemctl reboot" to start a reboot
      

      So we can see here that the MCO daemon reboots the node even though rpm-ostree isn't ready yet: the service that actually applies the staged changes, ostree-finalize-staged.service, is not yet active.
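
      After the node comes back up, the lost staging is also easy to confirm directly on the reproducer node (a quick check, assuming oc debug access to the test node):

      # The node is still on the old deployment and the test karg is missing from the kernel.
      rpm-ostree status
      grep -o 'abc=def' /proc/cmdline || echo "abc=def was not applied"
      # The previous boot shows the hold/finalize units never completed their stop logic as intended.
      journalctl -b -1 -u ostree-finalize-staged.service -u ostree-finalize-staged-hold.service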

              Jonathan Lebon (jlebon1@redhat.com)
              Andreas Karis (akaris@redhat.com)
              Sergio Regidor de la Rosa