OpenShift Bugs / OCPBUGS-51150

Race condition in rpm-ostree update logic


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version: 4.16.z
    • Severity: Critical
    • Sprint: MCO Sprint 268

      TL;DR: I suspect that a combination of resource constraints and thousands of systemd services can trigger an underlying race between rpm-ostree, the Machine Config Operator (MCO), and the moment at which we actually reboot nodes.

      Here's the graph of how an rpm-ostree update works according to my understanding (read bottom to top):

      Note: Red Hat viewers, please see https://drive.google.com/file/d/1zYirNlsbFsBRNRZ8MVLugdsTsMV6yCQR/view?usp=sharing for an ASCII version that wasn't disfigured by Jira's horrible UI ...

                                   ostree-finalize-staged-hold.service stopped
                                                ^
                                                |
                                        ostree-finalize-staged-hold.service stopping
                                                ^
                                                |
                                      Stopped ostree-finalize-staged.service <------------------------
                                                                                                     |
                                                                                                     | 
                                      Stopping ostree-finalize-staged.service  --(runs)--> /usr/bin/ostree admin finalize-staged -----------------------------(reads)----------------------------|
                                                ^                                                                                                                                                |
                                                |                                                                                                                                                |
                                      (on service stopping, e.g. reboot)                                                                                                                         |
                                                |                                                                                                                                                |
                                   Finished ostree-finalize-staged.service(is a noop)[active (exited)]   <-----------------(dependency met)-------------------                                   |
                                                                                                                                                             |                                   |
                                                                                                                                                             |                                   |
                                ostree-finalize-staged.service --(depends on, starts and runs)--> ostree-finalize-staged-hold.service starting --> ostree-finalize-staged-hold.service started   |
                                                ^                                                                                                                                                |
                                                |                                                                                                                                                |
                                             (triggers)                                                                                                                                          |                                                                                                                        |
                                                |                                                                                                                                                |
                                     ostree-finalize-staged.path                                                                                                                                 |
                                                |                                                                                                                                                |
                                            (monitors)                                                                                                                                           |
                                                |                                                                                                                                                |
                                                v                                                                                                                                                |
      rpm-ostree start-daemon --(writes)--> /run/ostree/staged-deployment <----------------------------------------------------------------------------------------------------------------------|
           ^
           |
       (starts and instructs via dbus)
           |
      rpm-ostree kargs
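
      To double-check this wiring on a live node, the unit definitions and the trigger path can be inspected directly. A minimal sketch, assuming shell access to an RHCOS node (for example via oc debug node/<node>):

      # ostree-finalize-staged.service does its real work in ExecStop and pulls in
      # the -hold service via a dependency; dump both unit files and the key properties.
      systemctl cat ostree-finalize-staged.service ostree-finalize-staged-hold.service
      systemctl show -p ExecStop,Wants,Requires,After ostree-finalize-staged.service

      # The path unit that watches for the staged-deployment marker written by the daemon.
      systemctl cat ostree-finalize-staged.path
      ls -l /run/ostree/staged-deployment 2>/dev/null || echo "nothing staged yet"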
      

      In the journal, on a healthy run from a lab of mine, this plays out as:

      Feb 19 15:04:28 ocp-on-osp.workload.bos2.lab systemd[1]: Starting Hold /boot Open for OSTree Finalize Staged Deployment... # <--- comes from ostree-finalize-staged-hold.service, pulled in via dep relationship with ostree-finalize-staged.service
      Feb 19 15:04:28 ocp-on-osp.workload.bos2.lab systemd[1]: Started Hold /boot Open for OSTree Finalize Staged Deployment.    # <--- remains running now
      Feb 19 15:04:28 ocp-on-osp.workload.bos2.lab systemd[1]: Finished OSTree Finalize Staged Deployment.                       # <--- is a noop, remains active (exited) - comes from ostree-finalize-staged.service
      (...)
      Feb 19 15:06:50 ocp-on-osp.workload.bos2.lab systemd[1]: Stopping OSTree Finalize Staged Deployment...                     # <--- stop logic starts now, i.e. ostree admin finalize-staged runs
      Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: ostree-finalize-staged.service: Deactivated successfully.
      Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: Stopped OSTree Finalize Staged Deployment.
      Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: Stopping Hold /boot Open for OSTree Finalize Staged Deployment... # <--- pulled in via dep, starts before, stops after
      Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: Stopped Hold /boot Open for OSTree Finalize Staged Deployment.    # <--- same
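
      To watch this ordering live while an update rolls out, the involved units can simply be followed in the journal (a convenience command, not part of the analysis):

      journalctl -f -u ostree-finalize-staged.path -u ostree-finalize-staged.service -u ostree-finalize-staged-hold.service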
      

      In the attached support case, this logic fails with the following log messages:

      Jan 08 00:28:44 worker0 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-worker-39e251cb69d9f430a38dd14b0d3bae3c.
      Jan 08 00:28:44 worker0 systemd[1]: Starting Hold /boot Open for OSTree Finalize Staged Deployment...
      Jan 08 00:28:44 worker0 root[15374]: machine-config-daemon[10214]: reboot successful
      Jan 08 00:28:44 worker0 systemd-logind[1796]: The system will reboot now!
      Jan 08 00:28:44 worker0 ovs-vswitchd[1947]: ovs|00075|connmgr|INFO|br-int<->unix#2: 2317 flow_mods in the 7 s starting 10 s ago (2111 adds, 206 deletes)
      Jan 08 00:28:44 worker0 kernel: ice 0000:6c:00.0 ens14f0: Setting MAC 36:4a:79:c2:61:e7 on VF 34. VF driver will be reinitialized
      Jan 08 00:28:44 worker0 kernel: iavf 0000:6c:05.2: Reset indication received from the PF
      Jan 08 00:28:44 worker0 kernel: iavf 0000:6c:05.2: Scheduling reset task
      Jan 08 00:28:44 worker0 kernel: iavf 0000:6c:05.2: Removing device
      Jan 08 00:28:44 worker0 systemd[1]: Started Hold /boot Open for OSTree Finalize Staged Deployment.
      Jan 08 00:28:44 worker0 systemd-logind[1796]: System is rebooting.
      Jan 08 00:28:44 worker0 systemd[1]: Requested transaction contradicts existing jobs: Transaction for ostree-finalize-staged.service/start is destructive (ostree-finalize-staged-hold.service has 'stop' job queued, but 'start' is included in transaction).
      Jan 08 00:28:44 worker0 systemd[1]: ostree-finalize-staged.path: Failed to queue unit startup job: Transaction for ostree-finalize-staged.service/start is destructive (ostree-finalize-staged-hold.service has 'stop' job queued, but 'start' is included in transaction).
      Jan 08 00:28:44 worker0 systemd[1]: ostree-finalize-staged.path: Failed with result 'resources'.   # <------------- here
      (...)
      Jan 08 00:29:01 worker0 systemd[1]: Stopping Hold /boot Open for OSTree Finalize Staged Deployment...
      (...)
      Jan 08 00:29:01 worker0 systemd[1]: Stopped Hold /boot Open for OSTree Finalize Staged Deployment.
      (...)
      

      The message `Requested transaction contradicts existing jobs: Transaction for ostree-finalize-staged.service/start is destructive (ostree-finalize-staged-hold.service has 'stop' job queued, but 'start' is included in transaction).` means that ostree-finalize-staged.service requires ostree-finalize-staged-hold.service to be running, but the reboot was triggered right around the time ostree-finalize-staged-hold.service completed its startup, and before ostree-finalize-staged.service resumed its own start sequence. At that point, the dependency relationship can never be fulfilled, and ostree-finalize-staged.service can never start. And because rpm-ostree applies all changes in the ExecStop of ostree-finalize-staged.service, those changes are never applied.

      Here's the sequence of events when it goes wrong:
      0) ostree-finalize-staged.path sees that /run/ostree/staged-deployment was created and wants to start ostree-finalize-staged.service
      1) ostree-finalize-staged.service wants to run, but first requires ostree-finalize-staged-hold.service to start
      2) ostree-finalize-staged-hold.service starts, as requested by ostree-finalize-staged.service
      3) reboot is triggered
      4) ostree-finalize-staged-hold.service finishes starting
      5) ----> ASSUMPTION: the reboot queues a stop job for ostree-finalize-staged-hold.service
      6) ostree-finalize-staged.service wants to run and requires ostree-finalize-staged-hold.service to be started, but that unit now has a 'stop' job queued
         => ostree-finalize-staged.service can never start
      7) ostree-finalize-staged.path fails with result 'resources' (because the unit it wanted to start, ostree-finalize-staged.service, can't start)
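
      To check whether a given node hit this failure mode, the previous boot's journal can be searched for the destructive-transaction message, and the expected kernel argument can be compared against the running kernel (a quick sketch; abc=def is simply the test argument used by the reproducer further below):

      # Look for the failed path unit / destructive transaction in the previous boot.
      journalctl -b -1 | grep -E "ostree-finalize-staged.path: Failed|Requested transaction contradicts existing jobs"

      # Compare what rpm-ostree has configured with what the kernel actually booted with.
      rpm-ostree kargs
      grep -o 'abc=def' /proc/cmdline || echo "expected kernel argument missing from the running kernel"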

      Regardless of why the systemd hold service is delayed so much that the reboot affects it ...

      A race can occur here because rpm-ostree kargs (or any rpm-ostree update operation) blocks only until the rpm-ostree daemon has written /run/ostree/staged-deployment (the changes are staged in this file);
      rpm-ostree then returns.
      Everything else happens asynchronously: ostree-finalize-staged.path detects that the file was created and starts ostree-finalize-staged.service, whose ExecStop runs /usr/bin/ostree admin finalize-staged.
      That last command is what actually applies the staged changes.
      The MCO daemon blocks only on rpm-ostree kargs; it does not wait for ostree-finalize-staged.service to be active.
      That means the MCO reboots the node immediately after the kernel argument changes are staged, without waiting for the service that will actually apply them to become active.
      Here lies the potential for a race condition which - I believe - the customer is hitting due to a very specific node configuration.
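
      One possible direction for hardening this (a sketch only, based on my understanding of the flow above, not the MCO's actual code): before issuing the reboot, wait until ostree-finalize-staged.service is active whenever a deployment was staged, since that unit's ExecStop is what applies the staged changes on shutdown:

      # Hypothetical guard before the MCD triggers the reboot.
      if [ -e /run/ostree/staged-deployment ]; then
          # Give the path unit time to start ostree-finalize-staged.service;
          # its ExecStop runs "ostree admin finalize-staged" during shutdown.
          for _ in $(seq 1 60); do
              [ "$(systemctl is-active ostree-finalize-staged.service)" = "active" ] && break
              sleep 1
          done
      fi
      systemctl reboot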

      While I cannot reproduce the exact error message from the customer's journal, it's very easy to reproduce a failure by exploiting this asynchronous mechanism.

      Spawn a cluster with cluster bot, e.g. `launch 4.16.30 aws,no-spot`, then run the following test:

      #!/bin/bash
      
      TIMEOUT=900
      
      echo "#####################################"
      echo "Creating MCP for a single worker node"
      echo "#####################################"
      
      first_node=$(oc get nodes -l node-role.kubernetes.io/worker= -o name | head -1)
      oc label "${first_node}" node-role.kubernetes.io/worker-test=
      
      cat <<EOF | oc apply -f -
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfigPool
      metadata:
        name: worker-test
      spec:
        machineConfigSelector:
          matchExpressions:
            - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-test]}
        nodeSelector:
          matchLabels:
            node-role.kubernetes.io/worker-test: ""
      EOF
      
      echo "#####################################"
      echo "   Creating reproducer conditions    "
      echo "#####################################"
      
      f=$(mktemp)
      cat <<EOF>$f
      [Service]
      ExecStartPre=/bin/bash -c "echo 'exec start pre'; /bin/sleep 15; echo 'exec start pre end'"
      EOF
      
      cat <<EOF | oc apply -f -
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        labels:
          machineconfiguration.openshift.io/role: worker-test
        name: 99-worker-test-smoketest-prep
      spec:
        baseOSExtensionsContainerImage: ""
        config:
          ignition:
            config:
              replace:
                verification: {}
            proxy: {}
            security:
              tls: {}
            timeouts: {}
            version: 3.2.0
          passwd: {}
          storage:
            files:
            - contents:
                source: data:text/plain;charset=utf-8;base64,$(cat $f | base64 -w0)
                verification: {}
              group: {}
              mode: 600
              path: /etc/systemd/system/ostree-finalize-staged-hold.service.d/override.conf
      EOF
      
      echo "Sleeping for a bit ..."
      sleep 60
      echo "Waiting for MCP to be updated"
      oc wait --for=condition=Updated=true mcp/worker-test --timeout=${TIMEOUT}s
      
      echo "#####################################"
      echo "      Updating kernel arguments      "
      echo "#####################################"
      cat <<EOF | oc apply -f -
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        labels:
          machineconfiguration.openshift.io/role: worker-test
        name: 99-worker-test-kernelarg-smoketest
      spec:
        kernelArguments:
          - abc=def
      EOF
      
      echo "Sleeping for a bit ..."
      sleep 60
      echo "Waiting for MCP to be updated"
      oc wait --for=condition=Updated=true mcp/worker-test --timeout=${TIMEOUT}s &
      oc wait --for=condition=Degraded=true mcp/worker-test --timeout=${TIMEOUT}s &
      wait -n
      
      echo "#####################################"
      echo "             End of test             "
      echo "#####################################"
      oc get mcp
      

      Whereas the above test delays ostree-finalize-staged-hold.service by 15 seconds, and it does not reproduce the exact customer error message ...
      It still manages to reproduce an issue where the rpm-ostree staged changes are not applied. The 15 second sleep is of course extreme.
      But it demonstrates that any delay in starting the -hold service can lead to a race condition which makes the whole process fail.

      With the above test, the MCP will go into degraded state with:

      [akaris@workstation 04029338 (master)]$ oc get mcp worker-test -o yaml | grep abc -C4
          type: Updating
        - lastTransitionTime: "2025-02-21T12:13:03Z"
          message: 'Node ip-10-0-107-203.us-west-2.compute.internal is reporting: "unexpected
            on-disk state validating against rendered-worker-test-bf73a6e5a7892ac4151522bbf11a8f72:
            missing expected kernel arguments: [abc=def]"'
          reason: 1 nodes are reporting degraded status on sync
          status: "True"
          type: NodeDegraded
        - lastTransitionTime: "2025-02-21T12:13:03Z"
      

      In the journal:

      Feb 21 12:09:11 ip-10-0-107-203 root[8530]: machine-config-daemon[2561]: Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1="all" --delete=psi=0 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1="all" --append=psi=0 --append=abc=def]
      Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: client(id:machine-config-operator dbus:1.254 unit:crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope uid:0) added; new total=1
      Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Loaded sysroot
      Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Locked sysroot
      Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Initiated txn KernelArgs for client(id:machine-config-operator dbus:1.254 unit:crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope uid:0): /org/projectatomic/rpmostree1/rhcos
      Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Process [pid: 8531 uid: 0 unit: crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope] connected to transaction progress
      Feb 21 12:09:11 ip-10-0-107-203 kubenswrapper[2259]: I0221 12:09:11.564171    2259 patch_prober.go:28] interesting pod/loki-promtail-z7ll9 container/promtail namespace/openshift-e2e-loki: Readiness probe status=failure output="Get \"http://10.129.2.3:3101/ready\": context deadline exceeded (Client.Timeout exceeded wh
      ile awaiting headers)" start-of-body=
      Feb 21 12:09:11 ip-10-0-107-203 kubenswrapper[2259]: I0221 12:09:11.564237    2259 prober.go:107] "Probe failed" probeType="Readiness" pod="openshift-e2e-loki/loki-promtail-z7ll9" podUID="c31172d3-6a1e-457a-8e66-c49db072c977" containerName="promtail" probeResult="failure" output="Get \"http://10.129.2.3:3101/ready\":
       context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
      Feb 21 12:09:16 ip-10-0-107-203 systemd[1]: Starting Hold /boot Open for OSTree Finalize Staged Deployment...
      Feb 21 12:09:16 ip-10-0-107-203 bash[8598]: exec start pre
      Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Created new deployment /ostree/deploy/rhcos/deploy/7c3910540178cc5e35b9c819a09cb58831b264384e9ff7a071a73a379dc32718.4
      Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: sanitycheck(/usr/bin/true) successful
      Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Txn KernelArgs on /org/projectatomic/rpmostree1/rhcos successful
      Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Unlocked sysroot
      Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Process [pid: 8531 uid: 0 unit: crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope] disconnected from transaction progress
      Feb 21 12:09:17 ip-10-0-107-203 rpm-ostree[6705]: client(id:machine-config-operator dbus:1.254 unit:crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope uid:0) vanished; remaining=0
      Feb 21 12:09:17 ip-10-0-107-203 rpm-ostree[6705]: In idle state; will auto-exit in 63 seconds
      Feb 21 12:09:17 ip-10-0-107-203 root[8608]: machine-config-daemon[2561]: Rebooting node
      Feb 21 12:09:17 ip-10-0-107-203 root[8609]: machine-config-daemon[2561]: initiating reboot: Node will reboot into config rendered-worker-test-bf73a6e5a7892ac4151522bbf11a8f72
      Feb 21 12:09:17 ip-10-0-107-203 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-worker-test-bf73a6e5a7892ac4151522bbf11a8f72.
      Feb 21 12:09:17 ip-10-0-107-203 root[8612]: machine-config-daemon[2561]: reboot successful
      Feb 21 12:09:17 ip-10-0-107-203 systemd-logind[913]: The system will reboot now!
      Feb 21 12:09:17 ip-10-0-107-203 systemd-logind[913]: System is rebooting.
      

      And here are the MCO daemon logs:

      2025-02-21T12:09:11.464509464+00:00 stderr F I0221 12:09:11.464485    2561 update.go:2641] Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1="all" --delete=psi=0 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1="all" --append=psi=0 --append=abc=def]
      2025-02-21T12:09:11.466273927+00:00 stderr F I0221 12:09:11.466231    2561 update.go:2626] Running: rpm-ostree kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1="all" --delete=psi=0 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1="all" --append=psi=0 --append=abc=def
      2025-02-21T12:09:16.367223780+00:00 stdout F Staging deployment...done
      2025-02-21T12:09:17.184102954+00:00 stdout F Changes queued for next boot. Run "systemctl reboot" to start a reboot
      

      So we can see here that the MCO daemon reboots the node even though rpm-ostree isn't ready yet: the service that actually applies the staged changes, ostree-finalize-staged.service, is not yet active.
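
      After the node comes back up, the lost staging is also easy to confirm directly on the reproducer node (a quick check, assuming oc debug access to the test node):

      # The node is still on the old deployment and the test karg is missing from the kernel.
      rpm-ostree status
      grep -o 'abc=def' /proc/cmdline || echo "abc=def was not applied"
      # The previous boot shows the hold/finalize units never completed their stop logic as intended.
      journalctl -b -1 -u ostree-finalize-staged.service -u ostree-finalize-staged-hold.service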

              Jonathan Lebon (jlebon1@redhat.com)
              Andreas Karis (akaris@redhat.com)
              Sergio Regidor de la Rosa