      Release note (Bug Fix): Previously, when the MCO rebooted the node too quickly after staging an update, the update failed. With this release, the MCO waits for the staging operation to finish before rebooting the system, allowing the update to complete.

      This is a clone of issue OCPBUGS-53111. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-51150. The following is the description of the original issue:

      TL;DR: I suspect that a combination of resource constraints and thousands of systemd services can trigger an underlying race in the interaction between rpm-ostree, the Machine Config Operator (MCO), and the actual node reboot.

      Here's the graph of how an rpm-ostree update works according to my understanding (read bottom to top):

      Note: Red Hat viewers, please see https://drive.google.com/file/d/1zYirNlsbFsBRNRZ8MVLugdsTsMV6yCQR/view?usp=sharing for an ASCII version that was not disfigured by Jira's UI.

                                   ostree-finalize-staged-hold.service stopped
                                                ^
                                                |
                                        ostree-finalize-staged-hold.service stopping
                                                ^
                                                |
                                      Stopped ostree-finalize-staged.service <------------------------
                                                                                                     |
                                                                                                     | 
                                      Stopping ostree-finalize-staged.service  --(runs)--> /usr/bin/ostree admin finalize-staged -----------------------------(reads)----------------------------|
                                                ^                                                                                                                                                |
                                                |                                                                                                                                                |
                                      (on service stopping, e.g. reboot)                                                                                                                         |
                                                |                                                                                                                                                |
                                   Finished ostree-finalize-staged.service(is a noop)[active (exited)]   <-----------------(dependency met)-------------------                                   |
                                                                                                                                                             |                                   |
                                                                                                                                                             |                                   |
                                ostree-finalize-staged.service --(depends on, starts and runs)--> ostree-finalize-staged-hold.service starting --> ostree-finalize-staged-hold.service started   |
                                                ^                                                                                                                                                |
                                                |                                                                                                                                                |
                                             (triggers)                                                                                                                                           |
                                                |                                                                                                                                                |
                                     ostree-finalize-staged.path                                                                                                                                 |
                                                |                                                                                                                                                |
                                            (monitors)                                                                                                                                           |
                                                |                                                                                                                                                |
                                                v                                                                                                                                                |
      rpm-ostree start-daemon --(writes)--> /run/ostree/staged-deployment <----------------------------------------------------------------------------------------------------------------------|
           ^
           |
       (starts and instructs via dbus)
           |
      rpm-ostree kargs
      

      In the journal, on a healthy run from a lab of mine, this plays out as:

      Feb 19 15:04:28 ocp-on-osp.workload.bos2.lab systemd[1]: Starting Hold /boot Open for OSTree Finalize Staged Deployment... # <--- comes from ostree-finalize-staged-hold.service, pulled in via dep relationship with ostree-finalize-staged.service
      Feb 19 15:04:28 ocp-on-osp.workload.bos2.lab systemd[1]: Started Hold /boot Open for OSTree Finalize Staged Deployment.    # <--- remains running now
      Feb 19 15:04:28 ocp-on-osp.workload.bos2.lab systemd[1]: Finished OSTree Finalize Staged Deployment.                       # <--- is a noop, remains active (exited) - comes from ostree-finalize-staged.service
      (...)
      Feb 19 15:06:50 ocp-on-osp.workload.bos2.lab systemd[1]: Stopping OSTree Finalize Staged Deployment...                     # <--- stop logic starts now, meaning ostree admin finalize-staged runs
      Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: ostree-finalize-staged.service: Deactivated successfully.
      Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: Stopped OSTree Finalize Staged Deployment.
      Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: Stopping Hold /boot Open for OSTree Finalize Staged Deployment... # <--- pulled in via dep, starts before, stops after
      Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: Stopped Hold /boot Open for OSTree Finalize Staged Deployment.    # <--- same
      

      In the attached support case, this logic fails with the following log messages:

      Jan 08 00:28:44 worker0 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-worker-39e251cb69d9f430a38dd14b0d3bae3c.
      Jan 08 00:28:44 worker0 systemd[1]: Starting Hold /boot Open for OSTree Finalize Staged Deployment...
      Jan 08 00:28:44 worker0 root[15374]: machine-config-daemon[10214]: reboot successful
      Jan 08 00:28:44 worker0 systemd-logind[1796]: The system will reboot now!
      Jan 08 00:28:44 worker0 ovs-vswitchd[1947]: ovs|00075|connmgr|INFO|br-int<->unix#2: 2317 flow_mods in the 7 s starting 10 s ago (2111 adds, 206 deletes)
      Jan 08 00:28:44 worker0 kernel: ice 0000:6c:00.0 ens14f0: Setting MAC 36:4a:79:c2:61:e7 on VF 34. VF driver will be reinitialized
      Jan 08 00:28:44 worker0 kernel: iavf 0000:6c:05.2: Reset indication received from the PF
      Jan 08 00:28:44 worker0 kernel: iavf 0000:6c:05.2: Scheduling reset task
      Jan 08 00:28:44 worker0 kernel: iavf 0000:6c:05.2: Removing device
      Jan 08 00:28:44 worker0 systemd[1]: Started Hold /boot Open for OSTree Finalize Staged Deployment.
      Jan 08 00:28:44 worker0 systemd-logind[1796]: System is rebooting.
      Jan 08 00:28:44 worker0 systemd[1]: Requested transaction contradicts existing jobs: Transaction for ostree-finalize-staged.service/start is destructive (ostree-finalize-staged-hold.service has 'stop' job queued, but 'start' is included in transaction).
      Jan 08 00:28:44 worker0 systemd[1]: ostree-finalize-staged.path: Failed to queue unit startup job: Transaction for ostree-finalize-staged.service/start is destructive (ostree-finalize-staged-hold.service has 'stop' job queued, but 'start' is included in transaction).
      Jan 08 00:28:44 worker0 systemd[1]: ostree-finalize-staged.path: Failed with result 'resources'.   # <------------- here
      (...)
      Jan 08 00:29:01 worker0 systemd[1]: Stopping Hold /boot Open for OSTree Finalize Staged Deployment...
      (...)
      Jan 08 00:29:01 worker0 systemd[1]: Stopped Hold /boot Open for OSTree Finalize Staged Deployment.
      (...)
      

      The message `Requested transaction contradicts existing jobs: Transaction for ostree-finalize-staged.service/start is destructive (ostree-finalize-staged-hold.service has 'stop' job queued, but 'start' is included in transaction).` means that ostree-finalize-staged requires ostree-finalize-staged-hold to be running, but the reboot was triggered while ostree-finalize-staged-hold was still starting, and before ostree-finalize-staged resumed its own start sequence. At that point the dependency can never be fulfilled, and ostree-finalize-staged can never start. Because rpm-ostree applies all changes in the ExecStop of ostree-finalize-staged, the staged changes are never applied.

      Here's the sequence of events when it goes wrong (see the inspection sketch below):
      0) ostree-finalize-staged.path sees that /run/ostree/staged-deployment was created and wants to start ostree-finalize-staged.service
      1) ostree-finalize-staged.service wants to run, but first requires ostree-finalize-staged-hold.service to start
      2) ostree-finalize-staged-hold.service starts, as requested by ostree-finalize-staged.service
      3) a reboot is triggered
      4) ostree-finalize-staged-hold.service reaches the started state
      5) ----> ASSUMPTION: the reboot queues a stop job for ostree-finalize-staged-hold.service
      6) ostree-finalize-staged.service wants to run, but it requires ostree-finalize-staged-hold.service to be started, and that unit now has a 'stop' job queued; ostree-finalize-staged.service can never start
      7) ostree-finalize-staged.path fails with result 'resources' (because the unit it wanted to start, ostree-finalize-staged.service, cannot start)
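
      To see this on the node, the units and queued jobs can be inspected with standard systemd tooling. This is a minimal sketch, assuming shell access to the node (for example via oc debug node/<node> followed by chroot /host); the commands are generic systemctl/journalctl calls, not MCO-specific:

      # Show the unit definitions, including the dependency between
      # ostree-finalize-staged.service and ostree-finalize-staged-hold.service:
      systemctl cat ostree-finalize-staged.path ostree-finalize-staged.service ostree-finalize-staged-hold.service

      # Show the dependency tree of the finalize service:
      systemctl list-dependencies ostree-finalize-staged.service

      # While the race is happening, list queued jobs to see the conflicting
      # start/stop jobs that make the transaction "destructive":
      systemctl list-jobs

      # Check the journal of the involved units, e.g. for "Failed with result 'resources'":
      journalctl -b -u ostree-finalize-staged.path -u ostree-finalize-staged.service -u ostree-finalize-staged-hold.service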

      Regardless of why the systemd hold service is delayed so much that the reboot affects it ...

      A race can occur here because rpm-ostree kargs (or any rpm-ostree update operation) blocks only until the rpm-ostree daemon has written /run/ostree/staged-deployment (the changes are staged in that file); rpm-ostree then returns.
      Everything else happens asynchronously: ostree-finalize-staged.path detects that the file was created and starts ostree-finalize-staged.service, whose ExecStop runs /usr/bin/ostree admin finalize-staged. That last command is what actually applies the staged changes.
      The MCO daemon blocks only on rpm-ostree kargs; it does not wait for ostree-finalize-staged.service to become active.
      This means the MCO reboots the node immediately after the kernel-argument changes were staged, without waiting for the service that will apply them to become active.
      Here lies the potential for a race condition, which - I believe - the customer is hitting due to a very specific node configuration.
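
      To make the missing synchronization concrete, here is a minimal sketch of the kind of check that would close the window, assuming a shell on the node. It only illustrates the idea (wait for ostree-finalize-staged.service to become active before rebooting); it is not the MCO's actual implementation, and the abc=def kernel argument is just the one used in the reproducer below:

      # Stage the change; this returns as soon as /run/ostree/staged-deployment is written.
      rpm-ostree kargs --append=abc=def

      # Wait (with a timeout) until the finalize service is active (exited), so that its
      # ExecStop (ostree admin finalize-staged) is guaranteed to run during shutdown.
      for i in $(seq 1 30); do
          systemctl is-active --quiet ostree-finalize-staged.service && break
          sleep 1
      done

      # Only now trigger the reboot.
      systemctl reboot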

      While I cannot reproduce the exact error message from the customer's journal, it is easy to reproduce a failure by exploiting this asynchronous mechanism.

      Spawn a cluster with cluster bot, e.g. `launch 4.16.30 aws,no-spot`, then run the following test:

      #!/bin/bash
      
      TIMEOUT=900
      
      echo "#####################################"
      echo "Creating MCP for a single worker node"
      echo "#####################################"
      
      first_node=$(oc get nodes -l node-role.kubernetes.io/worker= -o name | head -1)
      oc label "${first_node}" node-role.kubernetes.io/worker-test=
      
      cat <<EOF | oc apply -f -
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfigPool
      metadata:
        name: worker-test
      spec:
        machineConfigSelector:
          matchExpressions:
            - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-test]}
        nodeSelector:
          matchLabels:
            node-role.kubernetes.io/worker-test: ""
      EOF
      
      echo "#####################################"
      echo "   Creating reproducer conditions    "
      echo "#####################################"
      
      f=$(mktemp)
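      # The override delays the start of ostree-finalize-staged-hold.service by 15 seconds
      # (via an ExecStartPre sleep) to widen the window between staging and the MCO-initiated reboot.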
      cat <<EOF >"$f"
      [Service]
      ExecStartPre=/bin/bash -c "echo 'exec start pre'; /bin/sleep 15; echo 'exec start pre end'"
      EOF
      
      cat <<EOF | oc apply -f -
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        labels:
          machineconfiguration.openshift.io/role: worker-test
        name: 99-worker-test-smoketest-prep
      spec:
        baseOSExtensionsContainerImage: ""
        config:
          ignition:
            config:
              replace:
                verification: {}
            proxy: {}
            security:
              tls: {}
            timeouts: {}
            version: 3.2.0
          passwd: {}
          storage:
            files:
            - contents:
              source: data:text/plain;charset=utf-8;base64,$(base64 -w0 "$f")
                verification: {}
              group: {}
              mode: 600
              path: /etc/systemd/system/ostree-finalize-staged-hold.service.d/override.conf
      EOF
      
      echo "Sleeping for a bit ..."
      sleep 60
      echo "Waiting for MCP to be updated"
      oc wait --for=condition=Updated=true mcp/worker-test --timeout=${TIMEOUT}s
      
      echo "#####################################"
      echo "      Updating kernel arguments      "
      echo "#####################################"
      cat <<EOF | oc apply -f -
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        labels:
          machineconfiguration.openshift.io/role: worker-test
        name: 99-worker-test-kernelarg-smoketest
      spec:
        kernelArguments:
          - abc=def
      EOF
      
      echo "Sleeping for a bit ..."
      sleep 60
      echo "Waiting for MCP to be updated"
      oc wait --for=condition=Updated=true mcp/worker-test --timeout=${TIMEOUT}s &
      oc wait --for=condition=Degraded=true mcp/worker-test --timeout=${TIMEOUT}s &
      wait -n
      
      echo "#####################################"
      echo "             End of test             "
      echo "#####################################"
      oc get mcp
      

      While the above test delays ostree-finalize-staged-hold.service by 15 seconds and does not reproduce the exact customer error message, it still reproduces an issue where the rpm-ostree staged changes are not applied. The 15-second sleep is of course extreme, but it demonstrates that any delay in starting the -hold service can lead to a race condition that makes the process fail.

      With the above test, the MCP goes into a degraded state with:

      [akaris@workstation 04029338 (master)]$ oc get mcp worker-test -o yaml | grep abc -C4
          type: Updating
        - lastTransitionTime: "2025-02-21T12:13:03Z"
          message: 'Node ip-10-0-107-203.us-west-2.compute.internal is reporting: "unexpected
            on-disk state validating against rendered-worker-test-bf73a6e5a7892ac4151522bbf11a8f72:
            missing expected kernel arguments: [abc=def]"'
          reason: 1 nodes are reporting degraded status on sync
          status: "True"
          type: NodeDegraded
        - lastTransitionTime: "2025-02-21T12:13:03Z"
      

      In the journal:

      Feb 21 12:09:11 ip-10-0-107-203 root[8530]: machine-config-daemon[2561]: Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1="all" --delete=psi=0 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1="all" --append=psi=0 --append=abc=def]
      Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: client(id:machine-config-operator dbus:1.254 unit:crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope uid:0) added; new total=1
      Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Loaded sysroot
      Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Locked sysroot
      Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Initiated txn KernelArgs for client(id:machine-config-operator dbus:1.254 unit:crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope uid:0): /org/projectatomic/rpmostree1/rhcos
      Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Process [pid: 8531 uid: 0 unit: crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope] connected to transaction progress
      Feb 21 12:09:11 ip-10-0-107-203 kubenswrapper[2259]: I0221 12:09:11.564171    2259 patch_prober.go:28] interesting pod/loki-promtail-z7ll9 container/promtail namespace/openshift-e2e-loki: Readiness probe status=failure output="Get \"http://10.129.2.3:3101/ready\": context deadline exceeded (Client.Timeout exceeded wh
      ile awaiting headers)" start-of-body=
      Feb 21 12:09:11 ip-10-0-107-203 kubenswrapper[2259]: I0221 12:09:11.564237    2259 prober.go:107] "Probe failed" probeType="Readiness" pod="openshift-e2e-loki/loki-promtail-z7ll9" podUID="c31172d3-6a1e-457a-8e66-c49db072c977" containerName="promtail" probeResult="failure" output="Get \"http://10.129.2.3:3101/ready\":
       context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
      Feb 21 12:09:16 ip-10-0-107-203 systemd[1]: Starting Hold /boot Open for OSTree Finalize Staged Deployment...
      Feb 21 12:09:16 ip-10-0-107-203 bash[8598]: exec start pre
      Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Created new deployment /ostree/deploy/rhcos/deploy/7c3910540178cc5e35b9c819a09cb58831b264384e9ff7a071a73a379dc32718.4
      Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: sanitycheck(/usr/bin/true) successful
      Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Txn KernelArgs on /org/projectatomic/rpmostree1/rhcos successful
      Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Unlocked sysroot
      Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Process [pid: 8531 uid: 0 unit: crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope] disconnected from transaction progress
      Feb 21 12:09:17 ip-10-0-107-203 rpm-ostree[6705]: client(id:machine-config-operator dbus:1.254 unit:crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope uid:0) vanished; remaining=0
      Feb 21 12:09:17 ip-10-0-107-203 rpm-ostree[6705]: In idle state; will auto-exit in 63 seconds
      Feb 21 12:09:17 ip-10-0-107-203 root[8608]: machine-config-daemon[2561]: Rebooting node
      Feb 21 12:09:17 ip-10-0-107-203 root[8609]: machine-config-daemon[2561]: initiating reboot: Node will reboot into config rendered-worker-test-bf73a6e5a7892ac4151522bbf11a8f72
      Feb 21 12:09:17 ip-10-0-107-203 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-worker-test-bf73a6e5a7892ac4151522bbf11a8f72.
      Feb 21 12:09:17 ip-10-0-107-203 root[8612]: machine-config-daemon[2561]: reboot successful
      Feb 21 12:09:17 ip-10-0-107-203 systemd-logind[913]: The system will reboot now!
      Feb 21 12:09:17 ip-10-0-107-203 systemd-logind[913]: System is rebooting.
      

      And here are the MCO daemon logs:

      2025-02-21T12:09:11.464509464+00:00 stderr F I0221 12:09:11.464485    2561 update.go:2641] Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1="all" --delete=psi=0 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1="all" --append=psi=0 --append=abc=def]
      2025-02-21T12:09:11.466273927+00:00 stderr F I0221 12:09:11.466231    2561 update.go:2626] Running: rpm-ostree kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1="all" --delete=psi=0 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1="all" --append=psi=0 --append=abc=def
      2025-02-21T12:09:16.367223780+00:00 stdout F Staging deployment...done
      2025-02-21T12:09:17.184102954+00:00 stdout F Changes queued for next boot. Run "systemctl reboot" to start a reboot
      

      So we can see here that the MCO daemon reboots the node even though rpm-ostree is not ready yet, because the service that actually applies the staged changes (ostree-finalize-staged.service) is not yet active.
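
      For completeness, a quick way to confirm the symptom once the node is back up is to compare the kernel command line with what was staged. This is just a sketch using standard oc tooling; the node name is the one from this reproducer run:

      # With the race, abc=def is missing from the booted kernel command line
      # (the debug pod shares the node's kernel, so /proc/cmdline is the node's):
      oc debug node/ip-10-0-107-203.us-west-2.compute.internal -- cat /proc/cmdline

      # Compare with the kernel arguments rpm-ostree reports on the node:
      oc debug node/ip-10-0-107-203.us-west-2.compute.internal -- chroot /host rpm-ostree kargs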
