[OCPBUGS-9685] fails to reconcile to RT kernel on interrupted updates

      The aggregated https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-gcp-ovn-rt-upgrade-4.14-minor-release-openshift-release-analysis-aggregator/1633554110798106624 job failed. Digging into one of the runs:

       

      This MCD log shows the failure: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1633554106595414016/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-daemon-p2vf4_machine-config-daemon.log

       

      Deployments:
      * ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4f28fbcd049025bab9719379492420f9eaab0426cdbbba43b395eb8421f10a17
                         Digest: sha256:4f28fbcd049025bab9719379492420f9eaab0426cdbbba43b395eb8421f10a17
                        Version: 413.86.202302230536-0 (2023-03-08T20:10:47Z)
            RemovedBasePackages: kernel-core kernel-modules kernel kernel-modules-extra 4.18.0-372.43.1.el8_6
                LayeredPackages: kernel-rt-core kernel-rt-kvm kernel-rt-modules
                                 kernel-rt-modules-extra
      ...
      E0308 22:11:21.925030 74176 writer.go:200] Marking Degraded due to: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd299b2bf3cc98fb70907f152b4281633064fe33527b5d6a42ddc418ff00eec1 : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd299b2bf3cc98fb70907f152b4281633064fe33527b5d6a42ddc418ff00eec1: error: Importing: remote error: fetching blob: received unexpected HTTP status: 500 Internal Server Error
      ... 
      I0308 22:11:36.959143   74176 update.go:2010] Running: rpm-ostree override reset kernel kernel-core kernel-modules kernel-modules-extra --uninstall kernel-rt-core --uninstall kernel-rt-kvm --uninstall kernel-rt-modules --uninstall kernel-rt-modules-extra
      ...
      E0308 22:12:35.525156   74176 writer.go:200] Marking Degraded due to: error running rpm-ostree override reset kernel kernel-core kernel-modules kernel-modules-extra --uninstall kernel-rt-core --uninstall kernel-rt-kvm --uninstall kernel-rt-modules --uninstall kernel-rt-modules-extra: error: Package/capability 'kernel-rt-core' is not currently requested
      : exit status 1
        

       

      Something is going wrong here in our retry loop. I think it might be that we don't clear the pending deployment on failure. In other words, we need to run

      rpm-ostree cleanup -p

      before we retry.

       

      This is fallout from https://github.com/openshift/machine-config-operator/pull/3580 - although I suspect it may have been an issue before too.

       


            OpenShift Jira Automation Bot added a comment -

            Per the announcement sent regarding the removal of "Blocker" as an option in the Priority field, this issue (which was already closed at the time of the bulk update) had Priority = "Blocker". It is being updated to Priority = "Critical". No additional fields were changed.


            Scott Dodson added a comment -

            Closing out a few bugs which were fixed prior to 4.13 GA so that it's easier for us to review which 4.14 blockers may be associated with outstanding 4.13 bugs. This seems safe because the bug was fixed on the master branch (4.14) and backported to 4.13 prior to 4.13 GA, so there's no need to include these bugs in the release notes for 4.14.


            Sergio Regidor de la Rosa added a comment - edited

            Verified using IPI on AWS. Cluster version:

            $ oc get clusterversion
            NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
            version   4.14.0-0.ci-2023-03-10-110647   True        False         3h38m   Cluster version is 4.14.0-0.ci-2023-03-10-110647
            

            Verification steps:
            1. Create a MC to deploy realtime kernel
            2. Monitor the MCD logs, and execute the command `systemctl kill rpm-ostreed` in the node while rpm-ostree command tries to switch the kernel.
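            For step 1, a minimal MachineConfig that switches the worker pool to the realtime kernel looks like the following (the `99-worker-realtime` name is just a convention; `kernelType: realtime` is the field that matters):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-realtime
spec:
  kernelType: realtime
```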

            We can see this in the MCD logs:

            I0310 15:43:34.137558    1966 update.go:2025] Initiating switch to kernel realtime
            I0310 15:43:34.138843    1966 update.go:2010] Running: rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-core kernel-modules-extra --install kernel-rt-core --install kernel-rt-modules --install kernel-rt-modules-extra --install kernel-rt-kvm
            Checking out tree 39ecbca...done
            Enabled rpm-md repositories: coreos-extensions
            Importing rpm-md...done
            rpm-md repo 'coreos-extensions' (cached); generated: 2023-03-06T18:17:14Z solvables: 52
            Resolving dependencies...done 
            Applying 5 overrides and 5 overlays
            Processing packages...done
            Running pre scripts...done
            Running post scripts...done   ------ we ran `systemctl kill rpm-ostreed` here ------
            I0310 15:43:35.637513    1966 update.go:2136] Rolling back applied changes to OS due to error: error running rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-core kernel-modules-extra --install kernel-rt-core --install kernel-rt-modules --install kernel-rt-modules-extra --install kernel-rt-kvm: error: Bus owner changed, aborting. This likely means the daemon crashed; check logs with `journalctl -xe`.
            : exit status 1
            I0310 15:43:35.637529    1966 update.go:2010] Running: rpm-ostree cleanup -p
            Deployments unchanged.
            I0310 15:43:35.841961    1966 update.go:1627] Password has been configured
            I0310 15:43:35.842065    1966 update.go:1580] Writing SSH keys to "/home/core/.ssh/authorized_keys.d/ignition"
            

            And the pool becomes degraded with this message:

              - lastTransitionTime: "2023-03-10T15:43:41Z"
                message: 'Node ip-10-0-130-187.us-east-2.compute.internal is reporting: "error
                  running rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-core
                  kernel-modules-extra --install kernel-rt-core --install kernel-rt-modules --install
                  kernel-rt-modules-extra --install kernel-rt-kvm: \x1b[0m\x1b[31merror: \x1b[0mBus
                  owner changed, aborting. This likely means the daemon crashed; check logs with
                  `journalctl -xe`.\n: exit status 1"'
                reason: 1 nodes are reporting degraded status on sync
                status: "True"
                type: NodeDegraded
            

            Then the MCD retries, succeeds, the configuration is applied properly, and the pool is back to normal.

            We can see the RT kernel present in the node after the successful retry:

            $ oc debug node/ip-10-0-130-187.us-east-2.compute.internal -- chroot /host uname -a
            Starting pod/ip-10-0-130-187us-east-2computeinternal-debug ...
            To use host binaries, run `chroot /host`
            Linux ip-10-0-130-187 5.14.0-283.rt14.283.el9.x86_64 #1 SMP PREEMPT_RT Fri Feb 24 12:53:58 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
            
            Removing debug pod ...
            
            

            We move the status to VERIFIED.

            Thank you very much!!


            Colin Walters added a comment - edited

            One way to verify this:

             

            • start a switch to kernel-rt
            • have a shell open on the first worker node (oc debug node/ssh)
            • When you see the MCD running rpm-ostree to do the switch, run "systemctl kill rpm-ostreed"

             

            On a cluster without this change (current 4.13, I believe), the MCD should get stuck with the error above. On a cluster with this change, the MCD should retry and the operation should complete.


            Colin Walters added a comment -

            I haven't yet found evidence of this occurring in 4.12 -> 4.12 -rt install jobs. I'm starting to suspect this has something to do with rhel8 rpm parsing rhel9 rpms.


            Colin Walters added a comment - https://bugzilla.redhat.com/show_bug.cgi?id=2177088 will track fixing the segfault

            Colin Walters added a comment - edited

            OK, I reproduced a segfault:

             

            Mar 09 16:59:50 cosa-devsh rpm-ostree[152717]: Initiated txn UpdateDeployment for client(id:cli dbus:1.90 unit:session-4.scope uid:0): /org/projectatomic/rpmostree1/rhcos
            Mar 09 16:59:50 cosa-devsh rpm-ostree[152717]: Process [pid: 263612 uid: 0 unit: session-4.scope] connected to transaction progress
            Mar 09 16:59:52 cosa-devsh rpm-ostree[152717]: warning: Found SQLITE rpmdb.sqlite database while attempting bdb backend: using sqlite backend.
            Mar 09 16:59:52 cosa-devsh rpm-ostree[152717]: warning: Found SQLITE rpmdb.sqlite database while attempting bdb backend: using sqlite backend.
            Mar 09 16:59:52 cosa-devsh rpm-ostree[263619]: warning: Found SQLITE rpmdb.sqlite database while attempting bdb backend: using sqlite backend.
            Mar 09 16:59:52 cosa-devsh rpm-ostree[152717]: Preparing pkg txn; enabled repos: [] solvables: 0
            Mar 09 16:59:53 cosa-devsh rpm-ostree[152717]: Relabeled 0/4 pkgs
            Mar 09 16:59:53 cosa-devsh systemd[1]: rpm-ostreed.service: Main process exited, code=killed, status=11/SEGV
            Mar 09 16:59:53 cosa-devsh systemd[1]: rpm-ostreed.service: Failed with result 'signal'.
            

             

            But I don't have debuginfo right now. This is going to be super messy because fixing it is going to need to go to 4.12... unless we decide to live with a retry loop.

             

            I'm starting to suspect that this has always been an issue and it's been papered over by the MCD retry loop for a long time.  Going to go start looking at pod logs from 4.12...


            Colin Walters added a comment -

            OK, I have a qemu image up running a loop:

            $ while true; do
                rpm-ostree reset -oil
                rpm-ostree override remove kernel{,-core,-modules,-modules-core,-modules-extra} \
                  --install ./kernel-rt-core-5.14.0-283.rt14.283.el9.x86_64.rpm \
                  --install ./kernel-rt-modules-5.14.0-283.rt14.283.el9.x86_64.rpm \
                  --install ./kernel-rt-modules-core-5.14.0-283.rt14.283.el9.x86_64.rpm \
                  --install ./kernel-rt-kvm-5.14.0-283.rt14.283.el9.x86_64.rpm
              done

            But no failures so far...


            Colin Walters added a comment -

            Or, maybe it's that librpm is segfaulting:

            nodes/ci-op-snvcvxtm-d5fb3-f8qqc-worker-a-fx7l5/journal-previous.gz:Mar 08 21:50:52.036738 ci-op-snvcvxtm-d5fb3-f8qqc-worker-a-fx7l5 kernel: pool[54524]: segfault at 7f0027fc2000 ip 00007f0043e840f2 sp 00007f002b729c50 error 4 in librpm.so.8.2.0[7f0043e42000+77000]
            nodes/ci-op-snvcvxtm-d5fb3-f8qqc-worker-a-fx7l5/journal.gz:Mar 08 21:50:52.036738 ci-op-snvcvxtm-d5fb3-f8qqc-worker-a-fx7l5 kernel: pool[54524]: segfault at 7f0027fc2000 ip 00007f0043e840f2 sp 00007f002b729c50 error 4 in librpm.so.8.2.0[7f0043e42000+77000]
            

             

            (But hmm, could it be that we're not catching that error somehow?)


              walters@redhat.com Colin Walters
              Rio Liu