[OCPBUGS-9685] fails to reconcile to RT kernel on interrupted updates

      The aggregated https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-gcp-ovn-rt-upgrade-4.14-minor-release-openshift-release-analysis-aggregator/1633554110798106624 job failed. Digging into one of the runs:

       

      This MCD log shows the failure: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1633554106595414016/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-daemon-p2vf4_machine-config-daemon.log

       

      Deployments:
      * ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4f28fbcd049025bab9719379492420f9eaab0426cdbbba43b395eb8421f10a17
                         Digest: sha256:4f28fbcd049025bab9719379492420f9eaab0426cdbbba43b395eb8421f10a17
                        Version: 413.86.202302230536-0 (2023-03-08T20:10:47Z)
            RemovedBasePackages: kernel-core kernel-modules kernel kernel-modules-extra 4.18.0-372.43.1.el8_6
                LayeredPackages: kernel-rt-core kernel-rt-kvm kernel-rt-modules
                                 kernel-rt-modules-extra
      ...
      E0308 22:11:21.925030 74176 writer.go:200] Marking Degraded due to: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd299b2bf3cc98fb70907f152b4281633064fe33527b5d6a42ddc418ff00eec1 : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd299b2bf3cc98fb70907f152b4281633064fe33527b5d6a42ddc418ff00eec1: error: Importing: remote error: fetching blob: received unexpected HTTP status: 500 Internal Server Error
      ... 
      I0308 22:11:36.959143   74176 update.go:2010] Running: rpm-ostree override reset kernel kernel-core kernel-modules kernel-modules-extra --uninstall kernel-rt-core --uninstall kernel-rt-kvm --uninstall kernel-rt-modules --uninstall kernel-rt-modules-extra
      ...
      E0308 22:12:35.525156   74176 writer.go:200] Marking Degraded due to: error running rpm-ostree override reset kernel kernel-core kernel-modules kernel-modules-extra --uninstall kernel-rt-core --uninstall kernel-rt-kvm --uninstall kernel-rt-modules --uninstall kernel-rt-modules-extra: error: Package/capability 'kernel-rt-core' is not currently requested
      : exit status 1
        

       

      Something is going wrong here in our retry loop. I think it might be that we don't clear the pending deployment on failure. In other words, we need to run

      rpm-ostree cleanup -p

      before we retry.

       

      This is fallout from https://github.com/openshift/machine-config-operator/pull/3580 - although I suspect it may have been an issue before too.

       


            OpenShift Jira Automation Bot added a comment -

            Per the announcement sent regarding the removal of "Blocker" as an option in the Priority field, this issue (which was already closed at the time of the bulk update) had Priority = "Blocker". It is being updated to Priority = "Critical". No additional fields were changed.


            Scott Dodson added a comment -

            Closing out a few bugs which were fixed prior to 4.13 GA so that it's easier for us to review which 4.14 blockers may be associated with outstanding 4.13 bugs. This seems safe because the bug was fixed on the master branch (4.14) and backported to 4.13 prior to 4.13 GA, so there's no need to include these bugs in the release notes for 4.14.


            Sergio Regidor de la Rosa added a comment - edited

            Verified using IPI on AWS. Cluster version:

            $ oc get clusterversion
            NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
            version   4.14.0-0.ci-2023-03-10-110647   True        False         3h38m   Cluster version is 4.14.0-0.ci-2023-03-10-110647
            

            Verification steps:
            1. Create a MC to deploy realtime kernel
            2. Monitor the MCD logs, and execute the command `systemctl kill rpm-ostreed` in the node while rpm-ostree command tries to switch the kernel.
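            For step 1, a minimal MachineConfig that switches the worker pool to the realtime kernel looks like the following (the `99-worker-realtime` name is just a convention; `kernelType: realtime` is the field that matters):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-realtime
spec:
  kernelType: realtime
```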

            We can see this in the MCD logs:

            I0310 15:43:34.137558    1966 update.go:2025] Initiating switch to kernel realtime
            I0310 15:43:34.138843    1966 update.go:2010] Running: rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-core kernel-modules-extra --install kernel-rt-core --install kernel-rt-modules --install kernel-rt-modules-extra --install kernel-rt-kvm
            Checking out tree 39ecbca...done
            Enabled rpm-md repositories: coreos-extensions
            Importing rpm-md...done
            rpm-md repo 'coreos-extensions' (cached); generated: 2023-03-06T18:17:14Z solvables: 52
            Resolving dependencies...done 
            Applying 5 overrides and 5 overlays
            Processing packages...done
            Running pre scripts...done
            Running post scripts...done   ------ we ran `systemctl kill rpm-ostreed` here ------
            I0310 15:43:35.637513    1966 update.go:2136] Rolling back applied changes to OS due to error: error running rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-core kernel-modules-extra --install kernel-rt-core --install kernel-rt-modules --install kernel-rt-modules-extra --install kernel-rt-kvm: error: Bus owner changed, aborting. This likely means the daemon crashed; check logs with `journalctl -xe`.
            : exit status 1
            I0310 15:43:35.637529    1966 update.go:2010] Running: rpm-ostree cleanup -p
            Deployments unchanged.
            I0310 15:43:35.841961    1966 update.go:1627] Password has been configured
            I0310 15:43:35.842065    1966 update.go:1580] Writing SSH keys to "/home/core/.ssh/authorized_keys.d/ignition"
            

            And the pool becomes degraded with this message:

              - lastTransitionTime: "2023-03-10T15:43:41Z"
                message: 'Node ip-10-0-130-187.us-east-2.compute.internal is reporting: "error
                  running rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-core
                  kernel-modules-extra --install kernel-rt-core --install kernel-rt-modules --install
                  kernel-rt-modules-extra --install kernel-rt-kvm: \x1b[0m\x1b[31merror: \x1b[0mBus
                  owner changed, aborting. This likely means the daemon crashed; check logs with
                  `journalctl -xe`.\n: exit status 1"'
                reason: 1 nodes are reporting degraded status on sync
                status: "True"
                type: NodeDegraded
            

            Then the MCD retries, succeeds, the configuration is applied properly, and the pool is back to normal.

            We can see the RT kernel present in the node after the successful retry:

            $ oc debug node/ip-10-0-130-187.us-east-2.compute.internal -- chroot /host uname -a
            Starting pod/ip-10-0-130-187us-east-2computeinternal-debug ...
            To use host binaries, run `chroot /host`
            Linux ip-10-0-130-187 5.14.0-283.rt14.283.el9.x86_64 #1 SMP PREEMPT_RT Fri Feb 24 12:53:58 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
            
            Removing debug pod ...
            
            

            We move the status to VERIFIED.

            Thank you very much!!


            Colin Walters added a comment - edited

            One way to verify this:

             

            • start a switch to kernel-rt
            • have a shell open on the first worker node (oc debug node/ssh)
            • When you see the MCD running rpm-ostree to do the switch, run "systemctl kill rpm-ostreed"

             

            On a cluster without this change (current 4.13, I believe), the MCD should get stuck with the error above. On a cluster with this change, the MCD should retry and the operation should complete.


            Colin Walters added a comment -

            I haven't yet found evidence of this occurring in 4.12 -> 4.12 -rt install jobs. I'm starting to suspect this has something to do with rhel8 rpm parsing rhel9 rpms.


            Colin Walters added a comment - https://bugzilla.redhat.com/show_bug.cgi?id=2177088 will track fixing the segfault

            Colin Walters added a comment - edited

            OK, I reproduced a segfault:

             

            Mar 09 16:59:50 cosa-devsh rpm-ostree[152717]: Initiated txn UpdateDeployment for client(id:cli dbus:1.90 unit:session-4.scope uid:0): /org/projectatomic/rpmostree1/rhcos
            Mar 09 16:59:50 cosa-devsh rpm-ostree[152717]: Process [pid: 263612 uid: 0 unit: session-4.scope] connected to transaction progress
            Mar 09 16:59:52 cosa-devsh rpm-ostree[152717]: warning: Found SQLITE rpmdb.sqlite database while attempting bdb backend: using sqlite backend.
            Mar 09 16:59:52 cosa-devsh rpm-ostree[152717]: warning: Found SQLITE rpmdb.sqlite database while attempting bdb backend: using sqlite backend.
            Mar 09 16:59:52 cosa-devsh rpm-ostree[263619]: warning: Found SQLITE rpmdb.sqlite database while attempting bdb backend: using sqlite backend.
            Mar 09 16:59:52 cosa-devsh rpm-ostree[152717]: Preparing pkg txn; enabled repos: [] solvables: 0
            Mar 09 16:59:53 cosa-devsh rpm-ostree[152717]: Relabeled 0/4 pkgs
            Mar 09 16:59:53 cosa-devsh systemd[1]: rpm-ostreed.service: Main process exited, code=killed, status=11/SEGV
            Mar 09 16:59:53 cosa-devsh systemd[1]: rpm-ostreed.service: Failed with result 'signal'.
            

             

            But I don't have debuginfo right now. This is going to be super messy because fixing it is going to need to go to 4.12... unless we decide to live with a retry loop.

             

            I'm starting to suspect that this has always been an issue and it's been papered over by the MCD retry loop for a long time.  Going to go start looking at pod logs from 4.12...


            Colin Walters added a comment -

            OK, I have a qemu image up running a loop:

            $ while true; do
                rpm-ostree reset -oil
                rpm-ostree override remove kernel{,-core,-modules,-modules-core,-modules-extra} \
                  --install ./kernel-rt-core-5.14.0-283.rt14.283.el9.x86_64.rpm \
                  --install ./kernel-rt-modules-5.14.0-283.rt14.283.el9.x86_64.rpm \
                  --install ./kernel-rt-modules-core-5.14.0-283.rt14.283.el9.x86_64.rpm \
                  --install ./kernel-rt-kvm-5.14.0-283.rt14.283.el9.x86_64.rpm
              done

            But no failures so far...


            Colin Walters added a comment -

            Or, maybe it's that librpm is segfaulting:

            nodes/ci-op-snvcvxtm-d5fb3-f8qqc-worker-a-fx7l5/journal-previous.gz:Mar 08 21:50:52.036738 ci-op-snvcvxtm-d5fb3-f8qqc-worker-a-fx7l5 kernel: pool[54524]: segfault at 7f0027fc2000 ip 00007f0043e840f2 sp 00007f002b729c50 error 4 in librpm.so.8.2.0[7f0043e42000+77000]
            nodes/ci-op-snvcvxtm-d5fb3-f8qqc-worker-a-fx7l5/journal.gz:Mar 08 21:50:52.036738 ci-op-snvcvxtm-d5fb3-f8qqc-worker-a-fx7l5 kernel: pool[54524]: segfault at 7f0027fc2000 ip 00007f0043e840f2 sp 00007f002b729c50 error 4 in librpm.so.8.2.0[7f0043e42000+77000]
            

             

            (But hmm, could it be that we're not catching that error somehow?)


              walters@redhat.com Colin Walters
              Rio Liu