-
Story
-
Resolution: Done
-
Blocker
-
-
aggregated-gcp-ovn-upgrade-4.16-micro
This happened on 8 of 10 jobs in that aggregated job (the other 2 didn't get past install).
Test failure: "Node process segfaulted" (0s)

nodes/ci-op-8zbhh82n-f9945-lvzfz-master-0/journal-previous.gz:Mar 20 06:14:07.554815 ci-op-8zbhh82n-f9945-lvzfz-master-0 kernel: NetworkManager[1192]: segfault at 1 ip 00005617e33ec719 sp 00007ffe03abdc70 error 4 in NetworkManager[5617e32ef000+273000] likely on CPU 5 (core 2, socket 0)
nodes/ci-op-8zbhh82n-f9945-lvzfz-master-0/journal.gz:Mar 20 06:14:07.554815 ci-op-8zbhh82n-f9945-lvzfz-master-0 kernel: NetworkManager[1192]: segfault at 1 ip 00005617e33ec719 sp 00007ffe03abdc70 error 4 in NetworkManager[5617e32ef000+273000] likely on CPU 5 (core 2, socket 0)
nodes/ci-op-8zbhh82n-f9945-lvzfz-master-1/journal-previous.gz:Mar 20 06:19:20.823335 ci-op-8zbhh82n-f9945-lvzfz-master-1 kernel: NetworkManager[1186]: segfault at 1 ip 000055a1a4e7a719 sp 00007ffdb4b16da0 error 4 in NetworkManager[55a1a4d7d000+273000] likely on CPU 4 (core 1, socket 0)
All 8 of the jobs that ran passed their tests, but the aggregator fails the job when it finds a segfault.
Looking at one of those jobs, I see that the segfault happens in exactly the same way on every node, right at reboot, which is probably why the segfault is of no consequence.
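For reference, the segfault entries can be pulled out of downloaded prow artifacts with zgrep against the nodes/&lt;node&gt;/journal*.gz layout. This is a local sketch that recreates that layout with one of the journal lines above; the "demo-master-0" directory name is made up for the demo.

```shell
# Sketch: locate segfault entries in prow journal artifacts with zgrep.
# Recreate the nodes/<node>/journal*.gz layout locally; "demo-master-0" is a
# made-up node name, and the journal line is copied from the job above.
mkdir -p nodes/demo-master-0
printf '%s\n' 'Mar 20 06:14:07.554815 ci-op-8zbhh82n-f9945-lvzfz-master-0 kernel: NetworkManager[1192]: segfault at 1 ip 00005617e33ec719 sp 00007ffe03abdc70 error 4 in NetworkManager[5617e32ef000+273000] likely on CPU 5 (core 2, socket 0)' |
  gzip > nodes/demo-master-0/journal.gz

# zgrep searches the compressed journals directly; -l lists matching files
zgrep -l 'segfault' nodes/*/journal*.gz
```

Running the same zgrep from the top of a real artifact download shows at a glance which nodes (masters vs. workers) hit the segfault.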
Also, this azure job had the same symptom; it ran 2024-03-19 02:00:07 on 4.16.0-0.nightly-2024-03-19-015701 (as an informing job).
These 4 jobs have the segfault, but only on the workers (they ran various recent 4.16 nightlies, starting with 4.16.0-0.nightly-2024-03-19-015701):
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-gcp-ovn-rt/1770334574610485248
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-gcp-ovn-rt/1770200817094103040
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-gcp-ovn-rt/1770065002347106304
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-gcp-ovn-rt/1769906695716212736
So: the gcp-ovn-upgrade jobs get the segfault on all nodes, the non-upgrade gcp-ovn-rt jobs get it only on the 3 workers, and one azure-sdn upgrade job (segfault on all nodes) also showed the symptom.
For all jobs that hit the segfault, the version shown in the RHCOS is NetworkManager-1-1.47.2-1.el9-x86_64. Prow jobs where only this one test fails still show as passed.
You cannot search specifically for this failure pattern in sippy because the "Node process segfaulted" test is generic and also matches other processes (including "slapd").
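Since the "Node process segfaulted" test matches any process, one way to tell NetworkManager segfaults apart from, say, slapd ones when triaging a journal by hand is to extract the faulting process name from the kernel log line. A small sed sketch; the capture pattern is mine, not something the test itself uses:

```shell
# Extract the faulting process name from a kernel segfault line.
# The name is the token between "kernel: " and the "[pid]" before "segfault".
line='Mar 20 06:14:07.554815 ci-op-8zbhh82n-f9945-lvzfz-master-0 kernel: NetworkManager[1192]: segfault at 1 ip 00005617e33ec719 sp 00007ffe03abdc70 error 4 in NetworkManager[5617e32ef000+273000] likely on CPU 5 (core 2, socket 0)'
printf '%s\n' "$line" |
  sed -n 's/.*kernel: \([^[]*\)\[[0-9]*\]: segfault.*/\1/p'   # → NetworkManager
```

Piping the zgrep output through this filter, then sorting and counting, would show how many distinct processes the generic test is currently catching.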