OCP Technical Release Team / TRT-1571

4.16 CI payload rejected, Node process segfaulted at reboot time, NetworkManager


    • Type: Story
    • Resolution: Done
    • Priority: Blocker

      https://amd64.ocp.releases.ci.openshift.org/releasestream/4.16.0-0.ci/release/4.16.0-0.ci-2024-03-20-044408

      aggregated-gcp-ovn-upgrade-4.16-micro

      This happened on 8 of 10 jobs in that aggregated job (the other 2 didn't get past install).

      Node process segfaulted (0s):
      nodes/ci-op-8zbhh82n-f9945-lvzfz-master-0/journal-previous.gz:Mar 20 06:14:07.554815 ci-op-8zbhh82n-f9945-lvzfz-master-0 kernel: NetworkManager[1192]: segfault at 1 ip 00005617e33ec719 sp 00007ffe03abdc70 error 4 in NetworkManager[5617e32ef000+273000] likely on CPU 5 (core 2, socket 0)
      nodes/ci-op-8zbhh82n-f9945-lvzfz-master-0/journal.gz:Mar 20 06:14:07.554815 ci-op-8zbhh82n-f9945-lvzfz-master-0 kernel: NetworkManager[1192]: segfault at 1 ip 00005617e33ec719 sp 00007ffe03abdc70 error 4 in NetworkManager[5617e32ef000+273000] likely on CPU 5 (core 2, socket 0)
      nodes/ci-op-8zbhh82n-f9945-lvzfz-master-1/journal-previous.gz:Mar 20 06:19:20.823335 ci-op-8zbhh82n-f9945-lvzfz-master-1 kernel: NetworkManager[1186]: segfault at 1 ip 000055a1a4e7a719 sp 00007ffdb4b16da0 error 4 in NetworkManager[55a1a4d7d000+273000] likely on CPU 4 (core 1, socket 0)
      

      All 8 of the jobs that ran passed, but the aggregator fails when it finds a segfault.

      Looking at one of those jobs, I see that the segfault happens in exactly the same way on all nodes (right at reboot, which is probably why the segfault is of no consequence).
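
      For triage, here is a minimal sketch of how the segfault lines can be pulled out of the gathered journal*.gz artifacts. This is an assumption-laden approximation, not the actual origin/aggregator check; the script name and invocation are hypothetical.

      # scan_segfaults.py -- rough approximation, NOT the real "Node process
      # segfaulted" test; it just greps journal artifacts for kernel segfault lines.
      import gzip
      import re
      import sys

      # Matches lines like:
      #   kernel: NetworkManager[1192]: segfault at 1 ip ... error 4 in NetworkManager[...]
      SEGFAULT_RE = re.compile(r"kernel: (\S+)\[\d+\]: segfault at ")

      def segfault_lines(path):
          """Return the kernel segfault lines found in one journal*.gz artifact."""
          with gzip.open(path, mode="rt", errors="replace") as fh:
              return [line.rstrip() for line in fh if SEGFAULT_RE.search(line)]

      if __name__ == "__main__":
          # Usage: python scan_segfaults.py nodes/*/journal*.gz
          for journal in sys.argv[1:]:
              for hit in segfault_lines(journal):
                  print(f"{journal}: {hit}")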

      This azure job also had the symptom, but it ran at 2024-03-19 02:00:07 on 4.16.0-0.nightly-2024-03-19-015701 (as an informing job).

      These 4 jobs have the segfault but only on the workers (and ran various recent 4.16 nightlies starting with 4.16.0-0.nightly-2024-03-19-015701):

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-gcp-ovn-rt/1770334574610485248
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-gcp-ovn-rt/1770200817094103040
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-gcp-ovn-rt/1770065002347106304
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-gcp-ovn-rt/1769906695716212736

      So this happens on gcp-ovn-upgrade jobs (all nodes get the segfault), on non-upgrade gcp-ovn-rt jobs (only the 3 workers get the segfault), and on one azure-sdn upgrade job (all nodes get the segfault).

      For all jobs that got the segfault, the version shown in the RHCOS is NetworkManager-1-1.47.2-1.el9-x86_64. Prow jobs where only this one test fails still show as passed.
      You cannot search specifically for this failure pattern on sippy because the "Node process segfaulted" test is generic and covers other processes (including "slapd").
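
      As a rough workaround for that limitation, the segfault lines extracted from the journals can be tallied by process name so NetworkManager hits can be separated from slapd and others. This is a sketch of my own, not a sippy feature, and the script/pipeline names are hypothetical.

      # tally_segfaults.py -- hypothetical helper, not part of sippy: reads the
      # segfault lines extracted from the journals on stdin and counts them per process.
      import re
      import sys
      from collections import Counter

      SEGFAULT_RE = re.compile(r"kernel: (?P<proc>\S+)\[\d+\]: segfault at ")

      def tally(lines):
          """Count segfault lines per offending process name."""
          counts = Counter()
          for line in lines:
              m = SEGFAULT_RE.search(line)
              if m:
                  counts[m.group("proc")] += 1
          return counts

      if __name__ == "__main__":
          # e.g.: python scan_segfaults.py nodes/*/journal*.gz | python tally_segfaults.py
          for proc, n in tally(sys.stdin).most_common():
              print(f"{proc}: {n} segfault line(s)")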

            Assignee: Dennis Periquet (dperique@redhat.com)
            Reporter: Dennis Periquet (dperique@redhat.com)