OCP Technical Release Team / TRT-1571

4.16 CI payload rejected, Node process segfaulted at reboot time, NetworkManager


    • Type: Story
    • Resolution: Done
    • Priority: Blocker

      https://amd64.ocp.releases.ci.openshift.org/releasestream/4.16.0-0.ci/release/4.16.0-0.ci-2024-03-20-044408

      aggregated-gcp-ovn-upgrade-4.16-micro

      This happened on 8 of 10 jobs in that aggregated job (the other 2 didn't get past install).

      Node process segfaulted (0s):
      nodes/ci-op-8zbhh82n-f9945-lvzfz-master-0/journal-previous.gz:Mar 20 06:14:07.554815 ci-op-8zbhh82n-f9945-lvzfz-master-0 kernel: NetworkManager[1192]: segfault at 1 ip 00005617e33ec719 sp 00007ffe03abdc70 error 4 in NetworkManager[5617e32ef000+273000] likely on CPU 5 (core 2, socket 0)
      nodes/ci-op-8zbhh82n-f9945-lvzfz-master-0/journal.gz:Mar 20 06:14:07.554815 ci-op-8zbhh82n-f9945-lvzfz-master-0 kernel: NetworkManager[1192]: segfault at 1 ip 00005617e33ec719 sp 00007ffe03abdc70 error 4 in NetworkManager[5617e32ef000+273000] likely on CPU 5 (core 2, socket 0)
      nodes/ci-op-8zbhh82n-f9945-lvzfz-master-1/journal-previous.gz:Mar 20 06:19:20.823335 ci-op-8zbhh82n-f9945-lvzfz-master-1 kernel: NetworkManager[1186]: segfault at 1 ip 000055a1a4e7a719 sp 00007ffdb4b16da0 error 4 in NetworkManager[55a1a4d7d000+273000] likely on CPU 4 (core 1, socket 0)
      

      All 8 of the jobs that ran passed, but the aggregator fails when it finds a segfault.

      Looking at one of those jobs, I see that the segfault happens in exactly the same way on all nodes (right at reboot, which is probably why the segfault is of no consequence).
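
      For triage, here is a minimal sketch of how the segfault lines can be pulled out of the gathered journal*.gz artifacts. This is an assumption-laden approximation, not the actual origin/aggregator check; the script name and invocation are hypothetical.

      # scan_segfaults.py -- rough approximation, NOT the real "Node process
      # segfaulted" test; it just greps journal artifacts for kernel segfault lines.
      import gzip
      import re
      import sys

      # Matches lines like:
      #   kernel: NetworkManager[1192]: segfault at 1 ip ... error 4 in NetworkManager[...]
      SEGFAULT_RE = re.compile(r"kernel: (\S+)\[\d+\]: segfault at ")

      def segfault_lines(path):
          """Return the kernel segfault lines found in one journal*.gz artifact."""
          with gzip.open(path, mode="rt", errors="replace") as fh:
              return [line.rstrip() for line in fh if SEGFAULT_RE.search(line)]

      if __name__ == "__main__":
          # Usage: python scan_segfaults.py nodes/*/journal*.gz
          for journal in sys.argv[1:]:
              for hit in segfault_lines(journal):
                  print(f"{journal}: {hit}")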

      This azure job also had the symptom, but it ran at 2024-03-19 02:00:07 on 4.16.0-0.nightly-2024-03-19-015701 (as an informing job).

      These 4 jobs have the segfault but only on the workers (and ran various recent 4.16 nightlies starting with 4.16.0-0.nightly-2024-03-19-015701):

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-gcp-ovn-rt/1770334574610485248
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-gcp-ovn-rt/1770200817094103040
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-gcp-ovn-rt/1770065002347106304
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-gcp-ovn-rt/1769906695716212736

      So this happens on gcp-ovn-upgrade jobs (all nodes get the segfault), on non-upgrade gcp-ovn-rt jobs (only the 3 workers get the segfault), and on one azure-sdn upgrade job (all nodes get the segfault).

      For all jobs that got the segfault, the version shown in the RHCOS is NetworkManager-1-1.47.2-1.el9-x86_64. Prow jobs where only this one test fails still show as passed.
      You cannot search specifically for this failure pattern on sippy because the "Node process segfaulted" test is generic and covers other processes (including "slapd").
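
      As a rough workaround for that limitation, the segfault lines extracted from the journals can be tallied by process name so NetworkManager hits can be separated from slapd and others. This is a sketch of my own, not a sippy feature, and the script/pipeline names are hypothetical.

      # tally_segfaults.py -- hypothetical helper, not part of sippy: reads the
      # segfault lines extracted from the journals on stdin and counts them per process.
      import re
      import sys
      from collections import Counter

      SEGFAULT_RE = re.compile(r"kernel: (?P<proc>\S+)\[\d+\]: segfault at ")

      def tally(lines):
          """Count segfault lines per offending process name."""
          counts = Counter()
          for line in lines:
              m = SEGFAULT_RE.search(line)
              if m:
                  counts[m.group("proc")] += 1
          return counts

      if __name__ == "__main__":
          # e.g.: python scan_segfaults.py nodes/*/journal*.gz | python tally_segfaults.py
          for proc, n in tally(sys.stdin).most_common():
              print(f"{proc}: {n} segfault line(s)")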

            Assignee: Dennis Periquet (dperique@redhat.com)
            Reporter: Dennis Periquet (dperique@redhat.com)