RHEL-29856

NetworkManager: segfault at 1 running openshift prow job

    • NetworkManager-1.47.5-1.el9
    • None
    • Moderate
    • 1
    • sst_network_management
    • ssg_networking
    • 10
    • 2
    • False
    • None
    • No
    • NMT - RHEL-9.5 DTM 8

      Given a prow job running on Fedora 40 with NetworkManager installed,

      When the system undergoes an upgrade and reboots,

      Then NetworkManager should not segfault during the boot process.

      Definition of Done:

      • The implementation meets the acceptance criteria
      • The code is part of a downstream build attached to an errata

      AC and QE test alignment:
      Comment https://issues.redhat.com/browse/RHEL-29856?focusedId=24608166&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-24608166 confirms that the segfault no longer occurs during the boot process. Therefore, the acceptance criteria are met.
       

    • Pass
    • None
    • x86_64
    • None

      What were you trying to do that didn't work?

      On a prow job such as https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade/1770310694600708096 we are seeing:

       
      : Node process segfaulted 0s
      { nodes/ci-op-8zbhh82n-f9945-lvzfz-master-0/journal-previous.gz:Mar 20 06:14:07.554815 ci-op-8zbhh82n-f9945-lvzfz-master-0 kernel: NetworkManager[1192]: segfault at 1 ip 00005617e33ec719 sp 00007ffe03abdc70 error 4 in NetworkManager[5617e32ef000+273000] likely on CPU 5 (core 2, socket 0)

      upon node reboot.

      Node logs are available in that prow job's artifacts: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade/1770310694600708096/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/nodes/

      Sample log for master-0:

      Mar 20 06:14:07.535458 ci-op-8zbhh82n-f9945-lvzfz-master-0 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-master-30de036365d23d0bfd70e28276592c9c.
      Mar 20 06:14:07.539590 ci-op-8zbhh82n-f9945-lvzfz-master-0 root[322940]: machine-config-daemon[308499]: reboot successful
      Mar 20 06:14:07.549869 ci-op-8zbhh82n-f9945-lvzfz-master-0 systemd-logind[980]: The system will reboot now!
      Mar 20 06:14:07.553685 ci-op-8zbhh82n-f9945-lvzfz-master-0 systemd-logind[980]: System is rebooting.
      Mar 20 06:14:07.554815 ci-op-8zbhh82n-f9945-lvzfz-master-0 kernel: NetworkManager[1192]: segfault at 1 ip 00005617e33ec719 sp 00007ffe03abdc70 error 4 in NetworkManager[5617e32ef000+273000] likely on CPU 5 (core 2, socket 0)
      Mar 20 06:14:07.554905 ci-op-8zbhh82n-f9945-lvzfz-master-0 kernel: Code: a1 24 00 5b 48 89 ef 5d 41 5c 48 8b 40 30 ff e0 90 f3 0f 1e fa 55 48 85 f6 0f 84 82 00 00 00 48 89 f5 e8 5a b4 f1 ff 48 89 c6 <48> 8b 45 00 48 85 c0 74 05 48 3b 30 74 0c 48 89 ef e8 a1 8f f0 ff
      Mar 20 06:14:07.588310 ci-op-8zbhh82n-f9945-lvzfz-master-0 systemd[1]: machine-config-daemon-reboot.service: Deactivated successfully.
      Mar 20 06:14:07.588624 ci-op-8zbhh82n-f9945-lvzfz-master-0 systemd[1]: Stopped machine-config-daemon: Node will reboot into config rendered-master-30de036365d23d0bfd70e28276592c9c.
      Mar 20 06:14:07.590597 ci-op-8zbhh82n-f9945-lvzfz-master-0 systemd[1]: Stopping crio-conmon-7f350033cb3265beff7c7bb193639a4d1efb9da9a8dd1f8bd5d6bfcae3124ef5.scope...
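
Since no core dump is available, the kernel line above is the main evidence. As a minimal sketch (not part of any NetworkManager or OpenShift tooling), the following Python splits that line into its fields, computes the offset of the faulting instruction inside the reported mapping, and decodes the x86 page-fault error code. The names `SAMPLE`, `SEGFAULT_RE` and `parse_segfault` are hypothetical, and the hostname prefix is dropped from the sample for brevity.

```python
import re

# Sample line copied from the journal excerpt above (timestamp and hostname omitted).
SAMPLE = (
    "kernel: NetworkManager[1192]: segfault at 1 ip 00005617e33ec719 "
    "sp 00007ffe03abdc70 error 4 in NetworkManager[5617e32ef000+273000] "
    "likely on CPU 5 (core 2, socket 0)"
)

# All numeric fields except the PID are printed in hexadecimal.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return the fields of a kernel segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"], 16)
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "fault_addr": int(m["addr"], 16),
        # Offset of the faulting instruction from the start of the mapping.
        "offset_in_mapping": ip - base,
        # x86 page-fault error code bits: 0x1 protection fault (else page
        # not present), 0x2 write (else read), 0x4 user mode (else kernel).
        "not_present": not (err & 0x1),
        "write": bool(err & 0x2),
        "user_mode": bool(err & 0x4),
    }


info = parse_segfault(SAMPLE)
print(hex(info["offset_in_mapping"]))  # 0xfd719
```

For this line, `error 4` decodes to a user-mode read of a page that is not present, at address 0x1. The offset is relative to the start of the mapping the kernel reports, which is not necessarily the load base of the binary, so it may need adjusting before it is matched against symbols from the debuginfo package.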

      Please provide the package NVR for which bug is seen:

      NetworkManager-1-1.47.2-1.el9-x86_64

      How reproducible:

      This job is 1 of 8 that ran at the same time. All 8 jobs hit the problem, and in each job all 6 nodes were affected, so it appears to be reliably reproducible.

      Steps to reproduce

      1. This appears to happen during an upgrade (this is the second reboot in that node's log).

      Expected results

      No segfault.

      Actual results

      Segfault. Unfortunately, there are no core dumps.

      Since the segfault happens at boot it seems inconsequential, but our test looks for segfaults, so it fails.
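
To illustrate the kind of check that fails here, the sketch below scans journal lines for kernel segfault messages. It is an assumption about how such a check could look, not the actual OpenShift test; `SEGFAULT_MARKER`, `scan_lines` and `scan_journal` are hypothetical names, and the input is assumed to be a gzip-compressed text dump such as the `journal-previous.gz` files in the job artifacts.

```python
import gzip
import re

# Matches the kernel message format seen in this report.
SEGFAULT_MARKER = re.compile(r"kernel: (\S+)\[\d+\]: segfault at ")


def scan_lines(lines):
    """Return (line_number, process_name, line) for each segfault message."""
    hits = []
    for number, line in enumerate(lines, start=1):
        m = SEGFAULT_MARKER.search(line)
        if m:
            hits.append((number, m.group(1), line.rstrip("\n")))
    return hits


def scan_journal(path):
    """Scan a gzip-compressed journal dump such as journal-previous.gz."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return scan_lines(fh)
```

A check built this way reports any segfault regardless of when it occurs, which is why a crash during reboot still causes a failure.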

              bgalvani@redhat.com Beniamino Galvani
              dperique@redhat.com Dennis Periquet
              Network Management Team
              Filip Pokryvka