Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-511

RHCOS + OCP 4.11: NetworkManager fails to initialize an IPv6 NIC

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Duplicate
    • Icon: Critical Critical
    • None
    • 4.11
    • RHCOS
    • None
    • Moderate
    • None
    • False
    • Hide

      None

      Show
      None

      Description of problem:
      This problem was first discovered with assisted installer CI (slack discussion).
      Upon boot, CRIO failed to bind to an IPv6 address:

      [root@ostest-extraworker-0 ~]# systemctl status crio
      ● crio.service - Container Runtime Interface for OCI (CRI-O)
         Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
        Drop-In: /etc/systemd/system/crio.service.d
                 └─10-mco-default-env.conf, 10-mco-default-madv.conf, 10-mco-profile-unix-socket.conf, 20-nodenet.conf
         Active: failed (Result: exit-code) since Tue 2022-08-23 10:25:59 UTC; 7min ago
           Docs: https://github.com/cri-o/cri-o
        Process: 2415 ExecStart=/usr/bin/crio $CRIO_CONFIG_OPTIONS $CRIO_RUNTIME_OPTIONS $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE)
       Main PID: 2415 (code=exited, status=1/FAILURE)
            CPU: 90ms
      
      Aug 23 10:25:59 ostest-extraworker-0 crio[2415]: time="2022-08-23 10:25:59.602701981Z" level=info msg="AppArmor is disabled by the system or at CRI-O build-time"
      Aug 23 10:25:59 ostest-extraworker-0 crio[2415]: time="2022-08-23 10:25:59.602708133Z" level=info msg="No blockio config file specified, blockio not configured"
      Aug 23 10:25:59 ostest-extraworker-0 crio[2415]: time="2022-08-23 10:25:59.602713412Z" level=info msg="RDT not available in the host system"
      Aug 23 10:25:59 ostest-extraworker-0 crio[2415]: time="2022-08-23 10:25:59.602961157Z" level=info msg="Updated default CNI network name to "
      Aug 23 10:25:59 ostest-extraworker-0 crio[2415]: time="2022-08-23 10:25:59.634908836Z" level=warning msg="Error encountered when checking whether cri-o should wipe images: version file /var/lib/crio/version not found: open /var/lib/crio/>
      Aug 23 10:25:59 ostest-extraworker-0 crio[2415]: time="2022-08-23 10:25:59.635765832Z" level=fatal msg="Failed to start streaming server: listen tcp [fd2e:6f44:5dd8:c956::1a]:10010: bind: cannot assign requested address"
      Aug 23 10:25:59 ostest-extraworker-0 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
      Aug 23 10:25:59 ostest-extraworker-0 systemd[1]: crio.service: Failed with result 'exit-code'.
      Aug 23 10:25:59 ostest-extraworker-0 systemd[1]: Failed to start Container Runtime Interface for OCI (CRI-O).
      Aug 23 10:25:59 ostest-extraworker-0 systemd[1]: crio.service: Consumed 90ms CPU time
      

      Yet, it did recover when I restarted the service via systemd.

      Looking closer at that machine, I noticed that this NetworkManager is also failing:

      [systemd]
      Failed Units: 1
        NetworkManager-wait-online.service
      

      journalctl log indicates that there were issues activating the NIC, which probably failed CRIO to bind since there was no IPv6 address allocated at the time of the bind attempt:

      Aug 23 10:43:10 ostest-extraworker-0 NetworkManager[1368]: <info>  [1661251390.9762] device (enp1s0): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
      Aug 23 10:43:10 ostest-extraworker-0 NetworkManager[1368]: <warn>  [1661251390.9765] device (enp1s0): Activation: failed for connection 'Wired connection 1'
      Aug 23 10:43:10 ostest-extraworker-0 NetworkManager[1368]: <info>  [1661251390.9767] device (enp1s0): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
      Aug 23 10:43:11 ostest-extraworker-0 NetworkManager[1368]: <info>  [1661251391.0055] dhcp4 (enp1s0): canceled DHCP transaction
      
      

      Version-Release number of selected component (if applicable):
      RHCOS + OCP 4.11

      How reproducible:
      100% --> see those CI runs

      Steps to Reproduce:
      1. Reproduced via CI as mentioned above. The failing node is the spoke cluster VM, which is configured to IPv6 with DHCP.

      Actual results:
      As mentioned above.

      Expected results:
      NetworkManager to operate appropriately and DHCP to work.

      Additional info:
      Does not reproduce with OCP 4.10: see CI results for this PR.

      This is a disconnected env, so it has no SOS report deployed. Instead, we captured the following:

      demsg log
      journalctl log
      ps -ef output
      ip a
      netstat -tnlpu

              Unassigned Unassigned
              nmagnezi@redhat.com Nir Magnezi (Inactive)
              Michael Nguyen Michael Nguyen
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

                Created:
                Updated:
                Resolved: