OCPBUGS-18919 (OpenShift Bugs)

NetworkManager attempts to manage Azure Accelerated Networking VF interfaces

      Description of problem:

      On Azure Accelerated Networking VMs there is an additional network interface that corresponds to the Mellanox SR-IOV virtual function (VF). This is a slave interface and should not be managed by NetworkManager. To accomplish that, we ship a udev rule, /usr/lib/udev/rules.d/68-azure-sriov-nm-unmanaged.rules.
      
      On RHCOS 8.x nodes (OCP 4.10.x) the rule is not effective: NetworkManager repeatedly tries to bring up DHCP on the VF interface.
      
      On RHCOS 9.x nodes (OCP 4.13.x) the rule is effective. The udev rule is identical in both versions.
      
      We are seeing this on multiple clusters in ARO, but presumably other non-ARO Azure clusters have the same issue.

      Version-Release number of selected component (if applicable):

      ARO OCP 4.10.63

      How reproducible:

      Always

      Steps to Reproduce:

      On an ARO cluster, configure a machineset with `acceleratedNetworking: true`. Then open a debug shell on a node from that machineset, run nmcli, and observe that the enP* interface is not reported as unmanaged.
      
      The udevadm output and the udev rule from both versions are included below for comparison.
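
The check described in these steps can be scripted. The following is a minimal sketch, not part of the original report: it compares the NM_UNMANAGED property reported by udev with the device state reported by NetworkManager. The function names (`udev_marks_unmanaged`, `nm_device_state`, `check_vf`) and the temporary file paths are hypothetical, and it assumes the terse `nmcli -t -f DEVICE,STATE device` output format of `DEVICE:STATE`.

```shell
#!/bin/sh
# Succeeds if udevadm-style output on stdin contains NM_UNMANAGED=1.
# Accepts both "E: KEY=VALUE" lines (udevadm info) and bare "KEY=VALUE" lines.
udev_marks_unmanaged() {
    grep -Eq '^[[:space:]]*(E: )?NM_UNMANAGED=1[[:space:]]*$'
}

# Prints the STATE column for device $1, reading the output of
# `nmcli -t -f DEVICE,STATE device` (format "DEVICE:STATE") from stdin.
# Prints nothing if the device is not listed.
nm_device_state() {
    awk -F: -v dev="$1" '$1 == dev { print substr($0, length(dev) + 2); exit }'
}

# Arguments: device name, file with udevadm output, file with nmcli output.
# Returns 1 when udev marks the device unmanaged but NetworkManager
# reports any state other than "unmanaged".
check_vf() {
    dev=$1
    udev_out=$2
    nmcli_out=$3
    state=$(nm_device_state "$dev" < "$nmcli_out")
    if udev_marks_unmanaged < "$udev_out" && [ "$state" != "unmanaged" ]; then
        echo "MISMATCH: $dev has NM_UNMANAGED=1 but NetworkManager reports '$state'"
        return 1
    fi
    echo "OK: $dev state is '$state'"
}

# Hypothetical usage from a node debug shell:
#   dev=enP64657s1
#   udevadm info "/sys/class/net/$dev" > /tmp/udev.txt
#   nmcli -t -f DEVICE,STATE device > /tmp/nm.txt
#   check_vf "$dev" /tmp/udev.txt /tmp/nm.txt
```

Reading from captured files rather than calling the tools directly keeps the comparison reproducible, so the same function can be run against output saved from a 4.10 node and a 4.13 node.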

      Actual results:

      ### this is 4.10.63
      sh-4.4# nmcli
      enP64657s1: connecting (getting IP configuration) to Wired Connection
              "Mellanox MT27500/MT27520"
              ethernet (mlx4_core), 00:0D:3A:1C:B6:7E, hw, mtu 1500
      
      sh-4.4# udevadm info /sys/class/net/enP64657s1 
      P: /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f08e4542-fc91-4540-b468-241618eeb6f1/pcifc91:00/fc91:00:02.0/net/enP64657s1
      E: DEVPATH=/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f08e4542-fc91-4540-b468-241618eeb6f1/pcifc91:00/fc91:00:02.0/net/enP64657s1
      E: ID_BUS=pci
      E: ID_MODEL_FROM_DATABASE=MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
      E: ID_MODEL_ID=0x1004
      E: ID_NET_DRIVER=mlx4_en
      E: ID_NET_LINK_FILE=/usr/lib/systemd/network/99-default.link
      E: ID_NET_NAME=enP64657s1
      E: ID_NET_NAME_MAC=enx000d3a1cb67e
      E: ID_NET_NAME_PATH=enP64657p0s2
      E: ID_NET_NAME_SLOT=enP64657s1
      E: ID_NET_NAMING_SCHEME=rhel-8.0
      E: ID_OUI_FROM_DATABASE=Microsoft Corp.
      E: ID_PATH=acpi-VMBUS:01-pci-fc91:00:02.0
      E: ID_PATH_TAG=acpi-VMBUS_01-pci-fc91_00_02_0
      E: ID_PCI_CLASS_FROM_DATABASE=Network controller
      E: ID_PCI_SUBCLASS_FROM_DATABASE=Ethernet controller
      E: ID_VENDOR_FROM_DATABASE=Mellanox Technologies
      E: ID_VENDOR_ID=0x15b3
      E: IFINDEX=3
      E: INTERFACE=enP64657s1
      E: NM_UNMANAGED=1
      E: SUBSYSTEM=net
      E: SYSTEMD_ALIAS=/sys/subsystem/net/devices/enP64657s1
      E: TAGS=:systemd:
      E: USEC_INITIALIZED=106604677
      
      sh-4.4# cat /host/usr/lib/udev/rules.d/68-azure-sriov-nm-unmanaged.rules
      # Accelerated Networking on Azure exposes a new SRIOV interface to the VM.
      # This interface is transparently bonded to the synthetic interface,
      # so NetworkManager should just ignore any SRIOV interfaces.
      SUBSYSTEM=="net", DRIVERS=="hv_pci", ACTION=="add|change|move", ENV{NM_UNMANAGED}="1"

      Expected results:

      ### this is 4.13.x, where the rule is effective
      enP15620s1: unmanaged
              "Mellanox MT27500/MT27520"
              ethernet (mlx4_core), 00:0D:3A:9B:C8:4F, hw, mtu 1500
      
      sh-5.1# udevadm info /sys/class/net/enP15620s1
      P: /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/eedca831-3d04-4e81-ab11-54d44b14a726/pci3d04:00/3d04:00:02.0/net/enP15620s1
      M: enP15620s1
      R: 1
      U: net
      I: 3
      E: DEVPATH=/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/eedca831-3d04-4e81-ab11-54d44b14a726/pci3d04:00/3d04:00:02.0/net/enP15620s1
      E: SUBSYSTEM=net
      E: INTERFACE=enP15620s1
      E: IFINDEX=3
      E: USEC_INITIALIZED=9986811
      E: NM_UNMANAGED=1
      E: ID_NET_NAMING_SCHEME=rhel-9.0
      E: ID_NET_NAME_MAC=enx000d3a9bc84f
      E: ID_OUI_FROM_DATABASE=Microsoft Corp.
      E: ID_NET_NAME_PATH=enP15620p0s2
      E: ID_NET_NAME_SLOT=enP15620s1
      E: ID_BUS=pci
      E: ID_VENDOR_ID=0x15b3
      E: ID_MODEL_ID=0x1004
      E: ID_PCI_CLASS_FROM_DATABASE=Network controller
      E: ID_PCI_SUBCLASS_FROM_DATABASE=Ethernet controller
      E: ID_VENDOR_FROM_DATABASE=Mellanox Technologies
      E: ID_MODEL_FROM_DATABASE=MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
      E: ID_PATH=acpi-VMBUS:01-pci-3d04:00:02.0
      E: ID_PATH_TAG=acpi-VMBUS_01-pci-3d04_00_02_0
      E: ID_NET_DRIVER=mlx4_en
      E: ID_NET_LINK_FILE=/etc/systemd/network/98-nmstate-enP15620s1.link
      E: ID_NET_NAME=enP15620s1
      E: SYSTEMD_ALIAS=/sys/subsystem/net/devices/enP15620s1
      E: TAGS=:systemd:
      E: CURRENT_TAGS=:systemd:
      
      sh-5.1# cat /usr/lib/udev/rules.d/68-azure-sriov-nm-unmanaged.rules
      # Accelerated Networking on Azure exposes a new SRIOV interface to the VM.
      # This interface is transparently bonded to the synthetic interface,
      # so NetworkManager should just ignore any SRIOV interfaces.
      SUBSYSTEM=="net", DRIVERS=="hv_pci", ACTION=="add|change|move", ENV{NM_UNMANAGED}="1"

      Additional info:

      The udev rule sets NM_UNMANAGED=1 as intended on both versions, but on 4.10.63 NetworkManager does not honor the flag. The logs show NM repeatedly attempting DHCP on this interface and timing out after 90 seconds each time:
      
      sh-4.4# journalctl -b NM_DEVICE=enP64657s1
      ...
      Sep 12 22:28:49 aro-adenton-6l5bx-worker-eastus1-cnnv5 NetworkManager[1440]: <warn>  [1694557729.8525] dhcp4 (enP64657s1): request timed out
      Sep 12 22:28:49 aro-adenton-6l5bx-worker-eastus1-cnnv5 NetworkManager[1440]: <info>  [1694557729.8526] dhcp4 (enP64657s1): state changed unknown -> timeout
      Sep 12 22:28:49 aro-adenton-6l5bx-worker-eastus1-cnnv5 NetworkManager[1440]: <info>  [1694557729.8527] device (enP64657s1): state change: ip-config -> failed (reason 'ip-config-unavai>
      Sep 12 22:28:49 aro-adenton-6l5bx-worker-eastus1-cnnv5 NetworkManager[1440]: <warn>  [1694557729.8542] device (enP64657s1): Activation: failed for connection 'Wired Connection'
      Sep 12 22:28:49 aro-adenton-6l5bx-worker-eastus1-cnnv5 NetworkManager[1440]: <info>  [1694557729.8545] device (enP64657s1): state change: failed -> disconnected (reason 'none', sys-if>
      Sep 12 22:28:49 aro-adenton-6l5bx-worker-eastus1-cnnv5 NetworkManager[1440]: <info>  [1694557729.8827] dhcp4 (enP64657s1): canceled DHCP transaction
      Sep 12 22:28:49 aro-adenton-6l5bx-worker-eastus1-cnnv5 NetworkManager[1440]: <info>  [1694557729.8828] dhcp4 (enP64657s1): state changed timeout -> done
      Sep 12 22:33:49 aro-adenton-6l5bx-worker-eastus1-cnnv5 NetworkManager[1440]: <info>  [1694558029.8528] device (enP64657s1): Activation: starting connection 'Wired Connection' (1667573>
      Sep 12 22:33:49 aro-adenton-6l5bx-worker-eastus1-cnnv5 NetworkManager[1440]: <info>  [1694558029.8530] device (enP64657s1): state change: disconnected -> prepare (reason 'none', sys-i>
      Sep 12 22:33:49 aro-adenton-6l5bx-worker-eastus1-cnnv5 NetworkManager[1440]: <info>  [1694558029.8534] device (enP64657s1): state change: prepare -> config (reason 'none', sys-iface-s>
      Sep 12 22:33:49 aro-adenton-6l5bx-worker-eastus1-cnnv5 NetworkManager[1440]: <info>  [1694558029.8542] device (enP64657s1): state change: config -> ip-config (reason 'none', sys-iface>
      Sep 12 22:33:49 aro-adenton-6l5bx-worker-eastus1-cnnv5 NetworkManager[1440]: <info>  [1694558029.8547] dhcp4 (enP64657s1): activation: beginning transaction (timeout in 90 seconds)
      Sep 12 22:33:49 aro-adenton-6l5bx-worker-eastus1-cnnv5 NetworkManager[1440]: <warn>  [1694558029.8611] device (enP64657s1): linklocal6: DAD failed for an EUI-64 address
      Sep 12 22:35:19 aro-adenton-6l5bx-worker-eastus1-cnnv5 NetworkManager[1440]: <warn>  [1694558119.8726] dhcp4 (enP64657s1): request timed out
      Sep 12 22:35:19 aro-adenton-6l5bx-worker-eastus1-cnnv5 NetworkManager[1440]: <info>  [1694558119.8727] dhcp4 (enP64657s1): state changed unknown -> timeout
      Sep 12 22:35:19 aro-adenton-6l5bx-worker-eastus1-cnnv5 NetworkManager[1440]: <info>  [1694558119.8728] device (enP64657s1): state change: ip-config -> failed (reason 'ip-config-unavai>
      Sep 12 22:35:19 aro-adenton-6l5bx-worker-eastus1-cnnv5 NetworkManager[1440]: <warn>  [1694558119.8744] device (enP64657s1): Activation: failed for connection 'Wired Connection'
      Sep 12 22:35:19 aro-adenton-6l5bx-worker-eastus1-cnnv5 NetworkManager[1440]: <info>  [1694558119.8751] device (enP64657s1): state change: failed -> disconnected (reason 'none', sys-if>
      Sep 12 22:35:19 aro-adenton-6l5bx-worker-eastus1-cnnv5 NetworkManager[1440]: <info>  [1694558119.9046] dhcp4 (enP64657s1): canceled DHCP transaction
      Sep 12 22:35:19 aro-adenton-6l5bx-worker-eastus1-cnnv5 NetworkManager[1440]: <info>  [1694558119.9046] dhcp4 (enP64657s1): state changed timeout -> done
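
To gauge how often this retry loop fires on a node, the timeouts in the journal can be counted. This is a small sketch under the assumption that the log lines match the format shown above; the function name `count_dhcp_timeouts` is hypothetical.

```shell
#!/bin/sh
# Counts DHCPv4 request timeouts for device $1 in journalctl output read from stdin.
count_dhcp_timeouts() {
    # -F matches the pattern literally (the parentheses are not regex syntax);
    # -c prints the number of matching lines.
    # grep exits non-zero when the count is 0, so mask that with "|| true".
    grep -cF "dhcp4 ($1): request timed out" || true
}

# Hypothetical usage on an affected node:
#   journalctl -b NM_DEVICE=enP64657s1 | count_dhcp_timeouts enP64657s1
```

A count that keeps growing between two runs indicates NetworkManager is still attempting to activate the VF interface.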

            rhn-gps-dmabe Dusty Mabe
            rh-ee-adenton Andrew Denton
            Michael Nguyen Michael Nguyen