OCPBUGS-58338

Inconsistent Network Interface Order in Pods Across Nodes for Multi-NIC SR-IOV Setups


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • 4.17.0
    • Networking / SR-IOV
    • Quality / Stability / Reliability

      When a pod is configured with multiple SR-IOV network attachments, the resulting logical interface names inside the pod (e.g., net1, net2, net3) are not deterministic. The mapping of these logical names to the underlying physical devices is inconsistent across different nodes in the cluster, even when the node hardware and SR-IOV configuration are identical.

      Pod 1 (Master)
      HCA mlx5_18 for interface net1
      HCA mlx5_14 for interface net2
      HCA mlx5_10 for interface net3
      HCA mlx5_6 for interface net4

      Pod 2 (Worker)
      HCA mlx5_6 for interface net1
      HCA mlx5_10 for interface net2
      HCA mlx5_14 for interface net3
      HCA mlx5_18 for interface net4
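
The mismatch listed above can be checked mechanically. The following is a minimal Python sketch, not part of the original report: it hard-codes the two mappings quoted above and lists every interface whose HCA differs between the pods. The function name `find_mismatches` and the dictionary layout are illustrative assumptions.

```python
# Interface-to-HCA mappings copied from the report above.
master = {"net1": "mlx5_18", "net2": "mlx5_14", "net3": "mlx5_10", "net4": "mlx5_6"}
worker = {"net1": "mlx5_6", "net2": "mlx5_10", "net3": "mlx5_14", "net4": "mlx5_18"}


def find_mismatches(a, b):
    """Return {iface: (hca_in_a, hca_in_b)} for interfaces whose HCA differs."""
    return {
        iface: (a.get(iface), b.get(iface))
        for iface in sorted(set(a) | set(b))
        if a.get(iface) != b.get(iface)
    }


mismatches = find_mismatches(master, worker)
for iface, (hca_a, hca_b) in mismatches.items():
    print(f"{iface}: master={hca_a} worker={hca_b}")
```

With the data from this report, all four interfaces are flagged, because the worker's order is the exact reverse of the master's.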

      Master:

      net1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
              inet 192.168.5.3  netmask 255.255.255.0  broadcast 192.168.5.255
      --
      net2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
              inet 192.168.6.3  netmask 255.255.255.0  broadcast 192.168.6.255
      --
      net3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
              inet 192.168.7.3  netmask 255.255.255.0  broadcast 192.168.7.255
      --
      net4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
              inet 192.168.8.3  netmask 255.255.255.0  broadcast 192.168.8.255

      Worker:

      net1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
              inet 192.168.5.4  netmask 255.255.255.0  broadcast 192.168.5.255
      --
      net2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
              inet 192.168.6.4  netmask 255.255.255.0  broadcast 192.168.6.255
      --
      net3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
              inet 192.168.7.4  netmask 255.255.255.0  broadcast 192.168.7.255
      --
      net4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
              inet 192.168.8.4  netmask 255.255.255.0  broadcast 192.168.8.255

      This behavior breaks tightly coupled distributed workloads, such as HPC applications and AI/ML training jobs, because NCCL expects each HCA to map to the same interface name on all participating pods and nodes.
       

      Expected Behavior
      The logical interface order should be deterministic and consistent across all pods using the same annotation. The first network requested in the annotation should always become net1, the second should become net2, and so on. This would result in an identical mapping on all nodes.
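
The expected rule can be written as a short Python sketch. It assumes the networks are requested as a comma-separated list in the pod's `k8s.v1.cni.cncf.io/networks` annotation. The network names below are hypothetical, and the function only illustrates the expected ordering, not how Multus or the SR-IOV CNI is implemented.

```python
def expected_interface_names(annotation):
    """Map each requested network to netN by its position in the annotation."""
    networks = [n.strip() for n in annotation.split(",") if n.strip()]
    return {f"net{i}": name for i, name in enumerate(networks, start=1)}


# Hypothetical network names, one per SR-IOV resource.
annotation = "cx7-vf0-net, cx7-vf1-net, cx7-vf2-net, cx7-vf3-net"
expected = expected_interface_names(annotation)
for ifname, network in expected.items():
    print(f"{ifname} -> {network}")
```

Because the result depends only on the annotation string, every pod that uses the same annotation would get the same mapping, regardless of the node it lands on.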

      Example of the configuration I used, which splits one NIC with 4 VFs into 4 resources:

      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkNodePolicy
      metadata:
        name: cx7-eno5np0-vf0
        namespace: openshift-sriov-network-operator
      spec:
        resourceName: cx7_eno5_vf0
        numVfs: 4
        priority: 10
        nodeSelector:
          feature.node.kubernetes.io/network-sriov.capable: "true"
        nicSelector:
          pfNames: ["eno5np0#0-0"]
          vendor: "15b3"
          deviceID: "1021"
        deviceType: netdevice
        isRdma: true
      ---
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkNodePolicy
      metadata:
        name: cx7-eno5np0-vf1
        namespace: openshift-sriov-network-operator
      spec:
        resourceName: cx7_eno5_vf1
        numVfs: 4
        priority: 10
        nodeSelector:
          feature.node.kubernetes.io/network-sriov.capable: "true"
        nicSelector:
          pfNames: ["eno5np0#1-1"]
          vendor: "15b3"
          deviceID: "1021"
        deviceType: netdevice
        isRdma: true
      ---
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkNodePolicy
      metadata:
        name: cx7-eno5np0-vf2
        namespace: openshift-sriov-network-operator
      spec:
        resourceName: cx7_eno5_vf2
        numVfs: 4
        priority: 10
        nodeSelector:
          feature.node.kubernetes.io/network-sriov.capable: "true"
        nicSelector:
          pfNames: ["eno5np0#2-2"]
          vendor: "15b3"
          deviceID: "1021"
        deviceType: netdevice
        isRdma: true
      ---
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkNodePolicy
      metadata:
        name: cx7-eno5np0-vf3
        namespace: openshift-sriov-network-operator
      spec:
        resourceName: cx7_eno5_vf3
        numVfs: 4
        priority: 10
        nodeSelector:
          feature.node.kubernetes.io/network-sriov.capable: "true"
        nicSelector:
          pfNames: ["eno5np0#3-3"]
          vendor: "15b3"
          deviceID: "1021"
        deviceType: netdevice
        isRdma: true
      ---
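
The four policies above differ only in the VF index, so they can be generated from one template. The Python sketch below is an illustration under that assumption and is not part of the original report. The names `POLICY_TEMPLATE` and `render_policies` are hypothetical. All field values are copied from the manifests above.

```python
# Template built from the policies above; only the VF index varies.
POLICY_TEMPLATE = """\
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: cx7-eno5np0-vf{vf}
  namespace: openshift-sriov-network-operator
spec:
  resourceName: cx7_eno5_vf{vf}
  numVfs: {num_vfs}
  priority: 10
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  nicSelector:
    pfNames: ["eno5np0#{vf}-{vf}"]
    vendor: "15b3"
    deviceID: "1021"
  deviceType: netdevice
  isRdma: true
"""


def render_policies(num_vfs=4):
    """Render one policy per VF index, joined as a multi-document YAML string."""
    docs = [POLICY_TEMPLATE.format(vf=vf, num_vfs=num_vfs) for vf in range(num_vfs)]
    return "---\n".join(docs)


manifest = render_policies()
print(manifest)
```

Each rendered policy selects a single VF through the `pfNames` range (`#0-0`, `#1-1`, and so on) and exposes it as its own resource name.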

              thaller@redhat.com Thomas Haller
              bbenshab@redhat.com Boaz Ben Shabat
              Zhiqiang Fang