Bug
Resolution: Unresolved
Critical
4.17.0
Quality / Stability / Reliability
When a pod is configured with multiple SR-IOV network attachments, the resulting logical interface names inside the pod (e.g., net1, net2, net3) are not deterministic. The mapping of these logical names to the underlying physical devices is inconsistent across different nodes in the cluster, even when the node hardware and SR-IOV configuration are identical.
Pod 1 (Master)
HCA mlx5_18 for interface net1
HCA mlx5_14 for interface net2
HCA mlx5_10 for interface net3
HCA mlx5_6 for interface net4
Pod 2 (Worker)
HCA mlx5_6 for interface net1
HCA mlx5_10 for interface net2
HCA mlx5_14 for interface net3
HCA mlx5_18 for interface net4
Master:
net1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000 inet 192.168.5.3 netmask 255.255.255.0 broadcast 192.168.5.255
net2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000 inet 192.168.6.3 netmask 255.255.255.0 broadcast 192.168.6.255
net3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000 inet 192.168.7.3 netmask 255.255.255.0 broadcast 192.168.7.255
net4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000 inet 192.168.8.3 netmask 255.255.255.0 broadcast 192.168.8.255
Worker:
net1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000 inet 192.168.5.4 netmask 255.255.255.0 broadcast 192.168.5.255
net2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000 inet 192.168.6.4 netmask 255.255.255.0 broadcast 192.168.6.255
net3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000 inet 192.168.7.4 netmask 255.255.255.0 broadcast 192.168.7.255
net4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000 inet 192.168.8.4 netmask 255.255.255.0 broadcast 192.168.8.255
This behavior breaks tightly coupled distributed workloads such as HPC applications and AI/ML training jobs, because NCCL expects the HCA-to-interface mapping to be identical across all participating pods/nodes.
Expected Behavior
The logical interface order should be deterministic and consistent across all pods using the same annotation. The first network requested in the annotation should always become net1, the second should become net2, and so on. This would result in an identical mapping on all nodes.
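For illustration, a pod requesting four attachments might carry an annotation like the sketch below (the SriovNetwork names, pod name, and image are placeholders, not taken from this report); the expectation is that list position, not node-local device enumeration, decides which attachment becomes net1 through net4:

apiVersion: v1
kind: Pod
metadata:
  name: sriov-workload                          # hypothetical pod name
  annotations:
    # Placeholder network names. Expected mapping on every node:
    # 1st entry -> net1, 2nd -> net2, 3rd -> net3, 4th -> net4.
    k8s.v1.cni.cncf.io/networks: sriov-vf0,sriov-vf1,sriov-vf2,sriov-vf3
spec:
  containers:
  - name: workload
    image: registry.example.com/workload:latest # placeholder image
    command: ["sleep", "infinity"]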
Example of the configuration I used, breaking a NIC with 4 VFs into 4 separate resources:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: cx7-eno5np0-vf0
  namespace: openshift-sriov-network-operator
spec:
  resourceName: cx7_eno5_vf0
  numVfs: 4
  priority: 10
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  nicSelector:
    pfNames: ["eno5np0#0-0"]
    vendor: "15b3"
    deviceID: "1021"
  deviceType: netdevice
  isRdma: true
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: cx7-eno5np0-vf1
  namespace: openshift-sriov-network-operator
spec:
  resourceName: cx7_eno5_vf1
  numVfs: 4
  priority: 10
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  nicSelector:
    pfNames: ["eno5np0#1-1"]
    vendor: "15b3"
    deviceID: "1021"
  deviceType: netdevice
  isRdma: true
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: cx7-eno5np0-vf2
  namespace: openshift-sriov-network-operator
spec:
  resourceName: cx7_eno5_vf2
  numVfs: 4
  priority: 10
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  nicSelector:
    pfNames: ["eno5np0#2-2"]
    vendor: "15b3"
    deviceID: "1021"
  deviceType: netdevice
  isRdma: true
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: cx7-eno5np0-vf3
  namespace: openshift-sriov-network-operator
spec:
  resourceName: cx7_eno5_vf3
  numVfs: 4
  priority: 10
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  nicSelector:
    pfNames: ["eno5np0#3-3"]
    vendor: "15b3"
    deviceID: "1021"
  deviceType: netdevice
  isRdma: true
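The node policies above only define the device-plugin resources; each resource still needs a SriovNetwork (which the operator renders into a NetworkAttachmentDefinition) for pods to reference in their networks annotation. A minimal sketch for the first resource follows; the network name, target namespace, and IPAM configuration are assumptions rather than details from this report (the whereabouts range merely mirrors the 192.168.5.0/24 addresses seen above):

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-vf0                        # placeholder, matches the annotation sketch above
  namespace: openshift-sriov-network-operator
spec:
  resourceName: cx7_eno5_vf0
  networkNamespace: default              # assumed namespace of the consuming pod
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.5.0/24"
    }

Analogous SriovNetwork objects would be needed for cx7_eno5_vf1 through cx7_eno5_vf3.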