Type: Bug
Resolution: Unresolved
Affects Version: 4.12
Severity: Important
Impact: Quality / Stability / Reliability
Sprints: CNF Network Sprint 271, CNF Network Sprint 280
Description of problem:
When using VFs from Mellanox ConnectX-6 Lx cards, there is a high percentage of packet loss.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Deploy a bare-metal cluster with LACP bonding.
2. Create an SriovNetworkNodePolicy targeting the Mellanox ConnectX-6 Lx cards.
3. Deploy pods on two different nodes using VFs from the previous policy.
4. Run ICMP tests between the pods over the VF interface (see the sketch after this list); you will observe a packet loss rate above 50%.
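A minimal sketch of the test in step 4, assuming the pod names and the 172.16.151.0/26 addressing shown later in this report:
$ oc exec -it sriov-net-mlx-cx6-lx-eno12399-pod1 -- ping -c 20 -I net1 172.16.151.11
$ oc exec -it sriov-net-mlx-cx6-lx-eno12399-pod2 -- ping -c 20 -I net1 172.16.151.10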
Actual results:
Packet loss > 50% when performing ICMP tests
Expected results:
No packet loss when performing ICMP tests
Additional info:
This cluster has a configuration similar to Verizon VCP100, with an AMD EPYC 9654P 96-core processor and the following bonding configuration (an illustrative sketch follows the list):
- bond0 (eno12399, eno12409) - ConnectX-6 Lx (25Gbps) [lacp] - machine-network (br-ex)
- bond1 (ens3f0, ens3f1) - ConnectX-6 Dx (100Gbps) [lacp]
- bond2 (ens6f0, ens6f1) - ConnectX-6 Dx (100Gbps) [active-passive]
- SRIOV VFs are created from eno12399 and eno12409 (ConnectX-6 Lx - 25Gbps)
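For reference only, a hedged sketch of how an LACP bond such as bond0 above is typically declared via an NMState NodeNetworkConfigurationPolicy; the interface names match the list above, while the policy name and everything else are assumptions and not taken from this cluster:
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: bond0-lacp-policy          # hypothetical name, not from this cluster
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  desiredState:
    interfaces:
    - name: bond0
      type: bond
      state: up
      link-aggregation:
        mode: 802.3ad              # LACP, matching the [lacp] entries above
        port:
        - eno12399
        - eno12409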
NOTE: The same tests work in 4.14 and above (ICMP with no packet loss).
This is how SRIOV resources were prepared:
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.74   True        False         7d11h   Cluster version is 4.12.74

$ oc get nodes
NAME                                 STATUS   ROLES                  AGE     VERSION
master-0.vcp100.partnerci.bos2.lab   Ready    control-plane,master   7d11h   v1.25.16+1eb8682
master-1.vcp100.partnerci.bos2.lab   Ready    control-plane,master   7d11h   v1.25.16+1eb8682
master-2.vcp100.partnerci.bos2.lab   Ready    control-plane,master   7d11h   v1.25.16+1eb8682
worker-0.vcp100.partnerci.bos2.lab   Ready    worker                 7d11h   v1.25.16+1eb8682
worker-1.vcp100.partnerci.bos2.lab   Ready    worker                 7d11h   v1.25.16+1eb8682
worker-2.vcp100.partnerci.bos2.lab   Ready    worker                 7d11h   v1.25.16+1eb8682
worker-3.vcp100.partnerci.bos2.lab   Ready    worker                 7d11h   v1.25.16+1eb8682

$ cat sriov/sriov-policy-mlx-cx6-lx-eno12399.yml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx-cx6-lx-eno12399-policy1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  isRdma: true
  mtu: 9000
  nicSelector:
    deviceID: "101f"
    pfNames:
    - eno12399#0-7
    vendor: 15b3
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  numVfs: 8
  priority: 99
  resourceName: mlx_cx6_lx_eno12399_resource1

$ oc apply -f sriov/sriov-policy-mlx-cx6-lx-eno12399.yml
sriovnetworknodepolicy.sriovnetwork.openshift.io/mlx-cx6-lx-eno12399-policy1 created

$ cat sriov/sriov-network-mlx-cx6-lx-eno12399.yml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: mlx-cx6-lx-eno12399-net1
  namespace: openshift-sriov-network-operator
spec:
  logLevel: info
  networkNamespace: default
  resourceName: mlx_cx6_lx_eno12399_resource1
  spoofChk: "off"
  trust: "on"
  vlan: 3821
  capabilities: '{ "ips": true, "mac": true }'
  ipam: '{"type": "static"}'

$ oc apply -f sriov/sriov-network-mlx-cx6-lx-eno12399.yml
sriovnetwork.sriovnetwork.openshift.io/mlx-cx6-lx-eno12399-net1 created

$ oc get net-attach-def
NAME                       AGE
mlx-cx6-lx-eno12399-net1   2m14s
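The corresponding verification output was not captured for this report; as a hedged reference, these are the kinds of commands we would use to confirm the policy was applied and the VFs are advertised as allocatable resources:
$ oc get sriovnetworknodestates -n openshift-sriov-network-operator
$ oc describe node worker-0.vcp100.partnerci.bos2.lab | grep mlx_cx6_lx_eno12399_resource1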
This is how the pods were prepared:
$ cat sriov/sriov-net-mlx-cx6-lx-eno12399-pod1.yml
---
apiVersion: v1
kind: Pod
metadata:
  name: sriov-net-mlx-cx6-lx-eno12399-pod1
  annotations:
    k8s.v1.cni.cncf.io/networks: >
      [
        {
          "name": "mlx-cx6-lx-eno12399-net1",
          "mac": "00:11:22:33:44:01",
          "ips": ["172.16.151.10/26"],
          "namespace": "default"
        }
      ]
    cpu-load-balancing.crio.io: "disable"
    cpu-quota.crio.io: "disable"
    irq-load-balancing.crio.io: "disable"
spec:
  nodeName: worker-0.vcp100.partnerci.bos2.lab
  runtimeClassName: performance-blueprint-profile
  containers:
  - args:
    - while true; do sleep 99999999; done;
    command:
    - /bin/sh
    - -c
    - --
    image: mirror.gcr.io/wbitt/network-multitool:openshift
    imagePullPolicy: Always
    name: main
    resources:
      limits:
        cpu: "2"
        memory: 2Gi
        hugepages-1Gi: 2Gi
      requests:
        cpu: "2"
        memory: 2Gi
        hugepages-1Gi: 2Gi
    securityContext:
      capabilities:
        add:
        - IPC_LOCK
        - NET_ADMIN
        - AUDIT_WRITE
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepage
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
$ oc apply -f sriov/sriov-net-mlx-cx6-lx-eno12399-pod1.yml
pod/sriov-net-mlx-cx6-lx-eno12399-pod1 created
$ cat sriov/sriov-net-mlx-cx6-lx-eno12399-pod2.yml
---
apiVersion: v1
kind: Pod
metadata:
  name: sriov-net-mlx-cx6-lx-eno12399-pod2
  annotations:
    k8s.v1.cni.cncf.io/networks: >
      [
        {
          "name": "mlx-cx6-lx-eno12399-net1",
          "mac": "00:11:22:33:44:02",
          "ips": ["172.16.151.11/26"],
          "namespace": "default"
        }
      ]
    cpu-load-balancing.crio.io: "disable"
    cpu-quota.crio.io: "disable"
    irq-load-balancing.crio.io: "disable"
spec:
  nodeName: worker-1.vcp100.partnerci.bos2.lab
  runtimeClassName: performance-blueprint-profile
  containers:
  - args:
    - while true; do sleep 99999999; done;
    command:
    - /bin/sh
    - -c
    - --
    image: mirror.gcr.io/wbitt/network-multitool:openshift
    imagePullPolicy: Always
    name: main
    resources:
      limits:
        cpu: "2"
        memory: 2Gi
        hugepages-1Gi: 2Gi
      requests:
        cpu: "2"
        memory: 2Gi
        hugepages-1Gi: 2Gi
    securityContext:
      capabilities:
        add:
        - IPC_LOCK
        - NET_ADMIN
        - AUDIT_WRITE
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepage
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
$ oc apply -f sriov/sriov-net-mlx-cx6-lx-eno12399-pod2.yml
pod/sriov-net-mlx-cx6-lx-eno12399-pod2 created
$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
sriov-net-mlx-cx6-lx-eno12399-pod1 1/1 Running 0 53s 10.131.0.31 worker-0.vcp100.partnerci.bos2.lab <none> <none>
sriov-net-mlx-cx6-lx-eno12399-pod2 1/1 Running 0 42s 10.130.2.53 worker-1.vcp100.partnerci.bos2.lab <none>
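For completeness, the secondary (net1) attachment can be cross-checked from the Multus network-status annotation; this is a hedged example of the command only, its output was not captured for this report:
$ oc get pod sriov-net-mlx-cx6-lx-eno12399-pod1 -o jsonpath='{.metadata.annotations.k8s\.v1\.cni\.cncf\.io/network-status}'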
ICMP tests between pods and to the GW (172.16.151.1) show a high percentage of packet loss:
$ oc exec -it sriov-net-mlx-cx6-lx-eno12399-pod1 -- /bin/bash
sriov-net-mlx-cx6-lx-eno12399-pod1:/$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0@if49: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8900 qdisc noqueue state UP group default
    link/ether 0a:58:0a:83:00:1f brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.131.0.31/23 brd 10.131.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::858:aff:fe83:1f/64 scope link
       valid_lft forever preferred_lft forever
41: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether 00:11:22:33:44:01 brd ff:ff:ff:ff:ff:ff permaddr 62:04:57:80:72:9c
    inet 172.16.151.10/26 brd 172.16.151.63 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::211:22ff:fe33:4401/64 scope link
       valid_lft forever preferred_lft forever
sriov-net-mlx-cx6-lx-eno12399-pod1:/$ ping -c4 172.16.151.11
PING 172.16.151.11 (172.16.151.11) 56(84) bytes of data.
From 172.16.151.10 icmp_seq=1 Destination Host Unreachable

--- 172.16.151.11 ping statistics ---
4 packets transmitted, 0 received, +1 errors, 100% packet loss, time 3072ms
pipe 4
sriov-net-mlx-cx6-lx-eno12399-pod1:/$ ping -c4 172.16.151.1
PING 172.16.151.1 (172.16.151.1) 56(84) bytes of data.
64 bytes from 172.16.151.1: icmp_seq=1 ttl=64 time=0.317 ms
64 bytes from 172.16.151.1: icmp_seq=4 ttl=64 time=0.242 ms

--- 172.16.151.1 ping statistics ---
4 packets transmitted, 2 received, 50% packet loss, time 3111ms
rtt min/avg/max/mdev = 0.242/0.279/0.317/0.037 ms

$ oc exec -it sriov-net-mlx-cx6-lx-eno12399-pod2 -- /bin/bash
sriov-net-mlx-cx6-lx-eno12399-pod2:/$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0@if47: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8900 qdisc noqueue state UP group default
    link/ether 0a:58:0a:82:02:35 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.130.2.53/23 brd 10.130.3.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::858:aff:fe82:235/64 scope link
       valid_lft forever preferred_lft forever
40: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether 00:11:22:33:44:02 brd ff:ff:ff:ff:ff:ff permaddr 46:31:60:84:c6:c4
    inet 172.16.151.11/26 brd 172.16.151.63 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::211:22ff:fe33:4402/64 scope link
       valid_lft forever preferred_lft forever
sriov-net-mlx-cx6-lx-eno12399-pod2:/$ ping -c4 172.16.151.10
PING 172.16.151.10 (172.16.151.10) 56(84) bytes of data.
From 172.16.151.11 icmp_seq=1 Destination Host Unreachable

--- 172.16.151.10 ping statistics ---
4 packets transmitted, 0 received, +1 errors, 100% packet loss, time 3110ms
pipe 4
sriov-net-mlx-cx6-lx-eno12399-pod2:/$ ping -c4 172.16.151.1
PING 172.16.151.1 (172.16.151.1) 56(84) bytes of data.
64 bytes from 172.16.151.1: icmp_seq=3 ttl=64 time=0.248 ms

--- 172.16.151.1 ping statistics ---
4 packets transmitted, 1 received, 75% packet loss, time 3099ms
rtt min/avg/max/mdev = 0.248/0.248/0.248/0.000 ms
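Host-side VF state was not captured during these tests; as a hedged example only, this is the kind of check we would run on the worker node to inspect the VF link state and the spoofchk/trust settings applied by the policy:
$ oc debug node/worker-0.vcp100.partnerci.bos2.lab -- chroot /host ip link show eno12399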