OCPBUGS-55312: packet loss in VFs from Mellanox ConnectX-6 Lx cards

    • Project: OpenShift Bugs
    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version: 4.12
    • Component: Networking / SR-IOV
    • Impact: Quality / Stability / Reliability
    • Severity: Important
    • Sprints: CNF Network Sprint 271, CNF Network Sprint 280

      Description of problem:

      When using VFs from Mellanox ConnectX-6 Lx cards, ICMP traffic over the VFs shows a high rate of packet loss (> 50%).
          

      Version-Release number of selected component (if applicable):

      4.12
          

      How reproducible:

      Always
          

      Steps to Reproduce:

          1. Deploy a bare-metal cluster with LACP bonding.
          2. Create a SriovNetworkNodePolicy for the Mellanox ConnectX-6 Lx cards.
          3. Deploy pods using VFs from that policy on two different nodes.
          4. Run ICMP tests between the pods over the VF interface; a high rate of packet loss (> 50%) is observed.
          

      Actual results:

      Packet loss > 50% when performing ICMP tests
          

      Expected results:

      No packet loss when performing ICMP tests
          

      Additional info:

      This cluster has a configuration similar to Verizon VCP100 with AMD EPYC 9654P 96-Core Processor and the following bonding configuration:
      - bond0 (eno12399, eno12409) - ConnectX-6 Lx (25Gbps) [lacp]  - machine-network (br-ex)
      - bond1 (ens3f0, ens3f1) - ConnectX-6 Dx (100Gbps) [lacp]
      - bond2 (ens6f0, ens6f1) - ConnectX-6 Dx (100Gbps) [active-passive]
      - SR-IOV VFs are created from eno12399 and eno12409 (ConnectX-6 Lx - 25Gbps)
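
      For context, bond0 is an LACP (802.3ad) bond over the two ConnectX-6 Lx ports. A minimal NMState-style sketch of such a bond follows; the interface names are from this cluster, while the mode and layout are illustrative rather than the exact install-time configuration:

      interfaces:
      - name: bond0
        type: bond
        state: up
        link-aggregation:
          mode: 802.3ad        # LACP
          port:
          - eno12399
          - eno12409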
      
      NOTE: The same tests work in 4.14 and above (ICMP with no packet loss).
          

      This is how the SR-IOV resources were prepared:

      $ oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.74   True        False         7d11h   Cluster version is 4.12.74
      
      $ oc get nodes
      NAME                                 STATUS   ROLES                  AGE     VERSION
      master-0.vcp100.partnerci.bos2.lab   Ready    control-plane,master   7d11h   v1.25.16+1eb8682
      master-1.vcp100.partnerci.bos2.lab   Ready    control-plane,master   7d11h   v1.25.16+1eb8682
      master-2.vcp100.partnerci.bos2.lab   Ready    control-plane,master   7d11h   v1.25.16+1eb8682
      worker-0.vcp100.partnerci.bos2.lab   Ready    worker                 7d11h   v1.25.16+1eb8682
      worker-1.vcp100.partnerci.bos2.lab   Ready    worker                 7d11h   v1.25.16+1eb8682
      worker-2.vcp100.partnerci.bos2.lab   Ready    worker                 7d11h   v1.25.16+1eb8682
      worker-3.vcp100.partnerci.bos2.lab   Ready    worker                 7d11h   v1.25.16+1eb8682
      
      $ cat sriov/sriov-policy-mlx-cx6-lx-eno12399.yml                                                                  
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkNodePolicy
      metadata:
        name: mlx-cx6-lx-eno12399-policy1
        namespace: openshift-sriov-network-operator
      spec:
        deviceType: netdevice
        isRdma: true
        mtu: 9000
        nicSelector:
          deviceID: "101f"
          pfNames:
          - eno12399#0-7
          vendor: 15b3
        nodeSelector:
          node-role.kubernetes.io/worker: ""
        numVfs: 8
        priority: 99
        resourceName: mlx_cx6_lx_eno12399_resource1
      
      $ oc apply -f sriov/sriov-policy-mlx-cx6-lx-eno12399.yml
      sriovnetworknodepolicy.sriovnetwork.openshift.io/mlx-cx6-lx-eno12399-policy1 created
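
      Before creating the SriovNetwork, VF provisioning can be verified through the operator's per-node state objects (a quick sanity check; SriovNetworkNodeState is a standard CRD of the SR-IOV Network Operator):

      $ oc get sriovnetworknodestates -n openshift-sriov-network-operator \
          worker-0.vcp100.partnerci.bos2.lab -o jsonpath='{.status.syncStatus}'
      # expected to print "Succeeded" once the 8 VFs on eno12399 are configured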
      
      $ cat sriov/sriov-network-mlx-cx6-lx-eno12399.yml 
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetwork
      metadata:
        name: mlx-cx6-lx-eno12399-net1
        namespace: openshift-sriov-network-operator
      spec:
        logLevel: info
        networkNamespace: default
        resourceName: mlx_cx6_lx_eno12399_resource1
        spoofChk: "off"
        trust: "on"
        vlan: 3821
        capabilities: '{ "ips": true, "mac": true }'
        ipam: '{"type": "static"}'
      
      $ oc apply -f sriov/sriov-network-mlx-cx6-lx-eno12399.yml                                                         
      sriovnetwork.sriovnetwork.openshift.io/mlx-cx6-lx-eno12399-net1 created
      
      $ oc get net-attach-def
      NAME                       AGE
      mlx-cx6-lx-eno12399-net1   2m14s
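
      The resulting VF pool can also be confirmed in the workers' allocatable resources (an illustrative check; the resource name comes from the policy above):

      $ oc describe node worker-0.vcp100.partnerci.bos2.lab | grep mlx_cx6_lx_eno12399_resource1
      # Capacity and Allocatable should list openshift.io/mlx_cx6_lx_eno12399_resource1: 8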
      

      This is how the pods were prepared:

      $ cat sriov/sriov-net-mlx-cx6-lx-eno12399-pod1.yml                                                                
      ---
      apiVersion: v1
      kind: Pod
      metadata:
        name: sriov-net-mlx-cx6-lx-eno12399-pod1
        annotations:
          k8s.v1.cni.cncf.io/networks: >
            [
               {
                 "name": "mlx-cx6-lx-eno12399-net1",
                 "mac": "00:11:22:33:44:01",
                 "ips": ["172.16.151.10/26"],
                 "namespace": "default"
               }
            ]
          cpu-load-balancing.crio.io: "disable"
          cpu-quota.crio.io: "disable"
          irq-load-balancing.crio.io: "disable"
      spec:
        nodeName: worker-0.vcp100.partnerci.bos2.lab
        runtimeClassName: performance-blueprint-profile
        containers:
        - args:
          - while true; do sleep 99999999; done;
          command:
          - /bin/sh
          - -c
          - --
          image: mirror.gcr.io/wbitt/network-multitool:openshift
          imagePullPolicy: Always
          name: main
          resources:
            limits:
              cpu: "2"
              memory: 2Gi
              hugepages-1Gi: 2Gi
            requests:
              cpu: "2"
              memory: 2Gi
              hugepages-1Gi: 2Gi
          securityContext:
            capabilities:
              add:
              - IPC_LOCK
              - NET_ADMIN
              - AUDIT_WRITE
          volumeMounts:
          - mountPath: /dev/hugepages
            name: hugepage
        volumes:
        - name: hugepage
          emptyDir:
            medium: HugePages
      
      $ oc apply -f sriov/sriov-net-mlx-cx6-lx-eno12399-pod1.yml 
      pod/sriov-net-mlx-cx6-lx-eno12399-pod1 created
      
      $ cat sriov/sriov-net-mlx-cx6-lx-eno12399-pod2.yml                                                                
      ---
      apiVersion: v1
      kind: Pod
      metadata:
        name: sriov-net-mlx-cx6-lx-eno12399-pod2
        annotations:
          k8s.v1.cni.cncf.io/networks: >
            [
               {
                 "name": "mlx-cx6-lx-eno12399-net1",
                 "mac": "00:11:22:33:44:02",
                 "ips": ["172.16.151.11/26"],
                 "namespace": "default"
               }
            ]
          cpu-load-balancing.crio.io: "disable"
          cpu-quota.crio.io: "disable"
          irq-load-balancing.crio.io: "disable"
      spec:
        nodeName: worker-1.vcp100.partnerci.bos2.lab
        runtimeClassName: performance-blueprint-profile
        containers:
        - args:
          - while true; do sleep 99999999; done;
          command:
          - /bin/sh
          - -c
          - --
          image: mirror.gcr.io/wbitt/network-multitool:openshift
          imagePullPolicy: Always
          name: main
          resources:
            limits:
              cpu: "2"
              memory: 2Gi
              hugepages-1Gi: 2Gi
            requests:
              cpu: "2"
              memory: 2Gi
              hugepages-1Gi: 2Gi
          securityContext:
            capabilities:
              add:
              - IPC_LOCK
              - NET_ADMIN
              - AUDIT_WRITE
          volumeMounts:
          - mountPath: /dev/hugepages
            name: hugepage
        volumes:
        - name: hugepage
          emptyDir:
            medium: HugePages
      
      $ oc apply -f sriov/sriov-net-mlx-cx6-lx-eno12399-pod2.yml                                               
      pod/sriov-net-mlx-cx6-lx-eno12399-pod2 created
      
      $ oc get pods -o wide
      NAME                                 READY   STATUS    RESTARTS   AGE   IP            NODE                                 NOMINATED NODE   READINESS GATES
      sriov-net-mlx-cx6-lx-eno12399-pod1   1/1     Running   0          53s   10.131.0.31   worker-0.vcp100.partnerci.bos2.lab   <none>           <none>
      sriov-net-mlx-cx6-lx-eno12399-pod2   1/1     Running   0          42s   10.130.2.53   worker-1.vcp100.partnerci.bos2.lab   <none>           <none>
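
      Each pod should also have the allocated VF's PCI address injected by the SR-IOV device plugin, which is a quick way to confirm the VF assignment (the exact variable name below is assumed, derived from the resource name by the device plugin's usual convention):

      $ oc exec sriov-net-mlx-cx6-lx-eno12399-pod1 -- env | grep PCIDEVICE
      # expected: PCIDEVICE_OPENSHIFT_IO_MLX_CX6_LX_ENO12399_RESOURCE1=<VF PCI address>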
      

      ICMP tests between pods and to the GW (172.16.151.1) show a high percentage of packet loss:

      $ oc exec -it sriov-net-mlx-cx6-lx-eno12399-pod1 -- /bin/bash
      sriov-net-mlx-cx6-lx-eno12399-pod1:/$ ip a
      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
          inet 127.0.0.1/8 scope host lo
             valid_lft forever preferred_lft forever
          inet6 ::1/128 scope host 
             valid_lft forever preferred_lft forever
      2: eth0@if49: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8900 qdisc noqueue state UP group default 
          link/ether 0a:58:0a:83:00:1f brd ff:ff:ff:ff:ff:ff link-netnsid 0
          inet 10.131.0.31/23 brd 10.131.1.255 scope global eth0
             valid_lft forever preferred_lft forever
          inet6 fe80::858:aff:fe83:1f/64 scope link 
             valid_lft forever preferred_lft forever
      41: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
          link/ether 00:11:22:33:44:01 brd ff:ff:ff:ff:ff:ff permaddr 62:04:57:80:72:9c
          inet 172.16.151.10/26 brd 172.16.151.63 scope global net1
             valid_lft forever preferred_lft forever
          inet6 fe80::211:22ff:fe33:4401/64 scope link 
             valid_lft forever preferred_lft forever
      
      sriov-net-mlx-cx6-lx-eno12399-pod1:/$ ping -c4 172.16.151.11
      PING 172.16.151.11 (172.16.151.11) 56(84) bytes of data.
      From 172.16.151.10 icmp_seq=1 Destination Host Unreachable
      --- 172.16.151.11 ping statistics ---
      4 packets transmitted, 0 received, +1 errors, 100% packet loss, time 3072ms
      pipe 4
      
      sriov-net-mlx-cx6-lx-eno12399-pod1:/$ ping -c4 172.16.151.1
      PING 172.16.151.1 (172.16.151.1) 56(84) bytes of data.
      64 bytes from 172.16.151.1: icmp_seq=1 ttl=64 time=0.317 ms
      64 bytes from 172.16.151.1: icmp_seq=4 ttl=64 time=0.242 ms
      --- 172.16.151.1 ping statistics ---
      4 packets transmitted, 2 received, 50% packet loss, time 3111ms
      rtt min/avg/max/mdev = 0.242/0.279/0.317/0.037 ms
      
      $ oc exec -it sriov-net-mlx-cx6-lx-eno12399-pod2 -- /bin/bash
      sriov-net-mlx-cx6-lx-eno12399-pod2:/$ ip a
      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
          inet 127.0.0.1/8 scope host lo
             valid_lft forever preferred_lft forever
          inet6 ::1/128 scope host 
             valid_lft forever preferred_lft forever
      2: eth0@if47: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8900 qdisc noqueue state UP group default 
          link/ether 0a:58:0a:82:02:35 brd ff:ff:ff:ff:ff:ff link-netnsid 0
          inet 10.130.2.53/23 brd 10.130.3.255 scope global eth0
             valid_lft forever preferred_lft forever
          inet6 fe80::858:aff:fe82:235/64 scope link 
             valid_lft forever preferred_lft forever
      40: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
          link/ether 00:11:22:33:44:02 brd ff:ff:ff:ff:ff:ff permaddr 46:31:60:84:c6:c4
          inet 172.16.151.11/26 brd 172.16.151.63 scope global net1
             valid_lft forever preferred_lft forever
          inet6 fe80::211:22ff:fe33:4402/64 scope link 
             valid_lft forever preferred_lft forever
      
      sriov-net-mlx-cx6-lx-eno12399-pod2:/$ ping -c4 172.16.151.10
      PING 172.16.151.10 (172.16.151.10) 56(84) bytes of data.
      From 172.16.151.11 icmp_seq=1 Destination Host Unreachable
      --- 172.16.151.10 ping statistics ---
      4 packets transmitted, 0 received, +1 errors, 100% packet loss, time 3110ms
      pipe 4
      
      sriov-net-mlx-cx6-lx-eno12399-pod2:/$ ping -c4 172.16.151.1
      PING 172.16.151.1 (172.16.151.1) 56(84) bytes of data.
      64 bytes from 172.16.151.1: icmp_seq=3 ttl=64 time=0.248 ms
      --- 172.16.151.1 ping statistics ---
      4 packets transmitted, 1 received, 75% packet loss, time 3099ms
      rtt min/avg/max/mdev = 0.248/0.248/0.248/0.000 ms
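
      A possible way to narrow down where the frames are dropped is to inspect the PF and its VFs on the worker and watch ARP on the wire (a suggested debug session; interface and node names are from this cluster):

      $ oc debug node/worker-0.vcp100.partnerci.bos2.lab
      sh-4.4# chroot /host
      sh-4.4# ip link show eno12399       # lists each VF with its MAC, VLAN, spoofchk and trust settings
      sh-4.4# tcpdump -nei eno12399 arp   # do the pods' ARP requests/replies actually reach the PF?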
      

              Assignee: Sebastian Scheinkman
              Reporter: Manuel Rodriguez (Inactive)
              QA Contact: Zhiqiang Fang