Fast Datapath Product / FDP-2950

QE verification: Non-trivial drop in bitrate for pod_on_cnv2pod_on_cnv communications

      ( ) The bug has been reproduced and verified by QE members
      ( ) Test coverage has been added to downstream CI
      ( ) For new feature, failed test plans have bugs added as children to the epic
      ( ) The bug is cloned to any relevant release that we support and/or is needed

    • rhel-9

      This ticket is tracking the QE verification effort for the solution to the problem described below.

       Problem Description: Clearly explain the issue.

      Testing pod2pod communications between two OCP CNV nodes shows a non-trivial drop in performance when the CNV MTU is 8900. With iperf3, the observed bitrate is around 60 Mb/s, compared to roughly 9200 Mb/s for cnv2cnv.
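
      For reference, a minimal sketch of the iperf3 measurement; the address below is a placeholder for the server pod's podNetwork IP:

      ```
      # Inside the server pod:
      iperf3 -s

      # Inside the client pod on the other CNV node; 10.128.2.15 stands in
      # for the server pod's podNetwork IP:
      iperf3 -c 10.128.2.15 -t 30
      ```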

       

      bm2bm perf is fine

      pod2pod perf is fine

      cnv2cnv perf is fine

      pod_on_cnv2pod_on_cnv is less than 1% of the three other situations.

      Adjusting the CNV MTU to 3600 improves performance: pod_on_cnv2pod_on_cnv reaches 2.9 Gb/s, while cnv2cnv is about 6.5 Gb/s.
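
      For illustration only: on OVN-Kubernetes, the cluster MTU change is driven through the network operator's MTU migration. A sketch using the values above (the full documented procedure also migrates the machine network MTU and involves node reboots):

      ```
      # Sketch: start migrating the CNV cluster overlay MTU from 8900 to 3600.
      oc patch Network.operator.openshift.io cluster --type=merge \
        --patch '{"spec":{"migration":{"mtu":{"network":{"from":8900,"to":3600}}}}}'
      ```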

       

      Packet captures and retis captures taken while reproducing the behavior report UDP checksum errors (UDP_CSUM) as the reason for the drops, seen as InCsumErrors in `netstat -s` output.
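
      The checksum errors can be confirmed from the kernel counters; the retis invocation is a sketch, since collector names and flags vary between retis versions:

      ```
      # UDP checksum error counters on the receiving CNV node:
      netstat -s | grep -i csum
      nstat -az UdpInCsumErrors

      # Trace dropped packets while reproducing (sketch; see `retis collect --help`
      # for the collectors available in the installed version):
      retis collect -c skb,skb-drop
      ```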

      Packets are not forwarded from br-ex to the geneve interface on the receiving CNV node until a correct UDP checksum is observed.
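
      One way to observe this on the receiving CNV node; Geneve runs over UDP port 6081, and on OVN-Kubernetes nodes the geneve interface is typically named genev_sys_6081:

      ```
      # Geneve-encapsulated traffic arriving on br-ex; -vv makes tcpdump flag
      # bad UDP checksums in the decoded output:
      tcpdump -ni br-ex -vv udp port 6081

      # Compare with what is actually forwarded to the geneve interface:
      tcpdump -ni genev_sys_6081 -vv
      ```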

       

       Impact Assessment: Describe the severity and impact (e.g., network down, availability of a workaround, etc.).

       

      This impacts network performance for IBM customers running OCP on OCP with jumbo frames in use.

       Software Versions: Specify the exact versions in use (e.g., openvswitch3.1-3.1.0-147.el8fdp).

       

      openvswitch3.5-3.5.0-19.el9fdp (CNV)

      openvswitch3.3-3.3.0-62.el9fdp (BM)

      OCP 4.18.20 (CNV)

      OCP 4.16.32 (BM)
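
      A sketch of how the versions above can be confirmed; the node name is a placeholder:

      ```
      # Open vSwitch package on a node:
      oc debug node/<node> -- chroot /host rpm -qa | grep openvswitch

      # Cluster version, run against each cluster (BM and CNV):
      oc get clusterversion
      ```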

       

        Issue Type: Indicate whether this is a new issue or a regression (if a regression, state the last known working version).

      Not known whether this is a new issue or a regression; this is a new deployment.

       Reproducibility: Confirm if the issue can be reproduced consistently. If not, describe how often it occurs.

      100% reproducible.

       Reproduction Steps: Provide detailed steps or scripts to replicate the issue.

      1. Deploy BM OCP with an MTU of 9000
      2. Deploy CNV nodes for OCP on OCP with an MTU of 8900
      3. Create an iperf3 deployment using podNetwork communications (see the sketch below)
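
      A sketch of step 3, assuming hypothetical pod names, node names, and a generic iperf3 image:

      ```
      # Server pod pinned to one CNV node (image and node names are placeholders):
      oc run iperf3-server --image=quay.io/example/iperf3 \
        --overrides='{"apiVersion":"v1","spec":{"nodeName":"cnv-node-a"}}' -- iperf3 -s

      # Client pod on the other CNV node, targeting the server's podNetwork IP:
      SERVER_IP=$(oc get pod iperf3-server -o jsonpath='{.status.podIP}')
      oc run iperf3-client --image=quay.io/example/iperf3 \
        --overrides='{"apiVersion":"v1","spec":{"nodeName":"cnv-node-b"}}' -- iperf3 -c "$SERVER_IP" -t 30
      ```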

       Expected Behavior: Describe what should happen under normal circumstances.

      The iperf3 bitrate should be in the thousands of Mb/s rather than the tens.

       Observed Behavior: Explain what actually happens.

      The iperf3 bitrate is in the tens of Mb/s (~60 Mb/s).

       Troubleshooting Actions: Outline the steps taken to diagnose or resolve the issue so far.

      Packet captures and retis captures were taken.

      netstat output and /proc/net/dev counters were reviewed on both the bare metal and CNV nodes.
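
      A sketch of that review; the counters of interest are the per-interface drop counts and the UDP InCsumErrors noted above:

      ```
      # Per-interface RX/TX packet and drop counters; sample before and after an
      # iperf3 run on both the BM and CNV nodes and compare:
      cat /proc/net/dev

      # UDP-level counters; InCsumErrors incrementing during the run matches the
      # packet-capture and retis findings:
      netstat -su
      ```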

       Logs: If you collected logs, please provide them (e.g. sos report, /var/log/openvswitch/*, testpmd console)

              ovs-qe Openvswitch Quality Engineering Bot
              rhn-support-jshivers Jacob Shivers