
OCP 4.14 - OVS upcall handling issue at load scale from multiple different client tuples

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • openvswitch3.3

      Given a system that needs to handle high volumes of requests from thousands of different IP sources with OVS, 

      When the system receives a sustained load of requests exceeding 75000 per second with varied source IPs, 

      Then OVS should handle the traffic without generating excessive upcalls that would cause CPU overload and failure of health checks for the pods.

    • rhel-sst-network-fastdatapath
    • ssg_networking
    • Important

      //ISSUE:

      Customer has a product application running on two well-provisioned worker nodes. The application is exposed using MetalLB, and a load test is being driven against it with UDP traffic at a rate exceeding 75k qps.

      They are aiming for a rate of 120k requests per second and observe failures well before reaching that rate. The backends of the application (and multiple other applications on the nodes being hammered by the request rate) start to fail their local kubelet --> pod liveness probes.

      Health checks begin to fail and pods are marked NotReady. This removes the target application from MetalLB as a valid backend and causes flapping behavior across all pods on the hosts.


      This is a continuation of https://issues.redhat.com/browse/FDP-579, which was addressed/resolved with an increased conntrack bucket size plus a recommended upgrade to 4.16. However, we are revisiting this problem because it has been observed that our previous problem replicator was using only a few localized IP addresses/tuples in its requests to the nodes. When testing in production, we immediately overwhelm OVS on the nodes, and the primary difference is that we have thousands of unique IPs/tuples establishing these requests, compared to the 5 or so localized IPs from the same load test we were running internally.
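
      To rule the FDP-579 issue back in or out on an affected node, a quick check of the conntrack tunables is useful. The sketch below is illustrative only (not part of the case data): it reads the standard nf_conntrack proc tunables and reports table occupancy and average hash-chain length; the 80% threshold is an arbitrary example.

{code:python}
#!/usr/bin/env python3
# Illustrative sketch: verify the conntrack bucket override is in effect and
# that the conntrack table itself is not the bottleneck. The proc paths are
# the standard nf_conntrack tunables; the 80% threshold is an example only.

def read_int(path: str) -> int:
    with open(path) as f:
        return int(f.read().strip())

buckets = read_int("/proc/sys/net/netfilter/nf_conntrack_buckets")
maximum = read_int("/proc/sys/net/netfilter/nf_conntrack_max")
count = read_int("/proc/sys/net/netfilter/nf_conntrack_count")

print(f"buckets={buckets} max={maximum} count={count}")
# Long average chains indicate the bucket count is too small for the load.
print(f"average chain length: {count / buckets:.2f}")
if count > 0.8 * maximum:
    print("warning: conntrack table above 80% of nf_conntrack_max")
{code}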


      We modified the local replicator to have several thousand IP addresses and immediately replicated the behavior. 
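
      For reference, a minimal sketch of the kind of replicator change described above (this is not the customer's actual tool): it hand-builds IP/UDP headers on a raw socket so every packet carries a different spoofed source address, presenting OVS with a constant stream of new tuples. The destination VIP, port, and source prefix are placeholders, and pure Python will not reach 75k qps on its own; the point is only the tuple variety.

{code:python}
#!/usr/bin/env python3
# Hedged sketch of a many-source-IP UDP replicator (placeholders throughout).
# Requires root for the raw socket; IP_HDRINCL is implied by IPPROTO_RAW.
import random
import socket
import struct

DST_IP = "192.0.2.10"        # placeholder: MetalLB service VIP
DST_PORT = 5000              # placeholder: exposed UDP port
SRC_PREFIX = "10.{}.{}.{}"   # placeholder: pool of spoofed client addresses
PAYLOAD = b"x" * 64

def checksum(data: bytes) -> int:
    """Standard 16-bit ones'-complement checksum for the IPv4 header."""
    if len(data) % 2:
        data += b"\x00"
    s = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    s = (s >> 16) + (s & 0xFFFF)
    s += s >> 16
    return ~s & 0xFFFF

def build_packet(src_ip: str) -> bytes:
    src = socket.inet_aton(src_ip)
    dst = socket.inet_aton(DST_IP)
    udp_len = 8 + len(PAYLOAD)
    # UDP checksum left at 0 (optional for IPv4) to keep the sketch short.
    udp = struct.pack("!HHHH", 40000, DST_PORT, udp_len, 0) + PAYLOAD
    total_len = 20 + udp_len
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, total_len, 0, 0, 64,
                     socket.IPPROTO_UDP, 0, src, dst)
    ip = ip[:10] + struct.pack("!H", checksum(ip)) + ip[12:]
    return ip + udp

def main() -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_RAW)
    while True:
        # Each packet uses a different source IP, so OVS sees a new tuple and
        # (unless the megaflow wildcards the source) must take a new upcall.
        src_ip = SRC_PREFIX.format(random.randint(0, 255),
                                   random.randint(0, 255),
                                   random.randint(1, 254))
        sock.sendto(build_packet(src_ip), (DST_IP, 0))

if __name__ == "__main__":
    main()
{code}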


      Local nodes still have a conntrack bucket override in place to ensure we aren't hitting the problem outlined previously. The symptoms appear the same, but the networking team believes it is probable that we are making a new upcall for each new tuple and OVS can't keep up.
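
      One way to sanity-check that hypothesis on an affected node is to watch the kernel datapath upcall counters. The sketch below is an assumption about how such a check could be scripted, not an agreed procedure: it samples `ovs-dpctl show` twice and prints the per-second missed (upcall) and lost (dropped-upcall) rates, then dumps `ovs-appctl upcall/show` for handler/revalidator status.

{code:python}
#!/usr/bin/env python3
# Hedged sketch: sample kernel datapath stats twice and report upcall pressure.
# "missed" lookups become upcalls to ovs-vswitchd; "lost" means the upcall
# queue overflowed and the datapath dropped packets.
import re
import subprocess
import time

def dp_stats() -> dict:
    out = subprocess.run(["ovs-dpctl", "show"], capture_output=True,
                         text=True, check=True).stdout
    m = re.search(r"lookups:\s*hit:(\d+)\s+missed:(\d+)\s+lost:(\d+)", out)
    assert m, "unexpected ovs-dpctl show output"
    return {"hit": int(m.group(1)), "missed": int(m.group(2)),
            "lost": int(m.group(3))}

def main() -> None:
    interval = 10
    a = dp_stats()
    time.sleep(interval)
    b = dp_stats()
    print("upcalls/s:      ", (b["missed"] - a["missed"]) / interval)
    print("lost upcalls/s: ", (b["lost"] - a["lost"]) / interval)
    # Handler/revalidator view for the same window.
    print(subprocess.run(["ovs-appctl", "upcall/show"],
                         capture_output=True, text=True).stdout)

if __name__ == "__main__":
    main()
{code}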


      Need assistance reviewing available data samples. 


      OCP 4.14.14, OVN-Kubernetes.

      Traffic flow, details, and case information are in the first comment (internal).

              rh-ee-mpattric Mike Pattrick
              rhn-support-wrussell Will Russell
              Aaron Conole