Fast Datapath Product / FDP-781

[backport 4.15] OCP 4.14.14 - UDP traffic flood leading to TCP packet loss/application health probe failures (lost SYNs from kubelet to pod)


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • openvswitch3.3
    • rhel-sst-network-fastdatapath
    • ssg_networking
    • Severity: Important

      Description of problem:

      The customer has an application running on two well-provisioned worker nodes. The application is exposed via MetalLB, and a load test is sending UDP traffic at a rate exceeding 30k requests per second.
      
      They are aiming for a rate of 120k requests per second and observe failures well before reaching that rate.
      
      The backends of the application (and multiple other applications on the nodes under this request load) start to fail their local kubelet --> pod liveness probes. Health checks fail, pods are marked NotReady, and the target application's pods are removed from MetalLB as valid backends, causing flapping.
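
      One way to see this from the node side is to repeat a kubelet-style probe by hand while the UDP load is applied. The sketch below is illustrative only: it assumes a TCP liveness probe, and the pod IP and port are placeholders rather than values from this case; timeouts here line up with the probe failures described above.

      #!/usr/bin/env python3
      # Re-run a kubelet-style TCP probe from the node's host network namespace
      # and log how long each attempt takes. POD_IP and PROBE_PORT are
      # placeholders for illustration, not values taken from this case.
      import socket
      import time

      POD_IP = "10.128.2.15"   # placeholder: pod IP of a failing backend
      PROBE_PORT = 8080        # placeholder: the container's probe port
      TIMEOUT_S = 1.0          # kubelet's default probe timeout is 1 second

      while True:
          start = time.monotonic()
          try:
              with socket.create_connection((POD_IP, PROBE_PORT), timeout=TIMEOUT_S):
                  print(f"ok   {time.monotonic() - start:.3f}s")
          except OSError as exc:
              # A timeout means the SYN (or SYN-ACK) never made it through,
              # which is what marks the pod NotReady under load.
              print(f"FAIL {time.monotonic() - start:.3f}s: {exc}")
          time.sleep(1)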
      
      Other pods on the node, including DNS and Elasticsearch (unrelated namespaces/services that simply happen to be hosted on the same node), are also impacted.
      
      We see this issue any time we fire sufficient load at the service.    
      
      Traffic flow:
      client --> [external network/firewall/LBs] --> BGP speaker pod [1] DNAT --> NODE (host) [br-ex] via [bond1.2039] --> br-int --> [OVN logical gateways/switches] --> eth0 (pod)

      Version-Release number of selected component (if applicable):

          OCP 4.14.14
      
      bare-metal cluster

      How reproducible:

      every time    

      Steps to Reproduce:

          1. Send a high rate of UDP requests to the target application's BGP-advertised (MetalLB) address (a load-generator sketch follows after these steps).
          2. Observe that once traffic reaches roughly 30k requests per second, pods/services on the node become unstable; the traffic flood prevents pods from answering their own health probes.
          3. Observe SYNs dropped on requests to the backends: a TCP capture on the node shows SYN retries, while a capture inside the pod shows the SYN never arriving, i.e. it is dropped in transit through the OVN gateways (see the capture-analysis sketch below).
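
      For step 1, a minimal load-generator sketch follows. The target IP/port and rate are placeholders, and this is not the customer's actual tool; it only illustrates the traffic shape: many small UDP requests spread across many source ports, each of which becomes its own conntrack entry. A single Python process will likely not sustain 30k req/s by itself; run several copies or use a dedicated load tool.

      #!/usr/bin/env python3
      # Send small UDP datagrams at a target rate, spread over many source
      # ports. TARGET and RATE_PER_SEC are placeholders for illustration.
      import socket
      import time

      TARGET = ("192.0.2.10", 5000)   # placeholder: MetalLB-advertised VIP and UDP port
      RATE_PER_SEC = 30000            # placeholder target rate
      PAYLOAD = b"x" * 64             # small request payload
      NUM_SOCKETS = 512               # each socket = a distinct source port = a distinct conntrack entry

      socks = [socket.socket(socket.AF_INET, socket.SOCK_DGRAM) for _ in range(NUM_SOCKETS)]
      interval = 1.0 / RATE_PER_SEC
      next_send = time.monotonic()
      i = 0
      while True:
          socks[i % NUM_SOCKETS].sendto(PAYLOAD, TARGET)
          i += 1
          next_send += interval
          delay = next_send - time.monotonic()
          if delay > 0:
              time.sleep(delay)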
          
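      For step 3, the captures can be checked for SYN retransmissions with something like the sketch below. It assumes the third-party scapy package, and "node.pcap" is a placeholder for a tcpdump capture taken on the node side (br-ex/bond). The same SYN appearing more than once means the handshake is being retried.

      #!/usr/bin/env python3
      # Count retransmitted SYNs in a capture: the same (src, dst, sport, dport, seq)
      # SYN seen more than once means the client is retrying the handshake.
      from collections import Counter
      from scapy.all import rdpcap, IP, TCP

      syns = Counter()
      for pkt in rdpcap("node.pcap"):      # placeholder capture filename
          if IP in pkt and TCP in pkt and pkt[TCP].flags == "S":
              key = (pkt[IP].src, pkt[IP].dst, pkt[TCP].sport, pkt[TCP].dport, pkt[TCP].seq)
              syns[key] += 1

      retried = {k: n for k, n in syns.items() if n > 1}
      print(f"{len(retried)} of {len(syns)} SYNs were retransmitted")
      for (src, dst, sport, dport, _seq), n in sorted(retried.items(), key=lambda kv: -kv[1])[:10]:
          print(f"  {src}:{sport} -> {dst}:{dport}  sent {n} times")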

      Actual results:

      Traffic is effectively limited to ~30k requests/s; above that, node networking degrades and the cluster cannot handle the desired workload.

      Expected results:

          Traffic should not be limited at this rate, and UDP traffic should not interfere with TCP handshakes (or, if this is an expected limit, it needs documentation and explicit limit validation).

      Additional info:

         See the first comment below (private note) for full details and the collaboration request.

       

       

      //UPDATE: This appears to be due to the conntrack hash bucket size limitation: new connections are dropped once a hash chain grows too long (the chaintoolong counter increments).

      Detailed in KCS: https://access.redhat.com/solutions/7073555 

      pending upstream kernel testing
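
      A rough way to check for this condition on an affected node is sketched below. The sysctl paths are standard; the chaintoolong column only exists on kernels that enforce the per-bucket chain-length limit, so the sketch looks counters up by name rather than by position. Rising insert_failed/drop/chaintoolong totals while the UDP load is applied would be consistent with the bucket-size limitation described above.

      #!/usr/bin/env python3
      # Dump conntrack sizing sysctls and summed per-CPU counters from
      # /proc/net/stat/nf_conntrack (a header row of column names, then one
      # hex-encoded row per CPU).

      def read_int(path):
          with open(path) as f:
              return int(f.read().strip())

      count = read_int("/proc/sys/net/netfilter/nf_conntrack_count")
      maximum = read_int("/proc/sys/net/netfilter/nf_conntrack_max")
      buckets = read_int("/proc/sys/net/netfilter/nf_conntrack_buckets")
      print(f"count={count} max={maximum} buckets={buckets} "
            f"avg_chain_len={count / buckets:.1f}")

      with open("/proc/net/stat/nf_conntrack") as f:
          header = f.readline().split()
          totals = [0] * len(header)
          for line in f:
              for i, field in enumerate(line.split()):
                  totals[i] += int(field, 16)

      stats = dict(zip(header, totals))
      # Column names vary by kernel version; chaintoolong is only present on
      # kernels that enforce the per-bucket chain-length limit.
      for name in ("insert_failed", "drop", "early_drop", "chaintoolong"):
          if name in stats:
              print(f"{name}={stats[name]}")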

              People: Aaron Conole (aconole@redhat.com), Will Russell (rhn-support-wrussell), Jaime Caamaño Ruiz, Anurag Saxena