OpenShift Bugs / OCPBUGS-62687

OCP 4.18.24: HAProxy router pods are consuming significantly more CPU/MEM than previous builds with no corresponding change in load - overwhelming cluster stability


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version: 4.18.z
    • Component: Networking / router
    • Quality / Stability / Reliability
    • Important
    • Rejected
    • NI&D Sprint 278
    • 1

      Description of problem:

          In 4.18.24 and 4.18.22 the platform is running the following deployment stack, with multiple replicas per deployment. All router pods are exposed via the NodePort publishing type on the GCP platform, with 1 pod per host, session affinity rules, and externalTrafficPolicy: Local to enforce client IP persistence. The load balancer forwards traffic to the given infra node. Each pod serves a single host (no overlapping hosts); a sketch of this publishing setup follows the deployment listing below.
      
      [wrussell@supportshell-2 04264539]$ oc get deployment
      NAME              READY   UP-TO-DATE   AVAILABLE   AGE
      router-default    32/32   32           32          4y
      router-external   16/16   16           16          4y
      router-pci        32/32   32           32          3y
      router-pci-2      4/4     4            4           266d
      router-pii        8/8     8            8           3y
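
      For reference, the publishing setup described above would look roughly like the sketch below. This is a hedged illustration only: the shard name, labels, replica count, and Service name are hypothetical, not the customer's actual configuration.

      # Illustrative IngressController shard published via a NodePort Service
      apiVersion: operator.openshift.io/v1
      kind: IngressController
      metadata:
        name: external
        namespace: openshift-ingress-operator
      spec:
        replicas: 16
        endpointPublishingStrategy:
          type: NodePortService
        routeSelector:
          matchLabels:
            type: external          # each shard serves a disjoint set of routes/hosts
        nodePlacement:
          nodeSelector:
            matchLabels:
              node-role.kubernetes.io/infra: ""

      # externalTrafficPolicy/sessionAffinity as described, applied to the NodePort Service (name hypothetical)
      oc -n openshift-ingress patch svc router-nodeport-external --type=merge \
        -p '{"spec":{"externalTrafficPolicy":"Local","sessionAffinity":"ClientIP"}}'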
      
      What we are seeing is that on 4.14 and 4.16 (earlier versions will be confirmed when the customer completes image validation testing), the cluster load was HIGHER than what is currently being sent to the platform.
      
      Router pods were crashing continually until threads were increased to 8 (from 4) and maxconn was increased to 400,000 (from 50,000). Since then, the pods have been stable. HOWEVER, CPU utilization is now redlining (90% CPU consumption) on all nodes, and router-default pods are crashing due to resource exhaustion even at 70% of the expected throughput during load tests. (Reaching 100% throughput crashes the pods even with the expanded values.)
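
      For clarity, the thread/maxconn expansion described above corresponds to the IngressController tuningOptions fields; a minimal sketch (the ingresscontroller name is illustrative, the values are the ones from this report):

      oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge \
        -p '{"spec":{"tuningOptions":{"threadCount":8,"maxConnections":400000}}}'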
      
      IdleConnectionTerminationPolicy set to Immediate was tested; it resulted in a large number of I/O failures for client sessions without a corresponding reduction in memory/CPU pressure during the load test.
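
      The tested policy corresponds to a patch along these lines (ingresscontroller name illustrative):

      oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge \
        -p '{"spec":{"idleConnectionTerminationPolicy":"Immediate"}}'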
      
      Hard-stop-after was also set (with a very short window of 5m), which did reap connections, but no corresponding reduction in CPU was observed. We also confirmed that the router pods average around 50-60 processes and roughly 400-500 threads, which is below kubelet's and CRI-O's maximum PID reap thresholds.
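
      The hard-stop-after window and the process/thread check mentioned above can be expressed roughly as follows (ingresscontroller and pod names illustrative; assumes ps is available in the router image):

      # hard-stop-after=5m as tested, applied per IngressController
      oc -n openshift-ingress-operator annotate ingresscontroller/default \
        ingress.operator.openshift.io/hard-stop-after=5m --overwrite

      # rough process/thread counts inside a router pod
      oc -n openshift-ingress exec router-default-xxxxx -- \
        bash -c 'ps -e | grep -c haproxy; ps -eL | grep -c haproxy'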
      
      At peak load we expect 4.5K connections per second, which is LESS than the peaks observed/expected on previous versions, yet we are currently unable to reach even that value: CPU pressure thresholds are being hit at roughly 70% of that maximum.
      
      Nodes have 8 CPU / 32 GB memory and are tainted to prevent any workloads other than routers from being scheduled on them.
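
      For context, isolating the routers onto those tainted nodes is typically a node taint plus a matching toleration in nodePlacement; a sketch with hypothetical key/values (not necessarily the customer's taint):

      # hypothetical taint on the router-only infra nodes
      oc adm taint nodes -l node-role.kubernetes.io/infra= \
        node-role.kubernetes.io/infra=reserved:NoSchedule --overwrite

      # matching toleration on the IngressController (name illustrative)
      oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge \
        -p '{"spec":{"nodePlacement":{"tolerations":[{"key":"node-role.kubernetes.io/infra","value":"reserved","effect":"NoSchedule"}]}}}'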
      
      Multiple/all routes are impacted because the router pods fall over when their utilization climbs too high. We have already scaled out significantly; we used to run with roughly half this number of replicas while handling more throughput with less CPU/memory usage, so something has significantly changed in resource consumption.
      
      
      

      Version-Release number of selected component (if applicable):

      haproxy version: 2.8.10-f28885f
      ocp version: 4.18.22
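
      One way to confirm the embedded HAProxy build from a running router pod (pod name illustrative; assumes the default openshift-ingress namespace):

      oc -n openshift-ingress exec router-default-xxxxx -- haproxy -v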

      How reproducible:

          Continual; the production platform is impacted. Scaling out to keep adapting to the problem is the current workaround. The customer can also replicate the issue in a lower environment with load testing.
      
      

      Steps to Reproduce:

          1. Deploy a 4.16 cluster.
          2. Perform baseline load testing with multiple replicas of the router pods to confirm CPU/memory utilization rates (see the measurement sketch after these steps).
          3. Upgrade to 4.18.22 and retest: observe that, with no corresponding increase in load, memory/CPU consumption is significantly higher and threatens the stability of the platform.
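
          A simple way to capture comparable before/after numbers for steps 2-3 (namespace assumed to be openshift-ingress; the PromQL line is one possible query, not the customer's exact test flow):

          # point-in-time CPU/memory of the router pods and the infra nodes
          oc adm top pods -n openshift-ingress
          oc adm top nodes -l node-role.kubernetes.io/infra=

          # equivalent CPU rate over the load-test window, via the cluster Prometheus:
          #   sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="openshift-ingress"}[5m]))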
          

      Actual results:

          The cluster is unable to handle peak load requirements and scaling demands.

      Expected results:

          HAProxy resource utilization was expected to increase somewhat, but not this drastically.

      Additional info:

      See the first comment for data points, requested next steps, and the test flow.

      Assignee: Davide Salerno (dsalerno@redhat.com)
      Reporter: Will Russell (rhn-support-wrussell)
      Contributors: Candace Holman, Davide Salerno
      QA Contact: Hongan Li