OpenShift Bugs / OCPBUGS-44905

OCP 4.16: connection limits overwhelming pods after upgrade with no marked difference in request rate - connection handling failures suspected


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • None
    • 4.16.z
    • Networking / router
    • Critical
    • None
    • Rejected
    • False

      Description of problem:

      After upgrading from 4.15.32 to 4.16.12, the HAProxy router pods continually hit the maximum connection limit and drop connection requests, causing multiple outages. Prior to the upgrade, traffic hovered around 13k connections, and no new traffic load has been added to the cluster. Increasing the maxconn value bought some time, but connections eventually peaked again, with spikes up to the new maximum.
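
      For reference, a maxconn bump on the default router is normally applied through the IngressController tuningOptions; a minimal sketch of that kind of change (the value 50000 is illustrative, not the customer's actual setting):

      # hedged sketch: raise the per-router-pod connection ceiling (illustrative value)
      $ oc -n openshift-ingress-operator patch ingresscontroller/default \
          --type=merge -p '{"spec":{"tuningOptions":{"maxConnections":50000}}}'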
      
      From the data available [see report in the first comment below], the total established connection count associated with HAProxy processes on one affected host node exceeds 26k in a single snapshot, across 8816 unique IPs, of which only 29 are pod destinations. We see a very large number of sessions terminating in state `SD`, which implies the server-side connection is dying during data transfer and (I suspect) flooding the connection totals with partially opened connections that never complete their transactions with the backend pods. Flow handling appears impacted.
      
      Some small highlights here (obfuscated):
      
      $ for pid in `awk '/.*haproxy/{print $2}' ps`; do echo "process $pid connections"; grep "$pid" ps | awk '{print $9}'; grep "ESTABLISHED.*$pid" netstat; done | tee haproxy-pid.out
      
      [wrussell@supportshell-1 sosreport-<name>]$ less haproxy-pid.out | grep ESTABLISHED | wc -l
      26484
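
      The same per-process count can be reproduced live on an affected node (oc debug node/<node>, then chroot /host) rather than from the sosreport ps/netstat; a rough sketch using ss:

      # hedged sketch: count ESTABLISHED TCP connections owned by each haproxy process
      for pid in $(pgrep haproxy); do
        printf 'haproxy pid %s: %s established\n' "$pid" \
          "$(ss -Htnp state established 2>/dev/null | grep -c "pid=$pid,")"
      done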
      
      
      [wrussell@supportshell-1 haproxy_info]$ for i in $(ls | grep info); do echo $i; cat $i | grep -E 'CurrConns|CumConns'; done
      router-default-67cf8fdd69-2qtzm_info.out
      CurrConns: 12975
      CumConns: 74348
      router-default-67cf8fdd69-4vjs9_info.out
      CurrConns: 12853
      CumConns: 77156
      router-default-67cf8fdd69-9gsjv_info.out
      CurrConns: 12043
      CumConns: 78176
      router-default-67cf8fdd69-bjqg4_info.out
      CurrConns: 13202
      CumConns: 78022
      router-default-67cf8fdd69-kffbb_info.out
      CurrConns: 13414
      CumConns: 78346
      router-default-67cf8fdd69-q45d9_info.out
      CurrConns: 12809
      CumConns: 80010
      router-default-67cf8fdd69-rn7cs_info.out
      CurrConns: 13208
      CumConns: 79196
      router-default-67cf8fdd69-vhpmw_info.out
      CurrConns: 14290
      CumConns: 87056
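
      The CurrConns/CumConns counters in the *_info.out files above (collected with the haproxy-gather script linked under Additional info) can also be read live from a router pod's HAProxy admin socket; a sketch against one pod (socket path per the default router config, and assuming socat is available in the router image):

      # hedged sketch: read connection counters straight from one pod's admin socket
      $ oc -n openshift-ingress exec router-default-67cf8fdd69-2qtzm -- \
          bash -c 'echo "show info" | socat stdio /var/lib/haproxy/run/haproxy.sock' \
          | grep -E 'CurrConns|CumConns|Maxconn'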
      
      
      $ oc logs router-default-67cf8fdd69-l9x4g -c logs | awk '{print $12}' | sort | uniq -c | tee connection_closures.out
         6666 --
         3728 cD
          157 CD
         1599 sD
        11934 SD
      
      
      https://docs.haproxy.org/2.8/configuration.html#:~:text=The%20most%20common%20termination%20flags%20combinations%20are%20indicated%20below.%20They%20are
           --   Normal termination.

           cD   The client did not send nor acknowledge any data for as long as the
                "timeout client" delay. This is often caused by network failures on
                the client side, or the client simply leaving the net uncleanly.

           SD   The connection to the server died with an error during the data
                transfer. This usually means that HAProxy has received an RST from
                the server or an ICMP message from an intermediate equipment while
                exchanging data with the server. This can be caused by a server crash
                or by a network issue on an intermediate equipment.

           sD   The server did not send nor acknowledge any data for as long as the
                "timeout server" setting during the data phase. This is often caused
                by too short timeouts on L4 equipment before the server (firewalls,
                load-balancers, ...), as well as keep-alive sessions maintained
                between the client and the server expiring first on HAProxy.
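
      Since the client ip:port appears in both the fe_sni HTTP log line and the public_ssl TCP log line, the two halves of a single client connection can be pulled together with a simple grep; a sketch using the (obfuscated) client from the sample below and the same router pod as above:

      # hedged sketch: show the paired frontend/backend log lines for one client connection
      $ oc logs router-default-67cf8fdd69-l9x4g -c logs | grep '10.xx.xx.228:34426'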
      
      
      See highlighted sample (cleaned):
      
      2024-11-08T14:53:00.598792423Z 2024-11-08T14:53:00.598673+00:00 infra-node infra-node haproxy[17725]: 10.xx.xx.228:34426 [08/Nov/2024:14:53:00.488] fe_sni~ be_edge_http:route:namespace/pod:podname:container:8080-tcp:172.xx.xx.29:8080 0/0/0/108/110/0/5 200 48686 - - ---- 56695/28275/0/0/0 0/0 hr:{00-<string>-<string>-01|} hs: "GET /v2/Notes?claimNumber=<value> HTTP/1.1" 115 1761 172.xx.xx.15 443 TLS_AES_256_GCM_SHA384 TLSv1.3
      2024-11-08T14:53:00.599935357Z 2024-11-08T14:53:00.599872+00:00 infra-node infra-node haproxy[17725]: 10.xx.xx.228:34426 [08/Nov/2024:14:53:00.479] public_ssl be_sni/fe_sni 3/0/120 49408 SD 56694/28368/28314/28314/0 0/0
      
      We successfully connect the client to HAProxy and route the request to the backend. The call to the backend then dies with state SD during the data transfer.
      
      I therefore wonder whether the destination pods HAProxy is forwarding to cannot keep up with the traffic load, so connections stack up on the router because they are not being cleared in time (see the sketch below).
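
      A quick way to sanity-check that hypothesis is to look at the pods behind one of the affected routes for CPU/memory pressure, restarts, or non-Ready endpoints; a minimal sketch with <namespace>, <route> and <service> as placeholders:

      # hedged sketch: check whether the backend pods behind an affected route look saturated
      $ oc describe route <route> -n <namespace>     # find the backing service / target port
      $ oc get endpoints <service> -n <namespace>    # confirm which pod IPs receive traffic
      $ oc adm top pods -n <namespace>               # CPU / memory usage of those pods
      $ oc get pods -n <namespace> -o wide           # restarts, non-Ready pods, node placement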

      Version-Release number of selected component (if applicable):

          OCP 4.16.12 (Started after upgrade from 4.15.32)
      

      How reproducible:

      Continual/ongoing after upgrade    

      Steps to Reproduce:

          1. Upgrade the cluster to 4.16.12
          2. Observe connection counts spike toward the maximum connection limit without any additional traffic load
          3. Observe in the access logs a large uptick in unexpected termination codes alongside the connection spikes
          

      Actual results:

          Connection flow is obstructed: client connections stack up on the frontend side and fill the connection limit buffers, repeatedly exhausting maxconn.

      Expected results:

          Traffic flow should be maintained; connections should be cleared normally without stacking up against the connection limit.

      Additional info:

       See comments below for additional context, uploads, data points, and analysis (internal).
      haproxy-gather from: https://access.redhat.com/solutions/6987555#gather 
      traffic-breakdown by haproxy pid from: https://access.redhat.com/solutions/7082862 

              alebedev@redhat.com Andrey Lebedev
              rhn-support-wrussell Will Russell
              Ishmam Amin Ishmam Amin