Bug
Resolution: Unresolved
Undefined
None
4.16.z
Critical
None
Rejected
False
Description of problem:
After upgrading from 4.15.32 to 4.16.12, the HAProxy router pods are continually hitting the maximum connection limit and dropping connection requests, causing multiple outages. Prior to the upgrade, traffic hovered around 13k connections, and no new traffic load has been added to the cluster. Increasing the maxconn value bought some time, but connections eventually peaked again, with spikes up to the maximum limit.

From the data available (see the report in the first comment below), the total established connection count associated with the HAProxy processes on one affected host node exceeds 26k in a single snapshot, across 8816 unique IPs, of which only 29 are pod destinations. The access logs show a large number of exits in state `SD`, which implies we are failing to connect successfully and (I suspect) flooding the connection totals with partially opened connections that never complete their transactions with the backend pods. Flow handling appears impacted.

Some small highlights (obfuscated):

$ for pid in `awk '/.*haproxy/{print $2}' ps`; do echo "process $pid connections"; grep "$pid" ps | awk '{print $9}'; grep "ESTABLISHED.*$pid" netstat; done | tee haproxy-pid.out

[wrussell@supportshell-1 sosreport-<name>]$ less haproxy-pid.out | grep ESTABLISHED | wc -l
26484

[wrussell@supportshell-1 haproxy_info]$ for i in $(ls | grep info); do echo $i; cat $i | grep -E 'CurrConns|CumConns'; done
router-default-67cf8fdd69-2qtzm_info.out
CurrConns: 12975
CumConns: 74348
router-default-67cf8fdd69-4vjs9_info.out
CurrConns: 12853
CumConns: 77156
router-default-67cf8fdd69-9gsjv_info.out
CurrConns: 12043
CumConns: 78176
router-default-67cf8fdd69-bjqg4_info.out
CurrConns: 13202
CumConns: 78022
router-default-67cf8fdd69-kffbb_info.out
CurrConns: 13414
CumConns: 78346
router-default-67cf8fdd69-q45d9_info.out
CurrConns: 12809
CumConns: 80010
router-default-67cf8fdd69-rn7cs_info.out
CurrConns: 13208
CumConns: 79196
router-default-67cf8fdd69-vhpmw_info.out
CurrConns: 14290
CumConns: 87056

$ oc logs router-default-67cf8fdd69-l9x4g -c logs | awk '{print $12}' | sort | uniq -c | tee connection_closures.out
6666 --
3728 cD
157 CD
1599 sD
11934 SD

Per the HAProxy documentation on termination flags (https://docs.haproxy.org/2.8/configuration.html#:~:text=The%20most%20common%20termination%20flags%20combinations%20are%20indicated%20below.%20They%20are):

--  Normal termination.
cD  The client did not send nor acknowledge any data for as long as the "timeout client" delay. This is often caused by network failures on the client side, or the client simply leaving the net uncleanly.
SD  The connection to the server died with an error during the data transfer. This usually means that HAProxy has received an RST from the server or an ICMP message from an intermediate equipment while exchanging data with the server. This can be caused by a server crash or by a network issue on an intermediate equipment.
sD  The server did not send nor acknowledge any data for as long as the "timeout server" setting during the data phase. This is often caused by too short timeouts on L4 equipment before the server (firewalls, load-balancers, ...), as well as keep-alive sessions maintained between the client and the server expiring first on HAProxy.
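For reference, a minimal sketch of how the per-pod CurrConns/CumConns figures above could be re-checked live against the router pods. It assumes the router image ships socat and that the admin stats socket sits at /var/lib/haproxy/run/haproxy.sock; both are assumptions, so adjust to the actual image and haproxy.config:

for pod in $(oc -n openshift-ingress get pods -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default -o name); do
  echo "== ${pod} =="
  # "show info" on the stats socket reports the same counters as the haproxy-gather *_info.out files
  oc -n openshift-ingress exec "${pod}" -- sh -c 'echo "show info" | socat stdio /var/lib/haproxy/run/haproxy.sock' | grep -E 'CurrConns|CumConns|Maxconn'
done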
See highlighted sample (cleaned):

2024-11-08T14:53:00.598792423Z 2024-11-08T14:53:00.598673+00:00 infra-node infra-node haproxy[17725]: 10.xx.xx.228:34426 [08/Nov/2024:14:53:00.488] fe_sni~ be_edge_http:route:namespace/pod:podname:container:8080-tcp:172.xx.xx.29:8080 0/0/0/108/110/0/5 200 48686 - - ---- 56695/28275/0/0/0 0/0 hr:{00-<string>-<string>-01|} hs: "GET /v2/Notes?claimNumber=<value> HTTP/1.1" 115 1761 172.xx.xx.15 443 TLS_AES_256_GCM_SHA384 TLSv1.3

2024-11-08T14:53:00.599935357Z 2024-11-08T14:53:00.599872+00:00 infra-node infra-node haproxy[17725]: 10.xx.xx.228:34426 [08/Nov/2024:14:53:00.479] public_ssl be_sni/fe_sni 3/0/120 49408 SD 56694/28368/28314/28314/0 0/0

We successfully connect the client to HAProxy and route the request to the backend, but the call to the backend dies in state SD during the data transfer. I therefore wonder whether the destination pods HAProxy is negotiating with cannot keep up with the traffic load, so connections stack up because they are not being cleared in time.
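To probe that hypothesis, a rough sketch for seeing which backends accumulate the SD terminations, reusing the same field position ($12) as the tally above; the field numbering depends on how the logs were captured, and the be_ name pattern is an assumption, so treat both as adjustable:

# keep only SD terminations, then tally the backend names they were routed to
oc logs router-default-67cf8fdd69-l9x4g -c logs | awk '$12 == "SD"' | grep -o 'be_[^ ]*' | sort | uniq -c | sort -rn | head -20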
Version-Release number of selected component (if applicable):
OCP 4.16.12 (Started after upgrade from 4.15.32)
How reproducible:
Continual/ongoing after upgrade
Steps to Reproduce:
1. Upgrade the cluster to 4.16.12.
2. Observe connection counts spike toward the maximum connection limit without any additional traffic load (see the observation sketch after these steps).
3. Observe in the access logs a tremendous uptick in unexpected closure codes alongside the connection spikes.
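A small observation sketch for step 2, mirroring the netstat counting in the description: it counts established connections owned by haproxy processes on an affected infra node over time. It assumes ss is available under /host in the debug shell, and <infra-node> is a placeholder:

oc debug node/<infra-node> -- chroot /host sh -c 'for i in $(seq 1 10); do date; ss -tnp state established | grep -c haproxy; sleep 60; done'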
Actual results:
Connection flow is obstructed: client connections stack up on the frontend side and fill the connection limit buffers, eventually hitting maxconn and dropping requests.
Expected results:
Traffic flow should be maintained; connections should not stack up or exhaust the connection limit.
Additional info:
See comments below for additional context, uploads, data points, and analysis (internal).
haproxy-gather from: https://access.redhat.com/solutions/6987555#gather
traffic breakdown by haproxy pid from: https://access.redhat.com/solutions/7082862
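For reference, the interim maxconn increase mentioned in the description can be applied through the IngressController tuningOptions, assuming spec.tuningOptions.maxConnections is the knob in use on 4.16; the value below is illustrative only, and this is a stopgap while the SD terminations are investigated, not a fix:

oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge -p '{"spec":{"tuningOptions":{"maxConnections":50000}}}'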