Type: Bug
Resolution: Cannot Reproduce
Priority: Major
Severity: Critical
Version: 4.16.z
Impact: Quality / Stability / Reliability
Sprints: NE Sprint 265, NI&D Sprint 266, NI&D Sprint 267
Description of problem:
After upgrading to 4.16.12 from 4.15.32, the HAProxy router pods are continually hitting the maximum connection limit and dropping connection requests, causing multiple outages. Prior to the upgrade, traffic was hovering around 13k connections, and no new traffic load has been added to the cluster. Increasing the maxconn value bought some time, but connections eventually peaked again, with spikes up to the configured maximum.
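For reference, a minimal sketch of one way to raise the router connection limit through the IngressController tuningOptions API (the value shown is illustrative, not necessarily what was applied on this cluster):
$ oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge \
    -p '{"spec":{"tuningOptions":{"maxConnections":50000}}}'
The ingress operator rolls out the router deployment with the new maxconn; as noted above, this only delayed the point at which the limit was reached.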
From the data available [see report in the first comment below], a snapshot of one affected host node shows more than 26k established connections associated with HAProxy processes, across 8816 unique peer IPs, of which only 29 are pod destinations. We see a large number of sessions terminating in state `SD`, which implies the connections to the backends are failing, and (I suspect) the connection totals are being flooded with partially opened connections that never complete their transactions with the backend pods. Flow handling appears impacted.
Some small highlights here (obfuscated):
$ for pid in `awk '/.*haproxy/{print $2}' ps`; do echo "process $pid connections"; grep "$pid" ps | awk '{print $9}'; grep "ESTABLISHED.*$pid" netstat; done | tee haproxy-pid.out
[wrussell@supportshell-1 sosreport-<name>]$ less haproxy-pid.out | grep ESTABLISHED | wc -l
26484
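The 8816 unique peer IP figure can be reproduced from the same sosreport netstat capture; a sketch assuming IPv4 peers and the usual netstat layout with the foreign address in column 5:
$ awk '/ESTABLISHED.*haproxy/ {split($5, a, ":"); if (!(a[1] in seen)) {seen[a[1]] = 1; n++}} END {print n, "unique peer IPs"}' netstat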
[wrussell@supportshell-1 haproxy_info]$ for i in $(ls | grep info); do echo $i; cat $i | grep -E 'CurrConns|CumConns'; done
router-default-67cf8fdd69-2qtzm_info.out
CurrConns: 12975
CumConns: 74348
router-default-67cf8fdd69-4vjs9_info.out
CurrConns: 12853
CumConns: 77156
router-default-67cf8fdd69-9gsjv_info.out
CurrConns: 12043
CumConns: 78176
router-default-67cf8fdd69-bjqg4_info.out
CurrConns: 13202
CumConns: 78022
router-default-67cf8fdd69-kffbb_info.out
CurrConns: 13414
CumConns: 78346
router-default-67cf8fdd69-q45d9_info.out
CurrConns: 12809
CumConns: 80010
router-default-67cf8fdd69-rn7cs_info.out
CurrConns: 13208
CumConns: 79196
router-default-67cf8fdd69-vhpmw_info.out
CurrConns: 14290
CumConns: 87056
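Summing those counters across the gathered files (run from the same haproxy_info directory) gives the aggregate router load:
$ awk '/^CurrConns:/ {sum += $2; n++} END {printf "%d pods, %d total CurrConns, %d avg\n", n, sum, sum / n}' *_info.out
With the eight pods above each holding roughly 13k connections, that is on the order of 105k concurrent connections across the router fleet.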
oc logs router-default-67cf8fdd69-l9x4g -c logs | awk {'print $12'} | sort | uniq -c | tee connection_closures.out
6666 --
3728 cD
157 CD
1599 sD
11934 SD
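To see when the SD terminations ramp up rather than just their totals, the same field can be bucketed per minute; a sketch that assumes the same line layout as the samples further below (leading collection timestamp in field 1, termination state in field 12):
$ oc logs router-default-67cf8fdd69-l9x4g -c logs | awk '$12 == "SD" {print substr($1, 1, 16)}' | sort | uniq -c
A sharp step up in the per-minute counts would indicate when the backends stopped keeping up.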
https://docs.haproxy.org/2.8/configuration.html#:~:text=The%20most%20common%20termination%20flags%20combinations%20are%20indicated%20below.%20They%20are
-- Normal termination.
cD The client did not send nor acknowledge any data for as long as the
"timeout client" delay. This is often caused by network failures on
the client side, or the client simply leaving the net uncleanly.
SD The connection to the server died with an error during the data
transfer. This usually means that HAProxy has received an RST from
the server or an ICMP message from an intermediate equipment while
exchanging data with the server. This can be caused by a server crash
or by a network issue on an intermediate equipment.
sD The server did not send nor acknowledge any data for as long as the
"timeout server" setting during the data phase. This is often caused
by too short timeouts on L4 equipment before the server (firewalls,
load-balancers, ...), as well as keep-alive sessions maintained
between the client and the server expiring first on HAProxy.
See highlighted sample (cleaned):
2024-11-08T14:53:00.598792423Z 2024-11-08T14:53:00.598673+00:00 infra-node infra-node haproxy[17725]: 10.xx.xx.228:34426 [08/Nov/2024:14:53:00.488] fe_sni~ be_edge_http:route:namespace/pod:podname:container:8080-tcp:172.xx.xx.29:8080 0/0/0/108/110/0/5 200 48686 - - ---- 56695/28275/0/0/0 0/0 hr:{00-<string>-<string>-01|} hs: "GET /v2/Notes?claimNumber=<value> HTTP/1.1" 115 1761 172.xx.xx.15 443 TLS_AES_256_GCM_SHA384 TLSv1.3
2024-11-08T14:53:00.599935357Z 2024-11-08T14:53:00.599872+00:00 infra-node infra-node haproxy[17725]: 10.xx.xx.228:34426 [08/Nov/2024:14:53:00.479] public_ssl be_sni/fe_sni 3/0/120 49408 SD 56694/28368/28314/28314/0 0/0
We successfully connect the client to HAProxy and then route that request to the backend. The call to the backend dies with state SD during the data transfer.
I therefore wonder whether the destination pods HAProxy is negotiating with cannot keep up with the traffic load, and as a result the connections stack up because they are not cleared in time. One way to check that from the gathered data is sketched below.
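If a "show stat" CSV was captured alongside the "show info" output (the filename below is hypothetical), per-server current sessions (scur), queued requests (qcur) and connection errors (econ) can be listed, sorted by current sessions:
$ awk -F, 'NR == 1 {sub(/^# /, ""); for (i = 1; i <= NF; i++) col[$i] = i; next}
           $col["svname"] != "FRONTEND" && $col["svname"] != "BACKEND" {print $col["scur"], $col["qcur"], $col["econ"], $col["pxname"] "/" $col["svname"]}' \
      router-default-67cf8fdd69-2qtzm_stat.csv | sort -rn | head -20
Backends with high scur and non-zero qcur/econ would support the theory that the pods behind the routes, not the router itself, are the bottleneck.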
Version-Release number of selected component (if applicable):
OCP 4.16.12 (Started after upgrade from 4.15.32)
How reproducible:
Continual/ongoing after upgrade
Steps to Reproduce:
1. Upgrade cluster to 4.16.12
2. Observe maximum connection counts start to spike toward and reach the configured limit without any additional traffic load (a node-level watch loop for this is sketched after these steps).
3. Observe a significant uptick in unexpected termination codes in the access logs, alongside the connection spikes.
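A simple way to watch step 2 live on an affected infra node (a sketch; run from oc debug node/<infra-node> after chroot /host, node name elided):
while true; do
    printf '%s  ' "$(date -u +%FT%TZ)"
    ss -Htnp state established | grep -c '"haproxy"'    # established connections owned by haproxy processes
    sleep 30
done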
Actual results:
Connection flow is obstructed: client connections stack up on the frontend side and fill the connection limit, leading to dropped requests.
Expected results:
Traffic flow should be maintained; connections should not accumulate toward the maxconn limit when the workload has not changed.
Additional info:
See comments below for additional context, uploads, data points, and analysis (internal). haproxy-gather from: https://access.redhat.com/solutions/6987555#gather ; traffic breakdown by HAProxy PID from: https://access.redhat.com/solutions/7082862
Relates to: OCPBUGS-61858 - OCP 4.16.43 - HAProxy MaxConn Limit reached/exceeded after upgrade from 4.14 with no change to workload (status: POST)