OpenShift Bugs / OCPBUGS-44905

OCP 4.16: connection limits overwhelming pods after upgrade with no marked difference in request rate - connection handling failures suspected


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • None
    • 4.16.z
    • Networking / router
    • Critical
    • None
    • Rejected
    • False

      Description of problem:

      After upgrading from 4.15.32 to 4.16.12, the HAProxy router pods continually hit the maximum connection limit and drop connection requests, causing multiple outages. Prior to the upgrade, traffic hovered around 13k connections, and no new traffic load has been added to the cluster. Increasing the maxconn value bought some time, but connections eventually peaked again, with spikes up to the new maximum.
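
      For reference, a maxconn bump on the default router is normally applied through the IngressController tuningOptions; a minimal sketch of that kind of change (the value 50000 is illustrative, not the customer's actual setting):

      # hedged sketch: raise the per-router-pod connection ceiling (illustrative value)
      $ oc -n openshift-ingress-operator patch ingresscontroller/default \
          --type=merge -p '{"spec":{"tuningOptions":{"maxConnections":50000}}}'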
      
      From the data available [see report in the first comment below], the total established connection count associated with HAProxy processes on one affected host node exceeds 26k in a single snapshot, across 8816 unique IPs, of which only 29 are pod destinations. We see a very large number of sessions terminating in state `SD`, which implies the server-side connection is dying during data transfer and (I suspect) flooding the connection totals with partially opened connections that never complete their transactions with the backend pods. Flow handling appears impacted.
      
      Some small highlights here (obfuscated):
      
      $ for pid in `awk '/.*haproxy/{print $2}' ps`; do echo "process $pid connections"; grep "$pid" ps | awk '{print $9}'; grep "ESTABLISHED.*$pid" netstat; done | tee haproxy-pid.out
      
      [wrussell@supportshell-1 sosreport-<name>]$ less haproxy-pid.out | grep ESTABLISHED | wc -l
      26484
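
      The same per-process count can be reproduced live on an affected node (oc debug node/<node>, then chroot /host) rather than from the sosreport ps/netstat; a rough sketch using ss:

      # hedged sketch: count ESTABLISHED TCP connections owned by each haproxy process
      for pid in $(pgrep haproxy); do
        printf 'haproxy pid %s: %s established\n' "$pid" \
          "$(ss -Htnp state established 2>/dev/null | grep -c "pid=$pid,")"
      done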
      
      
      [wrussell@supportshell-1 haproxy_info]$ for i in $(ls | grep info); do echo $i; cat $i | grep -E 'CurrConns|CumConns'; done
      router-default-67cf8fdd69-2qtzm_info.out
      CurrConns: 12975
      CumConns: 74348
      router-default-67cf8fdd69-4vjs9_info.out
      CurrConns: 12853
      CumConns: 77156
      router-default-67cf8fdd69-9gsjv_info.out
      CurrConns: 12043
      CumConns: 78176
      router-default-67cf8fdd69-bjqg4_info.out
      CurrConns: 13202
      CumConns: 78022
      router-default-67cf8fdd69-kffbb_info.out
      CurrConns: 13414
      CumConns: 78346
      router-default-67cf8fdd69-q45d9_info.out
      CurrConns: 12809
      CumConns: 80010
      router-default-67cf8fdd69-rn7cs_info.out
      CurrConns: 13208
      CumConns: 79196
      router-default-67cf8fdd69-vhpmw_info.out
      CurrConns: 14290
      CumConns: 87056
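
      The CurrConns/CumConns counters in the *_info.out files above (collected with the haproxy-gather script linked under Additional info) can also be read live from a router pod's HAProxy admin socket; a sketch against one pod (socket path per the default router config, and assuming socat is available in the router image):

      # hedged sketch: read connection counters straight from one pod's admin socket
      $ oc -n openshift-ingress exec router-default-67cf8fdd69-2qtzm -- \
          bash -c 'echo "show info" | socat stdio /var/lib/haproxy/run/haproxy.sock' \
          | grep -E 'CurrConns|CumConns|Maxconn'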
      
      
      $ oc logs router-default-67cf8fdd69-l9x4g -c logs | awk '{print $12}' | sort | uniq -c | tee connection_closures.out
         6666 --
         3728 cD
          157 CD
         1599 sD
        11934 SD
      
      
      https://docs.haproxy.org/2.8/configuration.html#:~:text=The%20most%20common%20termination%20flags%20combinations%20are%20indicated%20below.%20They%20are
           --   Normal termination.

           cD   The client did not send nor acknowledge any data for as long as the
                "timeout client" delay. This is often caused by network failures on
                the client side, or the client simply leaving the net uncleanly.

           SD   The connection to the server died with an error during the data
                transfer. This usually means that HAProxy has received an RST from
                the server or an ICMP message from an intermediate equipment while
                exchanging data with the server. This can be caused by a server crash
                or by a network issue on an intermediate equipment.

           sD   The server did not send nor acknowledge any data for as long as the
                "timeout server" setting during the data phase. This is often caused
                by too short timeouts on L4 equipment before the server (firewalls,
                load-balancers, ...), as well as keep-alive sessions maintained
                between the client and the server expiring first on HAProxy.
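
      Since the client ip:port appears in both the fe_sni HTTP log line and the public_ssl TCP log line, the two halves of a single client connection can be pulled together with a simple grep; a sketch using the (obfuscated) client from the sample below and the same router pod as above:

      # hedged sketch: show the paired frontend/backend log lines for one client connection
      $ oc logs router-default-67cf8fdd69-l9x4g -c logs | grep '10.xx.xx.228:34426'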
      
      
      See highlighted sample (cleaned):
      
      2024-11-08T14:53:00.598792423Z 2024-11-08T14:53:00.598673+00:00 infra-node infra-node haproxy[17725]: 10.xx.xx.228:34426 [08/Nov/2024:14:53:00.488] fe_sni~ be_edge_http:route:namespace/pod:podname:container:8080-tcp:172.xx.xx.29:8080 0/0/0/108/110/0/5 200 48686 - - ---- 56695/28275/0/0/0 0/0 hr:{00-<string>-<string>-01|} hs: "GET /v2/Notes?claimNumber=<value> HTTP/1.1" 115 1761 172.xx.xx.15 443 TLS_AES_256_GCM_SHA384 TLSv1.3
      2024-11-08T14:53:00.599935357Z 2024-11-08T14:53:00.599872+00:00 infra-node infra-node haproxy[17725]: 10.xx.xx.228:34426 [08/Nov/2024:14:53:00.479] public_ssl be_sni/fe_sni 3/0/120 49408 SD 56694/28368/28314/28314/0 0/0
      
      We successfully connect the client to HAProxy and route the request to the backend. The call to the backend then dies with state SD during the data transfer.
      
      I therefore wonder whether the destination pods HAProxy is forwarding to cannot keep up with the traffic load, so connections stack up on the router because they are not being cleared in time (see the sketch below).
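
      A quick way to sanity-check that hypothesis is to look at the pods behind one of the affected routes for CPU/memory pressure, restarts, or non-Ready endpoints; a minimal sketch with <namespace>, <route> and <service> as placeholders:

      # hedged sketch: check whether the backend pods behind an affected route look saturated
      $ oc describe route <route> -n <namespace>     # find the backing service / target port
      $ oc get endpoints <service> -n <namespace>    # confirm which pod IPs receive traffic
      $ oc adm top pods -n <namespace>               # CPU / memory usage of those pods
      $ oc get pods -n <namespace> -o wide           # restarts, non-Ready pods, node placement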

      Version-Release number of selected component (if applicable):

          OCP 4.16.12 (Started after upgrade from 4.15.32)
      

      How reproducible:

      Continual/ongoing after upgrade    

      Steps to Reproduce:

          1. Upgrade the cluster to 4.16.12
          2. Observe connection counts spike toward the maximum connection limit without any additional traffic load
          3. Observe in the access logs a large uptick in unexpected termination codes alongside the connection spikes
          

      Actual results:

          Connection flow is obstructed: client connections stack up on the frontend side and fill the connection limit buffers, repeatedly exhausting maxconn.

      Expected results:

          Traffic flow should be maintained; connections should be cleared normally without stacking up against the connection limit.

      Additional info:

       See comments below for additional context, uploads, data points, and analysis (internal).
      haproxy-gather from: https://access.redhat.com/solutions/6987555#gather 
      traffic-breakdown by haproxy pid from: https://access.redhat.com/solutions/7082862 

              alebedev@redhat.com Andrey Lebedev
              rhn-support-wrussell Will Russell
              Ishmam Amin Ishmam Amin