Type: Bug
Resolution: Done-Errata
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.16
Component/s: Machine Config Operator / platform-baremetal
Labels:

Regression:
None
Blocked:
False
Blocked Reason:
None
RH Private Keywords:
Target Version:

4.18.0

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

While upgrading batches of 500 managedclusters with Image-Based Upgrades (IBU) through RHACM and TALM, the haproxy load balancer that is configured by default for a bare metal cluster in the openshift-kni-infra namespace frequently ran out of connections despite being tuned for 20,000 connections.

Version-Release number of selected component (if applicable):

Hub OCP - 4.16.3
Spoke Clusters - Originally deployed 4.14.31 then upgraded in sequence to 4.14.32 -> 4.15.20 -> 4.15.21 -> 4.16.1 -> 4.16.3
ACM - 2.11.0-DOWNSTREAM-2024-07-10-21-49-48
TALM - 4.16.0

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

Expected results:

Additional info:

While monitoring a CGU batch of 500 SNOs upgrading via IBU to a new OCP version, I observed the oc cli returning "net/http: TLS handshake timeout". Checking the current connections via rsh into the active haproxy pod showed the count at the limit:

# oc  -n openshift-kni-infra rsh haproxy-d16-h10-000-r650 
Defaulted container "haproxy" out of: haproxy, haproxy-monitor, verify-api-int-resolvable (init)
sh-5.1$ echo "show info" | socat stdio /var/lib/haproxy/run/haproxy.sock | grep CurrConns
CurrConns: 20000
sh-5.1$ 
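
The one-off check above can be repeated on a timer to watch the connection count over time. The following is a minimal sketch, not the exact loop used for this report: the function names `show_info` and `sample_conns`, the 5 second default interval, and the assumption that `date` and `sleep` are available in the haproxy container are all illustrative. Only the socket path and the `show info` query come from the command above.

```shell
#!/bin/sh
# Sketch of a CurrConns sampler, intended to run inside the haproxy container.
# SOCK is the stats socket path used in this report; override it if yours differs.
SOCK="${SOCK:-/var/lib/haproxy/run/haproxy.sock}"

# Ask haproxy's runtime API for its "show info" block.
show_info() {
    echo "show info" | socat stdio "$SOCK"
}

# Print a UTC timestamp followed by the CurrConns line for each sample.
#   $1: number of samples, 0 means run until interrupted (hypothetical default)
#   $2: seconds between samples (hypothetical default of 5)
sample_conns() {
    count="${1:-0}"
    interval="${2:-5}"
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        date -u
        show_info | grep '^CurrConns'
        i=$((i + 1))
        # Skip the final sleep once the requested number of samples is taken.
        if [ "$count" -ne 0 ] && [ "$i" -ge "$count" ]; then
            break
        fi
        sleep "$interval"
    done
}
```

After sourcing the file in the container shell, `sample_conns 8 5` would take eight samples five seconds apart, and `sample_conns` with no arguments would run until interrupted.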

While capturing this value every 10 or 15 seconds I observed a high fluctuation in the number of connections, for example:
Thu Aug  8 17:51:57 UTC 2024
CurrConns: 17747
Thu Aug  8 17:52:02 UTC 2024
CurrConns: 18413
Thu Aug  8 17:52:07 UTC 2024
CurrConns: 19147
Thu Aug  8 17:52:12 UTC 2024
CurrConns: 19785
Thu Aug  8 17:52:18 UTC 2024
CurrConns: 20000
Thu Aug  8 17:52:23 UTC 2024
CurrConns: 20000
Thu Aug  8 17:52:28 UTC 2024
CurrConns: 20000
Thu Aug  8 17:52:33 UTC 2024
CurrConns: 20000
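
A capture like the one above can be reduced to a few numbers to show how often the limit was reached. This is a hedged sketch: `summarize_conns` is a hypothetical helper, it assumes the capture alternates timestamp lines and `CurrConns: N` lines as shown, and the limit of 20000 is the value from this report passed in as an argument.

```shell
#!/bin/sh
# Summarize a capture of alternating timestamp / "CurrConns: N" lines read on stdin.
#   $1: connection limit to compare against (20000 in this report)
summarize_conns() {
    limit="${1:-20000}"
    awk -v limit="$limit" '
        /^CurrConns:/ {
            n = $2 + 0                        # numeric value after the label
            if (samples == 0 || n < min) min = n
            if (n > max) max = n
            if (n >= limit) saturated++       # sample sat at (or above) the limit
            samples++
        }
        END {
            printf "samples=%d min=%d max=%d at_limit=%d\n", samples, min, max, saturated
        }
    '
}
```

Fed the eight samples listed above, for example with `summarize_conns 20000 < conns.log` where `conns.log` is a hypothetical file holding that capture, this prints `samples=8 min=17747 max=20000 at_limit=4`.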

A brand new hub cluster without any spoke clusters and without ACM installed runs between 53 and 56 connections. After installing ACM, the connection count rose to 56-60. In a smaller environment with only 297 managedclusters I observed between 1410 and 1695 connections. I do not have a measurement of approximately how many connections the large environment needs; however, the count clearly fluctuates, and initiating the IBU upgrades seems to spike it to the current default limit, triggering the timeout error message.

links to

openshift/machine-config-operator#4531: OCPBUGS-38490: Increase connection limit for cluster loadbalancer

RHEA-2024:6122 OpenShift Container Platform 4.18.z bug fix update

Assignee: Mat Kowalski
Reporter: Alex Krzos
QA Contact: Sergio Regidor de la Rosa
Votes: 0
Watchers: 7
Created: 2024/08/14 3:41 PM
Updated: 2025/02/25 4:46 AM
Resolved: 2025/02/25 4:46 AM
