Bug
Resolution: Unresolved
4.16
Description of problem:
While upgrading batches of 500 managedclusters with Image-Based Upgrades (IBU) through RHACM and TALM, the haproxy load balancer that is configured by default for a bare metal cluster in the openshift-kni-infra namespace frequently ran out of connections, despite being tuned for 20,000 connections.
Version-Release number of selected component (if applicable):
Hub OCP - 4.16.3
Spoke Clusters - Originally deployed 4.14.31, then upgraded in sequence to 4.14.32 -> 4.15.20 -> 4.15.21 -> 4.16.1 -> 4.16.3
ACM - 2.11.0-DOWNSTREAM-2024-07-10-21-49-48
TALM - 4.16.0
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
While monitoring the current connections during a CGU batch of 500 SNOs upgrading via IBU to a new OCP version, I observed the oc CLI returning "net/http: TLS handshake timeout". Checking the current connection count via rsh into the active haproxy pod showed the limit had been reached:

    # oc -n openshift-kni-infra rsh haproxy-d16-h10-000-r650
    Defaulted container "haproxy" out of: haproxy, haproxy-monitor, verify-api-int-resolvable (init)
    sh-5.1$ echo "show info" | socat stdio /var/lib/haproxy/run/haproxy.sock | grep CurrConns
    CurrConns: 20000
    sh-5.1$

While capturing this value every 10 or 15 seconds, I observed the number of connections fluctuating widely, for example:

    Thu Aug 8 17:51:57 UTC 2024 CurrConns: 17747
    Thu Aug 8 17:52:02 UTC 2024 CurrConns: 18413
    Thu Aug 8 17:52:07 UTC 2024 CurrConns: 19147
    Thu Aug 8 17:52:12 UTC 2024 CurrConns: 19785
    Thu Aug 8 17:52:18 UTC 2024 CurrConns: 20000
    Thu Aug 8 17:52:23 UTC 2024 CurrConns: 20000
    Thu Aug 8 17:52:28 UTC 2024 CurrConns: 20000
    Thu Aug 8 17:52:33 UTC 2024 CurrConns: 20000

A brand new hub cluster without any spoke clusters and without ACM installed runs between 53 and 56 connections; after installing ACM, the connection count rose to 56-60 connections. In a smaller environment with only 297 managedclusters I observed between 1410 and 1695 connections. I do not have a measurement of approximately how many connections the large environment needs; however, the count clearly fluctuates, and initiating the IBU upgrades appears to spike it to the current default limit, which triggers the timeout error message.
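
For anyone trying to reproduce the measurement, the periodic sampling described above could be scripted roughly as follows. This is only a sketch: the function names conn_usage and sample_loop and the SOCK variable are hypothetical, the socket path is the one shown in this report, and it assumes the "show info" output contains both the CurrConns and Maxconn fields and that socat is available in the haproxy container.

```shell
# Sketch: report HAProxy connection usage from "show info" output.
# SOCK, conn_usage and sample_loop are illustrative names, not part of any product.
SOCK="${SOCK:-/var/lib/haproxy/run/haproxy.sock}"

# Read "show info" text on stdin and print "<CurrConns> <Maxconn> <percent used>".
# Prints nothing if Maxconn is missing or zero.
conn_usage() {
  awk -F': *' '
    $1 == "CurrConns" { cur = $2 }
    $1 == "Maxconn"   { max = $2 }
    END { if (max > 0) printf "%d %d %d\n", cur, max, cur * 100 / max }
  '
}

# Query the live stats socket every N seconds (default 10) and print a
# timestamped usage line. Intended to be run inside the haproxy container.
sample_loop() {
  interval="${1:-10}"
  while true; do
    printf '%s ' "$(date -u)"
    echo "show info" | socat stdio "$SOCK" | conn_usage
    sleep "$interval"
  done
}
```

Running sample_loop 10 from the rsh session would print one line per sample with the current connections, the configured maximum, and the percentage in use, which makes it easier to see how close the proxy is to its limit before the TLS handshake timeouts start.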
links to: RHEA-2024:6122 OpenShift Container Platform 4.18.z bug fix update