OpenShift Bugs / OCPBUGS-38490

"net/http: TLS handshake timeout" due to out of connections in haproxy in openshift-kni-infra while using ACM to Image-based Upgrade 3500+ managedclusters

      Description of problem:

      While upgrading managedclusters in batches of 500 with Image-Based Upgrades (IBU) driven by RHACM and TALM, the haproxy load balancer that is configured by default for a bare metal cluster in the openshift-kni-infra namespace would frequently run out of connections, despite being tuned for 20,000 connections.

      Version-Release number of selected component (if applicable):

      Hub OCP - 4.16.3
      Spoke Clusters - Originally deployed 4.14.31 then upgraded in sequence to 4.14.32 -> 4.15.20 -> 4.15.21 -> 4.16.1 -> 4.16.3
      ACM - 2.11.0-DOWNSTREAM-2024-07-10-21-49-48
      TALM - 4.16.0    

      How reproducible:

          

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

          

      Expected results:

          

      Additional info:

      While a CGU batch of 500 SNOs was running IBU to a new OCP version, I would observe the oc CLI returning "net/http: TLS handshake timeout". Monitoring the current connections via rsh into the active haproxy pod at those times showed the count at the limit:
      
      # oc  -n openshift-kni-infra rsh haproxy-d16-h10-000-r650 
      Defaulted container "haproxy" out of: haproxy, haproxy-monitor, verify-api-int-resolvable (init)
      sh-5.1$ echo "show info" | socat stdio /var/lib/haproxy/run/haproxy.sock | grep CurrConns
      CurrConns: 20000
      sh-5.1$ 
      
      While capturing this value every 10 or 15 seconds I would observe the number of connections fluctuating widely, for example:
      Thu Aug  8 17:51:57 UTC 2024
      CurrConns: 17747
      Thu Aug  8 17:52:02 UTC 2024
      CurrConns: 18413
      Thu Aug  8 17:52:07 UTC 2024
      CurrConns: 19147
      Thu Aug  8 17:52:12 UTC 2024
      CurrConns: 19785
      Thu Aug  8 17:52:18 UTC 2024
      CurrConns: 20000
      Thu Aug  8 17:52:23 UTC 2024
      CurrConns: 20000
      Thu Aug  8 17:52:28 UTC 2024
      CurrConns: 20000
      Thu Aug  8 17:52:33 UTC 2024
      CurrConns: 20000
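
The sampling above could be reproduced with a small loop along the following lines. This is a minimal sketch, not the script used for this report: the function name `sample_conns`, the `HAPROXY_SOCK` variable, and the interval and count arguments are assumptions made for illustration. Only the `socat` query and the socket path are taken from the session shown earlier. It is meant to be run inside the haproxy container.

```shell
# Hypothetical helper: print a UTC timestamp and the CurrConns line
# from the haproxy stats socket, COUNT times, INTERVAL seconds apart.
# The socket path is the one used in the rsh session above.
HAPROXY_SOCK="${HAPROXY_SOCK:-/var/lib/haproxy/run/haproxy.sock}"

sample_conns() {
  interval="$1"   # seconds to wait between samples
  count="$2"      # number of samples to take
  i=0
  while [ "$i" -lt "$count" ]; do
    date -u
    echo "show info" | socat stdio "$HAPROXY_SOCK" | grep '^CurrConns'
    i=$((i + 1))
    # Skip the sleep after the final sample.
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
  done
}

# Example (inside the haproxy container): 12 samples, 5 seconds apart
#   sample_conns 5 12
```

Redirecting the output to a file would make it easier to line up the connection count with the start of each CGU batch.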
      
      A brand new hub cluster without any spoke clusters and without ACM installed runs at 53-56 connections. After installing ACM, I would see the connection count rise to 56-60 connections. In a smaller environment with only 297 managedclusters I observed 1410-1695 connections. I do not have a measurement of approximately how many connections the large environment needs; however, the count clearly fluctuates, and initiating the IBU upgrades seems to spike it to the current default limit, which triggers the timeout error message.
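
As a rough aid for sizing, the figures above can be turned into a back-of-the-envelope estimate. This is only a sketch under a strong assumption that is not confirmed by any measurement in this report: that connections grow linearly with the number of managedclusters on top of the ACM baseline. The function name `estimate_conns` and its arguments are hypothetical.

```shell
# Hypothetical helper: linear extrapolation of the connection count.
# Usage: estimate_conns OBSERVED BASELINE CLUSTERS TARGET
#   OBSERVED  connections seen with CLUSTERS managedclusters
#   BASELINE  connections on a hub with ACM but no managedclusters
#   TARGET    number of managedclusters to estimate for
estimate_conns() {
  awk -v obs="$1" -v base="$2" -v n="$3" -v target="$4" 'BEGIN {
    per_cluster = (obs - base) / n          # assumed constant per cluster
    printf "%.0f\n", base + per_cluster * target
  }'
}

# Low and high ends of the 297-cluster observation, scaled to 3500:
#   estimate_conns 1410 60 297 3500
#   estimate_conns 1695 56 297 3500
```

Under that assumption the two calls give roughly 16,000 and 19,400 steady-state connections for 3500 managedclusters, which would leave little headroom below 20,000 before any IBU-related spike. The real relationship may not be linear, so this should be treated as a starting point for measurement rather than a result.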

       

            mkowalsk@redhat.com Mat Kowalski
            akrzos@redhat.com Alex Krzos
            Sergio Regidor de la Rosa Sergio Regidor de la Rosa