OpenShift Bugs / OCPBUGS-68378

[release-4.19] OCP 4.16.43 - HAProxy MaxConn Limit reached/exceeded after upgrade from 4.14 with no change to workload

    • Quality / Stability / Reliability
    • Critical
    • Rejected
    • NI&D Sprint 281, NI&D Sprint 283, NI&D Sprint 282
    • Proposed
    • Bug Fix
      Before this update, {product-title} 4.16 and later versions failed to respect the `timeout http-keep-alive` setting due to a known upstream HAProxy bug, preventing users from effectively managing connection persistence. This lack of control resulted in inconsistent connection behavior, where long-lived sessions might be terminated unexpectedly or held open longer than intended. With this release, the `HTTPKeepAliveTimeout` tuning option has been integrated into the `IngressController` API, providing a formal way for customers to configure and enforce this specific timeout. As a result, cluster administrators now have the granular control necessary to align connection persistence with specific application needs. (link:https://issues.redhat.com/browse/OCPBUGS-68378[OCPBUGS-68378])
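
      As an illustration only, a minimal sketch of how that tuning option might be set, assuming it is exposed on the default IngressController under spec.tuningOptions as httpKeepAliveTimeout (the release note gives the Go-style name HTTPKeepAliveTimeout; the exact field path, value format, and default should be confirmed against the published IngressController API reference):

          # Hypothetical example: cap keep-alive idle time at 30s on the default IngressController
          oc patch ingresscontroller/default -n openshift-ingress-operator \
            --type=merge -p '{"spec":{"tuningOptions":{"httpKeepAliveTimeout":"30s"}}}'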

      Description of problem:

      After upgrading from 4.14 to 4.16.43, the HAProxy router pods were constantly being reconciled in a CrashLoop state. The HAProxy socket stats showed that the 50k maxconn limit was hit over and over, even after increasing the number of infra nodes.
      
      Customer data indicates that the maxconn limit is reached and exceeded repeatedly. Significantly raising the maxconn value did alleviate the pressure, and additional router pods were needed to relieve the strain and the customer impact.
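
      For reference, a quick way to inspect the live counts is to query the HAProxy stats socket from inside a router pod. This is a sketch only: it assumes the default socket path /var/lib/haproxy/run/haproxy.sock used by the router image and that socat is available in the pod (otherwise a debug container can be used):

          # Compare the current connection count against the configured limit
          oc -n openshift-ingress exec deploy/router-default -- \
            sh -c 'echo "show info" | socat stdio /var/lib/haproxy/run/haproxy.sock' \
            | grep -E 'Maxconn|CurrConns|CumConns'

      A CurrConns value that keeps climbing toward Maxconn across reloads, without a matching increase in request throughput, matches the behavior described above.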
      
      A workaround is in place: the haproxy-router image was reverted back to 4.14.48.
      
      We suspect, and have suggested to the customer, that idle-close-on-response is the primary driver of this problem and have advised an update to 4.16.44; however, there is an open question regarding the build changes and their impact that we need to clarify:
      
      //QUERY regarding versioning
      OpenShift 4.14.48 uses HAProxy 2.6 with "idle-close-on-response" set by default in haproxy.config       # no connections piling up
      OpenShift 4.16.43 uses HAProxy 2.8 with "idle-close-on-response" set by default in haproxy.config       # connections piling up
      OpenShift 4.16.44(+) uses HAProxy 2.8 with "idleConnectionTerminationPolicy: Deferred" by default -> "idle-close-on-response" in haproxy.config    # should behave the same as 4.16.43 (untested)
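
      To confirm which form a given build actually renders, the generated configuration can be checked directly in a router pod. A sketch, assuming the default config path /var/lib/haproxy/conf/haproxy.config used by the router image:

          # Does the rendered defaults section carry the option?
          oc -n openshift-ingress exec deploy/router-default -- \
            grep -n "idle-close-on-response" /var/lib/haproxy/conf/haproxy.config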
      
      
      //Data regarding idle-close-on-response and the idleConnectionTerminationPolicy flag, from https://redhat-internal.slack.com/archives/CCH60A77E/p1757567064997529
      
           The unconditional idle-close-on-response flag was added in 4.14.23 (4.14 bug: https://issues.redhat.com/browse/OCPBUGS-32437).
      
          Versions before 4.14.23 did not have idle-close-on-response, so HAProxy closed idle connections immediately.
          Starting from 4.14.23, idle connections are kept open on old HAProxy processes until the last request-response exchange is done.
      
          The IdleConnectionTerminationPolicy field was backported to 4.16.44 (4.16 bug: https://issues.redhat.com/browse/OCPBUGS-56424).
      
          So, starting from 4.16.44 it is possible to opt out of the idle-close-on-response behavior by setting Immediate as the value (see the sketch after this block). The idle-close-on-response option keeps idle connections open on old HAProxy processes until the last request is received and a response for it is sent back. A new HAProxy process is started on every reload, which is how the HAProxy configuration is updated (new routes, deleted routes, endpoint updates), so each generation of processes adds to the total number of connections.
      
      If idle-close-on-response is on, old processes do not terminate idle connections until the last request is received or until the idle timeout expires (~5 minutes).
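
      A minimal sketch of that opt-out, assuming the field is exposed as spec.idleConnectionTerminationPolicy on the IngressController (the exact API path and accepted values should be confirmed against the 4.16.44 API):

          # Switch the default IngressController back to the pre-4.14.23 behavior (close idle connections immediately on reload)
          oc patch ingresscontroller/default -n openshift-ingress-operator \
            --type=merge -p '{"spec":{"idleConnectionTerminationPolicy":"Immediate"}}'

          # Verify the rendered config no longer carries the option
          oc -n openshift-ingress exec deploy/router-default -- \
            grep -c "idle-close-on-response" /var/lib/haproxy/conf/haproxy.config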
       
      
      //IMPACT: the production cluster is currently stable WITH THE WORKAROUND (rolled-back image) but cannot stay this way indefinitely in a supported state.

      Version-Release number of selected component (if applicable):

          4.16.43

      How reproducible:

          Single environment

      Steps to Reproduce:

          1. Internal replicator pending
          

      Actual results:

      The environment is overwhelmed - see the graph attachment in the first comment on this issue (internal). The maxconn limit is reached very quickly and connections do not close; the total concurrent connection count goes up 5-10x relative to baseline.

      Expected results:

          Cluster platform stability should be maintained: router pods should not hold on to sessions indefinitely, and maxconn should not be reached without a corresponding increase in throughput.

      Additional info:

          Attachments and data details will be shared in the first comment below for analysis and feedback.
      The customer platform is still impacted and is stable only via the rollback workaround. Data to support and confirm 4.16.44+ with idleConnectionTerminationPolicy set is required.

              alebedev@redhat.com Andrey Lebedev
              rhn-support-wrussell Will Russell
              Shudi Li Shudi Li
