OCPBUGS-61858

OCP 4.16.43 - HAProxy MaxConn Limit reached/exceeded after upgrade from 4.14 with no change to workload


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: 4.16.z
    • Component/s: Networking / router
    • Quality / Stability / Reliability
    • Severity: Critical
    • Release Blocker: Rejected

      Description of problem:

      After upgrading 4.14 -> 4.16.43, the HAProxy pods were constantly reconciled in a CrashLoop state. The HAProxy socket stats showed that we hit the 50k maxconn limit over and over, even as infra nodes were added.
      
      Customer data indicates that the maxconn limit is met and exceeded repeatedly. Significantly raising the maxconn value did alleviate the pressure, and adding additional router pods was necessary to relieve the strain and customer impact; a sketch of both knobs follows below.
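      For reference, a minimal sketch of those two mitigations via the IngressController API (assuming the default IngressController is the affected one; the values 100000 and 4 are illustrative, not the customer's exact settings):
      
          # Raise the per-router connection ceiling (spec.tuningOptions.maxConnections,
          # default 50000; valid explicit range is 2000-2000000)
          oc patch ingresscontroller/default -n openshift-ingress-operator \
            --type=merge -p '{"spec":{"tuningOptions":{"maxConnections":100000}}}'
      
          # Scale out router pods to spread the connection load
          oc patch ingresscontroller/default -n openshift-ingress-operator \
            --type=merge -p '{"spec":{"replicas":4}}'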
      
      A workaround is in place: the haproxy-router image was reverted to 4.14.48.
      
      We suspect that idle-close-on-response is the primary driver of this problem and have advised updating to 4.16.44; however, there is a good question pending regarding build changes and their impact that we need to clarify:
      
      //QUERY regarding versioning
      OpenShift 4.14.48 uses HAProxy 2.6 with "idle-close-on-response" set by default in haproxy.config       # no connections piling up
      OpenShift 4.16.43 uses HAProxy 2.8 with "idle-close-on-response" set by default in haproxy.config       # connections piling up
      OpenShift 4.16.44(+) uses HAProxy 2.8 with "idleConnectionTerminationPolicy: Deferred" by default, which still renders "idle-close-on-response" into haproxy.config   # should behave the same as 4.16.43 (untested)
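      To confirm which behavior a given build actually renders, the option can be checked directly in a router pod (a sketch; assumes the standard config path /var/lib/haproxy/conf/haproxy.config and a placeholder pod name):
      
          # Check whether the rendered config carries the option on this build
          oc exec -n openshift-ingress <router-pod> -- \
            grep -n "idle-close-on-response" /var/lib/haproxy/conf/haproxy.config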
      
      
      //Data regarding idle-close-on-response and the idleConnectionTerminationPolicy flag, from https://redhat-internal.slack.com/archives/CCH60A77E/p1757567064997529
      
           The unconditional idle-close-on-response flag was added in 4.14.23 (4.14 bug: https://issues.redhat.com/browse/OCPBUGS-32437).
      
          Versions 4.14.0 -> 4.14.22 did not have idle-close-on-response, so old HAProxy processes closed idle connections immediately.
          Starting with 4.14.23, idle connections are kept open by old HAProxy processes until the last request/response completes.
      
          The IdleConnectionTerminationPolicy field was backported to 4.16.44 (4.16 bug: https://issues.redhat.com/browse/OCPBUGS-56424).
      
          So, starting with 4.16.44 it is possible to opt out of the idle-close-on-response behavior by setting the value Immediate. The idle-close-on-response option keeps idle connections open on old HAProxy processes until the last request is received and its response is sent back. Each reload spawns a new HAProxy process, and reloads are how HAProxy configuration updates (new routes / deleted routes / endpoint updates) are applied, so the surviving old processes add up to the total number of connections.
      
      With idle-close-on-response on, old processes do not terminate idle connections until the last request is received or until the idle timeout expires (~5 min). A sketch of the opt-out follows below.
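      For reference, a minimal sketch of the opt-out on 4.16.44+ (again assuming the default IngressController is the affected one):
      
          # Opt out of idle-close-on-response: old HAProxy processes close idle
          # connections immediately on reload, restoring the pre-4.14.23 behavior
          oc patch ingresscontroller/default -n openshift-ingress-operator \
            --type=merge -p '{"spec":{"idleConnectionTerminationPolicy":"Immediate"}}'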
       
      
      //IMPACT: the production cluster is currently stable WITH WORKAROUND (version rollback) - it cannot stay this way indefinitely in a supported state.

      Version-Release number of selected component (if applicable):

          4.16.43

      How reproducible:

          Single environment

      Steps to Reproduce:

          1. Internal replicator pending
          

      Actual results:

      The environment is overwhelmed (see the graph attachment in the first comment on this issue, internal): the maxconn limit is reached very quickly and connections do not close. The total concurrent connection count rises 5-10x relative to baseline.
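      For reference, a sketch of how the live connection count can be compared against the configured ceiling via the HAProxy stats socket (assumes the standard admin socket path /var/lib/haproxy/run/haproxy.sock, that socat is available in the router image, and a placeholder pod name):
      
          # Compare the current connection count against the configured ceiling
          oc exec -n openshift-ingress <router-pod> -- bash -c \
            'echo "show info" | socat stdio /var/lib/haproxy/run/haproxy.sock' \
            | grep -E "CurrConns|Maxconn"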

      Expected results:

          Cluster platform stability should be maintained: router pods should not hold on to sessions indefinitely, and the maxconn limit should not be reached without a corresponding increase in throughput.

      Additional info:

          Attachments and data details will be shared in the first comment below for analysis + feedback.
      The customer platform is still impacted and is stable only via the rollback workaround. Data to support and confirm 4.16.44+ with the idleConnectionTerminationPolicy field is required.

              Assignee: Andrey Lebedev (alebedev@redhat.com)
              Reporter: Will Russell (rhn-support-wrussell)
              QA Contact: Shudi Li
              Votes: 0
              Watchers: 9