OpenShift Bugs / OCPBUGS-68378

[release-4.19] OCP 4.16.43 - HAProxy MaxConn Limit reached/exceeded after upgrade from 4.14 with no change to workload

    • Quality / Stability / Reliability
    • Critical
    • Rejected
    • NI&D Sprint 281, NI&D Sprint 283, NI&D Sprint 282
    • Proposed
    • Bug Fix
      Before this update, {product-title} 4.16 and later versions failed to respect the `timeout http-keep-alive` setting due to a known upstream HAProxy bug, preventing users from effectively managing connection persistence. This lack of control resulted in inconsistent connection behavior, where long-lived sessions might be terminated unexpectedly or held open longer than intended. With this release, the `HTTPKeepAliveTimeout` tuning option has been integrated into the `IngressController` API, providing a formal way for customers to configure and enforce this specific timeout. As a result, cluster administrators now have the granular control necessary to align connection persistence with specific application needs. (link:https://issues.redhat.com/browse/OCPBUGS-68378[OCPBUGS-68378])
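
      As an illustration only, a minimal sketch of how that tuning option might be set, assuming it is exposed on the default IngressController under spec.tuningOptions as httpKeepAliveTimeout (the release note gives the Go-style name HTTPKeepAliveTimeout; the exact field path, value format, and default should be confirmed against the published IngressController API reference):

          # Hypothetical example: cap keep-alive idle time at 30s on the default IngressController
          oc patch ingresscontroller/default -n openshift-ingress-operator \
            --type=merge -p '{"spec":{"tuningOptions":{"httpKeepAliveTimeout":"30s"}}}'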

      Description of problem:

      After upgrading from 4.14 to 4.16.43, the HAProxy router pods were constantly being reconciled in a CrashLoop state. The HAProxy socket stats showed that the 50k maxconn limit was hit over and over, even after increasing the number of infra nodes.
      
      Customer data indicates that the maxconn limit is reached and exceeded repeatedly. Significantly raising the maxconn value did alleviate the pressure, and additional router pods were needed to relieve the strain and the customer impact.
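
      For reference, a quick way to inspect the live counts is to query the HAProxy stats socket from inside a router pod. This is a sketch only: it assumes the default socket path /var/lib/haproxy/run/haproxy.sock used by the router image and that socat is available in the pod (otherwise a debug container can be used):

          # Compare the current connection count against the configured limit
          oc -n openshift-ingress exec deploy/router-default -- \
            sh -c 'echo "show info" | socat stdio /var/lib/haproxy/run/haproxy.sock' \
            | grep -E 'Maxconn|CurrConns|CumConns'

      A CurrConns value that keeps climbing toward Maxconn across reloads, without a matching increase in request throughput, matches the behavior described above.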
      
      A workaround is in place: the haproxy-router image was reverted back to 4.14.48.
      
      We suspect, and have suggested to the customer, that idle-close-on-response is the primary driver of this problem and have advised an update to 4.16.44; however, there is an open question regarding the build changes and their impact that we need to clarify:
      
      //QUERY regarding versioning
      OpenShift 4.14.48 uses HAProxy 2.6 with "idle-close-on-response" set by default in haproxy.config       # no connections piling up
      OpenShift 4.16.43 uses HAProxy 2.8 with "idle-close-on-response" set by default in haproxy.config       # connections piling up
      OpenShift 4.16.44(+) uses HAProxy 2.8 with "idleConnectionTerminationPolicy: Deferred" by default -> "idle-close-on-response" in haproxy.config    # should behave the same as 4.16.43 (untested)
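
      To confirm which form a given build actually renders, the generated configuration can be checked directly in a router pod. A sketch, assuming the default config path /var/lib/haproxy/conf/haproxy.config used by the router image:

          # Does the rendered defaults section carry the option?
          oc -n openshift-ingress exec deploy/router-default -- \
            grep -n "idle-close-on-response" /var/lib/haproxy/conf/haproxy.config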
      
      
      //Data regarding idle-close-on-response and the idleConnectionTerminationPolicy flag, from https://redhat-internal.slack.com/archives/CCH60A77E/p1757567064997529
      
           The unconditional idle-close-on-response flag was added in 4.14.23 (4.14 bug: https://issues.redhat.com/browse/OCPBUGS-32437).
      
          Versions before 4.14.23 did not have idle-close-on-response, so HAProxy closed idle connections immediately.
          Starting from 4.14.23, idle connections are kept open on old HAProxy processes until the last request-response exchange is done.
      
          The IdleConnectionTerminationPolicy field was backported to 4.16.44 (4.16 bug: https://issues.redhat.com/browse/OCPBUGS-56424).
      
          So, starting from 4.16.44 it is possible to opt out of the idle-close-on-response behavior by setting Immediate as the value (see the sketch after this block). The idle-close-on-response option keeps idle connections open on old HAProxy processes until the last request is received and a response for it is sent back. A new HAProxy process is started on every reload, which is how the HAProxy configuration is updated (new routes, deleted routes, endpoint updates), so each generation of processes adds to the total number of connections.
      
      If idle-close-on-response is on, old processes do not terminate idle connections until the last request is received or until the idle timeout expires (~5 minutes).
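
      A minimal sketch of that opt-out, assuming the field is exposed as spec.idleConnectionTerminationPolicy on the IngressController (the exact API path and accepted values should be confirmed against the 4.16.44 API):

          # Switch the default IngressController back to the pre-4.14.23 behavior (close idle connections immediately on reload)
          oc patch ingresscontroller/default -n openshift-ingress-operator \
            --type=merge -p '{"spec":{"idleConnectionTerminationPolicy":"Immediate"}}'

          # Verify the rendered config no longer carries the option
          oc -n openshift-ingress exec deploy/router-default -- \
            grep -c "idle-close-on-response" /var/lib/haproxy/conf/haproxy.config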
       
      
      //IMPACT: the production cluster is currently stable WITH THE WORKAROUND (rolled-back image) but cannot stay this way indefinitely in a supported state.

      Version-Release number of selected component (if applicable):

          4.16.43

      How reproducible:

          Single environment

      Steps to Reproduce:

          1. Internal replicator pending
          

      Actual results:

      The environment is overwhelmed - see the graph attachment in the first comment on this issue (internal). The maxconn limit is reached very quickly and connections do not close; the total concurrent connection count goes up 5-10x relative to baseline.

      Expected results:

          Cluster platform stability should be maintained: router pods should not hold on to sessions indefinitely, and maxconn should not be reached without a corresponding increase in throughput.

      Additional info:

          Attachments and data details will be shared in the first comment below for analysis and feedback.
      The customer platform is still impacted and is stable only via the rollback workaround. Data to support and confirm 4.16.44+ with idleConnectionTerminationPolicy set is required.

              alebedev@redhat.com Andrey Lebedev
              rhn-support-wrussell Will Russell
              Shudi Li Shudi Li
