  OpenShift Bugs / OCPBUGS-29690

haproxy oom - troubleshoot process


    • +
    • Critical
    • Yes
    • 5
    • Sprint 250, Sprint 251, Sprint 252, Sprint 253
    • 4
    • Rejected
    • False
      * Previously, the load balancing algorithm did not differentiate between active and inactive services when determining weights, and it employed the `random` algorithm excessively in environments with many inactive services or environments routing backends with weight 0. This led to increased memory usage and a higher risk of excessive memory consumption. With this release, changes are made to optimize traffic direction towards active services only and prevent unnecessary use of the `random` algorithm with higher weights, reducing the potential for excessive memory consumption. (link:https://issues.redhat.com/browse/OCPBUGS-29690[*OCPBUGS-29690*])
    • Bug Fix
    • Done

      Description of problem:

    Routers are restarting due to memory issues.

      Version-Release number of selected component (if applicable):

          OCP 4.12.45

      How reproducible:

          not easy

      Routers restart due to memory issues:
      ~~~
      3h40m       Warning   ProbeError   pod/router-default-56c9f67f66-j8xwn                        Readiness probe error: Get "http://localhost:1936/healthz/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)...
      3h40m       Warning   Unhealthy    pod/router-default-56c9f67f66-j8xwn                        Readiness probe failed: Get "http://localhost:1936/healthz/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      3h40m       Warning   ProbeError   pod/router-default-56c9f67f66-j8xwn                        Liveness probe error: Get "http://localhost:1936/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)...
      3h40m       Warning   Unhealthy    pod/router-default-56c9f67f66-j8xwn                        Liveness probe failed: Get "http://localhost:1936/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      3h40m       Normal    Killing      pod/router-default-56c9f67f66-j8xwn                        Container router failed liveness probe, will be restarted
      3h40m       Warning   ProbeError   pod/router-default-56c9f67f66-j8xwn                        Readiness probe error: HTTP probe failed with statuscode: 500...
      3h40m       Warning   Unhealthy    pod/router-default-56c9f67f66-j8xwn                        Readiness probe failed: HTTP probe failed with statuscode: 500
      ~~~
      
      The node hosts only the router replica, and Prometheus confirms that the routers consume all of the memory in a short period of time (~20G within an hour).
      
      At some point, the number of haproxy processes increases, eventually consuming all memory resources and causing a service disruption in a production environment.
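      A hedged way to confirm the process growth (the pod name below is the example from the events above, and the commands assume `pgrep` and `ps` are available in the router image; adjust for the affected replica):

      ~~~
      # Count haproxy processes inside a router pod
      oc -n openshift-ingress exec router-default-56c9f67f66-j8xwn -- pgrep -c haproxy

      # List haproxy processes with resident memory and age, largest first
      oc -n openshift-ingress exec router-default-56c9f67f66-j8xwn -- ps -o pid,rss,etime,args -C haproxy --sort=-rss
      ~~~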
      
      Because the console is one of the services with the highest activity according to the router stats, the customer has so far been deleting the console pod, which decreases the haproxy process count from 45 to 12.
      
      The customer wants guidance on how to identify the process that is consuming the memory; haproxy monitoring is enabled, but no dashboard is available.
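      As a sketch for narrowing down the memory consumer without a dashboard, the standard cadvisor metrics can be queried per pod (metric and label names below are the usual cadvisor ones; verify them against the cluster's Prometheus):

      ~~~
      # Working-set memory per router pod
      sum by (pod) (container_memory_working_set_bytes{namespace="openshift-ingress", container="router"})

      # Approximate memory growth rate per router pod over the last hour
      deriv(container_memory_working_set_bytes{namespace="openshift-ingress", container="router"}[1h])
      ~~~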
      
      Router stats from when the router had 8G, 6G, and 3G of memory available have been requested.

      Additional info:

       The customer claims that this happens only in OCP 4.12.45; another active cluster is still on version 4.10.39 and does not show the problem. The upgrade is blocked because of this.
      
      Requested action:
      * `hard-stop-after` might be an option, but the customer expects information about the side effects of this configuration.
      * How can the console connections be reset from haproxy?
      * Is there any documentation about haproxy Prometheus queries?
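      Regarding `hard-stop-after`: one known mechanism is the ingress operator annotation sketched below (the value is an example; verify the annotation and supported values against the documentation for the cluster's OpenShift version). The main side effect is that old haproxy processes are forcefully terminated once the grace period expires, so long-lived connections (for example, websockets such as the console's) that have not drained by then are dropped.

      ~~~
      # Example only: force old haproxy processes to exit 30 minutes after a reload
      oc -n openshift-ingress-operator annotate ingresscontrollers/default \
          ingress.operator.openshift.io/hard-stop-after=30m
      ~~~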

              amcdermo@redhat.com Andrew McDermott
              rhn-support-pescorza Pamela Lizeth Escorza Gil
              Shudi Li Shudi Li
              Melvin Joseph
              Votes: 0
              Watchers: 12

                Created:
                Updated:
                Resolved: