OCPBUGS-32977: haproxy oom - troubleshoot process

    • Critical
    • No
    • 1
    • Sprint 252, Sprint 253
    • 2
    • Rejected
    • False
      * Previously, the load balancing algorithm did not differentiate between active and inactive services when determining weights, and it employed the `random` algorithm excessively in environments with many inactive services or environments that route to backends with weight 0. This led to increased memory usage and a higher risk of excessive memory consumption. With this release, traffic is directed only towards active services and unnecessary use of the `random` algorithm with higher weights is avoided, reducing the potential for excessive memory consumption. (link:https://issues.redhat.com/browse/OCPBUGS-32977[*OCPBUGS-32977*])
      ________________________
      Cause:
      Previously, the load balancing algorithm did not differentiate between active and inactive services when determining weights, and it employed the "random" algorithm excessively in environments with many inactive services or those that route to backends with weight 0.

      Consequence:
      This led to increased memory usage and a higher risk of excessive memory consumption.

      Fix:
      The service filtering logic in load balancing was enhanced: inactive services are now excluded when calculating weights, and when there is only one active service, its weight is set to 1 so that traffic is directed exclusively to it. The template logic was also improved to handle service activity more accurately when selecting the balancing algorithm.

      Results:
      These changes optimize traffic direction towards active services only and prevent unnecessary use of the "random" algorithm with higher weights, thereby reducing the potential for excessive memory consumption.
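      For illustration only, a minimal Go sketch of the filtering behaviour described in the Fix above. The type, function, and service names are invented for this example; the actual change lives in the router's HAProxy configuration template logic, not in this code:
      ~~~
      package main

      import "fmt"

      // service is a hypothetical stand-in for a route's backend service.
      type service struct {
          name   string
          weight int32
          active bool // has ready endpoints and a non-zero weight
      }

      // filterActive excludes inactive services so they no longer influence
      // weight calculation.
      func filterActive(svcs []service) []service {
          out := make([]service, 0, len(svcs))
          for _, s := range svcs {
              if s.active {
                  out = append(out, s)
              }
          }
          return out
      }

      // effectiveConfig returns the services and balance algorithm to render.
      // A single active service gets weight 1, and the costlier "random"
      // algorithm is only chosen when several active services actually need
      // weighted distribution ("roundrobin" is an illustrative fallback here).
      func effectiveConfig(svcs []service) ([]service, string) {
          active := filterActive(svcs)
          if len(active) == 1 {
              active[0].weight = 1
              return active, "roundrobin"
          }
          return active, "random"
      }

      func main() {
          svcs := []service{
              {name: "console", weight: 100, active: true},
              {name: "inactive-canary", weight: 0, active: false},
          }
          cfg, algo := effectiveConfig(svcs)
          fmt.Println(algo, cfg) // roundrobin [{console 1 true}]
      }
      ~~~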
    • Bug Fix
    • Done

      This is a clone of issue OCPBUGS-29690. The following is the description of the original issue:

      Description of problem:

          Routers are restarting due to memory issues

      Version-Release number of selected component (if applicable):

          OCP 4.12.45

      How reproducible:

          Not easy.
      Router restarts due to memory issues:
      ~~~
      3h40m       Warning   ProbeError   pod/router-default-56c9f67f66-j8xwn                        Readiness probe error: Get "http://localhost:1936/healthz/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)...
      3h40m       Warning   Unhealthy    pod/router-default-56c9f67f66-j8xwn                        Readiness probe failed: Get "http://localhost:1936/healthz/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      3h40m       Warning   ProbeError   pod/router-default-56c9f67f66-j8xwn                        Liveness probe error: Get "http://localhost:1936/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)...
      3h40m       Warning   Unhealthy    pod/router-default-56c9f67f66-j8xwn                        Liveness probe failed: Get "http://localhost:1936/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      3h40m       Normal    Killing      pod/router-default-56c9f67f66-j8xwn                        Container router failed liveness probe, will be restarted
      3h40m       Warning   ProbeError   pod/router-default-56c9f67f66-j8xwn                        Readiness probe error: HTTP probe failed with statuscode: 500...
      3h40m       Warning   Unhealthy    pod/router-default-56c9f67f66-j8xwn                        Readiness probe failed: HTTP probe failed with statuscode: 500
      ~~~
      
      The node only hosts the router replica, and from Prometheus it can be verified that the routers are consuming all the memory in a short period of time (~20G within an hour).
      
      At some point, the number of haproxy processes increases and ends up consuming all memory resources, leading to a service disruption in a production environment.
      
      As the console is one of the services with the highest activity according to the router stats, so far the customer has been deleting the console pod, which brings the process count down from 45 to 12.
      
      The customer would like guidance on how to identify the process that is consuming the memory; haproxy monitoring is enabled, but no dashboard is available.
      
      Router stats from when the router had 8G, 6G, and 3G of memory available have been requested.

      Additional info:

       The customer claims that this is happening only in OCP 4.12.45; another active cluster is still on version 4.10.39 and does not show this behavior. The upgrade is blocked because of this.
      
      Requested action:
      * hard-stop-after might be an option, but the customer expects information about the side effects of this configuration.
      * How can the console connection be reset from haproxy?
      * Is there any documentation about haproxy Prometheus queries?

            Andrew McDermott (amcdermo@redhat.com)
            OpenShift Prow Bot (openshift-crt-jira-prow)
            Shudi Li
            Melvin Joseph