- Bug
- Resolution: Done-Errata
- Major
- 4.12.z
Description of problem:
Routers are restarting due to memory issues.
Version-Release number of selected component (if applicable):
OCP 4.12.45
How reproducible:
not easy
Routers restart due to memory issues:
~~~
3h40m  Warning  ProbeError  pod/router-default-56c9f67f66-j8xwn  Readiness probe error: Get "http://localhost:1936/healthz/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)...
3h40m  Warning  Unhealthy   pod/router-default-56c9f67f66-j8xwn  Readiness probe failed: Get "http://localhost:1936/healthz/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
3h40m  Warning  ProbeError  pod/router-default-56c9f67f66-j8xwn  Liveness probe error: Get "http://localhost:1936/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)...
3h40m  Warning  Unhealthy   pod/router-default-56c9f67f66-j8xwn  Liveness probe failed: Get "http://localhost:1936/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
3h40m  Normal   Killing     pod/router-default-56c9f67f66-j8xwn  Container router failed liveness probe, will be restarted
3h40m  Warning  ProbeError  pod/router-default-56c9f67f66-j8xwn  Readiness probe error: HTTP probe failed with statuscode: 500...
3h40m  Warning  Unhealthy   pod/router-default-56c9f67f66-j8xwn  Readiness probe failed: HTTP probe failed with statuscode: 500
~~~
The node hosts only the router replica, and Prometheus confirms that the routers consume all of the memory in a short period of time (~20 GB within an hour). At some point the number of haproxy processes increases until all memory resources are exhausted, leading to a service disruption in a production environment. The console is one of the services with the highest activity according to the router stats; so far the customer works around the issue by deleting the console pod, which drops the haproxy process count from 45 to 12.

The customer would like guidance on how to identify the process that is consuming the memory (see the command sketch below); haproxy monitoring is enabled, but no dashboard is available. Router stats captured while the router has 8 GB, 6 GB, and 3 GB of memory available have been requested.
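A minimal sketch of how per-process memory could be checked inside the router pod. The namespace (openshift-ingress), the pod name taken from the events above, and the availability of ps in the router image are assumptions:
~~~
# List the haproxy processes in the router pod, sorted by resident memory (RSS, KiB);
# old processes that linger after config reloads are the usual suspects for growth.
oc -n openshift-ingress exec router-default-56c9f67f66-j8xwn -- \
  ps -o pid,ppid,rss,etime,args -C haproxy --sort=-rss

# Pod-level memory usage as reported by the kubelet, for comparison.
oc adm top pod -n openshift-ingress
~~~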
Additional info:
The customer reports that this happens only on OCP 4.12.45; another active cluster still on 4.10.39 does not show the issue. The upgrade is blocked because of this.

Requested actions:
* hard-stop-after might be an option, but the customer expects information about the side effects of this configuration (a configuration sketch is included below).
* How can the console connections be reset from haproxy?
* Is there any documentation about haproxy Prometheus queries? (See the metrics sketch below.)
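As a hedged sketch of the hard-stop-after option: the annotation key below is the ingress.operator.openshift.io/hard-stop-after key described in the OpenShift ingress tuning documentation, and the 1h value is only an example. The main side effect is that haproxy processes left over from a reload are forcibly terminated after the configured interval, so any long-lived connections (for example console websocket sessions) still pinned to those old processes are dropped at that point:
~~~
# Cluster-wide: terminate leftover haproxy processes 1h after a reload.
oc annotate ingresses.config/cluster \
  ingress.operator.openshift.io/hard-stop-after=1h --overwrite

# Or per ingress controller:
oc -n openshift-ingress-operator annotate ingresscontrollers/default \
  ingress.operator.openshift.io/hard-stop-after=1h --overwrite
~~~
For the Prometheus side, a starting point is to list which haproxy_* metrics the router actually exposes before building queries or dashboards. This sketch assumes the stats/metrics endpoint on port 1936 (as seen in the probe events), the STATS_USERNAME/STATS_PASSWORD environment variables on the router container, and curl being available in the image:
~~~
# Dump the haproxy_* metric names exposed by the router's /metrics endpoint.
oc -n openshift-ingress exec router-default-56c9f67f66-j8xwn -- \
  sh -c 'curl -s -u "$STATS_USERNAME:$STATS_PASSWORD" http://localhost:1936/metrics' \
  | grep -o '^haproxy_[a-z_]*' | sort -u
~~~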
- blocks: OCPBUGS-32977 haproxy oom - troubleshoot process (Closed)
- is cloned by: OCPBUGS-32977 haproxy oom - troubleshoot process (Closed)
- is related to: OCPBUGS-33533 Incorrect Load Balancing Algorithm Applied Due to Mismatched Ports in spec.port.to and Alternate Backend (Closed)
- links to: RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update