Bug
Resolution: Not a Bug
Major
4.14.z
Customer Escalated
Description of problem:
The customer reports severe performance degradation for an application on an OCP 4.14.11 cluster; the same application was working fine on a 4.12 cluster.
Observations:
- The containers also restarted on the 4.12 cluster, but the restart frequency is roughly three times higher on the 4.14 cluster.
- On 4.12 the containers were restarted because of OOM kills, but on 4.14 they are being restarted because of probe failures.
- On 4.14, even after reaching the memory limit the containers are not OOM killed; instead the probes fail and the container is restarted. Below is the container status collected from the customer cluster:
containerStatuses:
- containerID: cri-o://b0e75ac47849a3fbf0cecaa5ab5571770b6cea822e9d986db9df44ad2e30903d
  image: image-registry.openshift-image-registry.svc:5000/finaclecorein/dbs-fincore_mod_ac_finlistval:7663878c92bb782b72792605f02016d26de689fa_sp29_31072023_D-FINACLAD-521
  imageID: image-registry.openshift-image-registry.svc:5000/finaclecorein/dbs-fincore_mod_ac_finlistval@sha256:3cf3a240be4ea4337bdffc0a7b66f92d1fbd1d2d83e20d2847dd6f162a61b788
  lastState:
    terminated:
      containerID: cri-o://17b89e0ecbe593330170ea8fed19abdcc2256ecffe1c26977bd3254f73451ff4
      exitCode: 137
      finishedAt: "2024-10-23T11:15:06Z"
      reason: Error
      startedAt: "2024-10-23T10:53:04Z"
- Exit code 137 corresponds to termination by SIGKILL (128 + 9), so it can also result from a failed health check, which matches the reason being shown as Error. On 4.12, when a container was restarted, the reason used to be shown as OOMKilled.
- There was no change on the application side; the only difference is the OCP version.
- Looking at the utilization of the containers during the issue, it is evident that they need more resources, but the customer wants to know why the containers need more resources on the 4.14 cluster than they did on 4.12.
- So we are looking for clarity on the following points (see also the diagnostic sketch after this list):
  - Why are the containers not OOM killed even after reaching the memory limit? Instead we see probe failures that result in a restart.
  - The OCP 4.12.9 cluster is based on RHEL 8.x, whereas the OCP 4.14.11 cluster is based on RHEL 9.x. This is one of the biggest differences between the two clusters. Does this have any impact on application performance, especially with respect to CPU and memory utilization of the pods?
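For the first point, a diagnostic sketch that may help confirm what is actually happening, assuming cgroup v1 on the nodes; the pod and container names below are hypothetical placeholders and must be replaced with the affected ones:

POD="example-pod"              # placeholder: replace with the affected pod
CONTAINER="example-container"  # placeholder: replace with the affected container

# How did the kubelet classify the last restart (OOMKilled vs Error)?
oc get pod "$POD" -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}'

# Inside the container, compare usage against the cgroup v1 memory limit,
# check whether the limit was ever hit (failcnt > 0), and confirm the kernel
# OOM killer is enabled for the cgroup (oom_kill_disable should be 0).
oc exec "$POD" -c "$CONTAINER" -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes
oc exec "$POD" -c "$CONTAINER" -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes
oc exec "$POD" -c "$CONTAINER" -- cat /sys/fs/cgroup/memory/memory.failcnt
oc exec "$POD" -c "$CONTAINER" -- cat /sys/fs/cgroup/memory/memory.oom_control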
Version-Release number of selected component (if applicable):
4.14.11
How reproducible:
The customer has reproduced the issue on multiple clusters.
Actual results:
The container is not OOM killed once it reaches the memory limit.
Expected results:
The container should be OOM killed if it reaches the memory limit. A minimal reproduction sketch follows.
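To verify the expected behaviour in isolation on the 4.14 cluster, a minimal reproduction sketch, assuming the public polinux/stress test image used in the upstream Kubernetes documentation is reachable from the cluster; adjust the image and namespace as needed. The container tries to allocate 250M against a 100Mi limit, so it should end with reason OOMKilled if the kernel OOM killer fires as expected:

oc apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: oom-test
spec:
  restartPolicy: Never
  containers:
  - name: stress
    image: polinux/stress
    resources:
      requests:
        memory: "50Mi"
      limits:
        memory: "100Mi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "250M", "--vm-hang", "1"]
EOF

# Check how the container terminated:
oc get pod oom-test -o jsonpath='{.status.containerStatuses[0].state.terminated.reason}{"\n"}'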
Additional info:
The cluster nodes are using cgroup v1 (a quick verification sketch follows).
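To confirm the cgroup version on a node (the node name is a placeholder and node debug access is assumed): the filesystem type of /sys/fs/cgroup is reported as "tmpfs" on cgroup v1 and "cgroup2fs" on cgroup v2.

oc debug node/<node-name> -- chroot /host stat -fc %T /sys/fs/cgroup/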