Bug
Resolution: Not a Bug
Major
4.14.z
Customer Escalated
Description of problem:
The customer reports severe performance degradation for an application on an OCP 4.14.11 cluster; the same application was working fine on a 4.12 cluster.
Observations:
- The containers also restarted on the 4.12 cluster, but the restart frequency is roughly three times higher on the 4.14 cluster.
- On 4.12 the containers were restarted because of OOM kills, but on 4.14 they are being restarted because of probe failures.
- On 4.14, even after reaching the memory limit the containers are not OOM killed; instead the probes fail and the container is restarted. Below is the container status collected from the customer cluster:
containerStatuses:
- containerID: cri-o://b0e75ac47849a3fbf0cecaa5ab5571770b6cea822e9d986db9df44ad2e30903d
  image: image-registry.openshift-image-registry.svc:5000/finaclecorein/dbs-fincore_mod_ac_finlistval:7663878c92bb782b72792605f02016d26de689fa_sp29_31072023_D-FINACLAD-521
  imageID: image-registry.openshift-image-registry.svc:5000/finaclecorein/dbs-fincore_mod_ac_finlistval@sha256:3cf3a240be4ea4337bdffc0a7b66f92d1fbd1d2d83e20d2847dd6f162a61b788
  lastState:
    terminated:
      containerID: cri-o://17b89e0ecbe593330170ea8fed19abdcc2256ecffe1c26977bd3254f73451ff4
      exitCode: 137
      finishedAt: "2024-10-23T11:15:06Z"
      reason: Error
      startedAt: "2024-10-23T10:53:04Z"
- Exit code 137 corresponds to termination by SIGKILL (128 + 9), so it can also result from a failed health check, which matches the reason being shown as Error. On 4.12, when a container was restarted, the reason used to be shown as OOMKilled.
- There was no change on the application side; the only difference is the OCP version.
- Looking at the utilization of the containers during the issue, it is evident that they need more resources, but the customer wants to know why the containers need more resources on the 4.14 cluster than they did on 4.12.
- So we are looking for clarity on the following points (see also the diagnostic sketch after this list):
  - Why are the containers not OOM killed even after reaching the memory limit? Instead we see probe failures that result in a restart.
  - The OCP 4.12.9 cluster is based on RHEL 8.x, whereas the OCP 4.14.11 cluster is based on RHEL 9.x. This is one of the biggest differences between the two clusters. Does this have any impact on application performance, especially with respect to CPU and memory utilization of the pods?
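For the first point, a diagnostic sketch that may help confirm what is actually happening, assuming cgroup v1 on the nodes; the pod and container names below are hypothetical placeholders and must be replaced with the affected ones:

POD="example-pod"              # placeholder: replace with the affected pod
CONTAINER="example-container"  # placeholder: replace with the affected container

# How did the kubelet classify the last restart (OOMKilled vs Error)?
oc get pod "$POD" -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}'

# Inside the container, compare usage against the cgroup v1 memory limit,
# check whether the limit was ever hit (failcnt > 0), and confirm the kernel
# OOM killer is enabled for the cgroup (oom_kill_disable should be 0).
oc exec "$POD" -c "$CONTAINER" -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes
oc exec "$POD" -c "$CONTAINER" -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes
oc exec "$POD" -c "$CONTAINER" -- cat /sys/fs/cgroup/memory/memory.failcnt
oc exec "$POD" -c "$CONTAINER" -- cat /sys/fs/cgroup/memory/memory.oom_control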
Version-Release number of selected component (if applicable):
4.14.11
How reproducible:
The customer has reproduced the issue on multiple clusters.
Actual results:
The container is not OOM killed once it reaches the memory limit.
Expected results:
The container should be OOM killed if it reaches the memory limit. A minimal reproduction sketch follows.
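To verify the expected behaviour in isolation on the 4.14 cluster, a minimal reproduction sketch, assuming the public polinux/stress test image used in the upstream Kubernetes documentation is reachable from the cluster; adjust the image and namespace as needed. The container tries to allocate 250M against a 100Mi limit, so it should end with reason OOMKilled if the kernel OOM killer fires as expected:

oc apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: oom-test
spec:
  restartPolicy: Never
  containers:
  - name: stress
    image: polinux/stress
    resources:
      requests:
        memory: "50Mi"
      limits:
        memory: "100Mi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "250M", "--vm-hang", "1"]
EOF

# Check how the container terminated:
oc get pod oom-test -o jsonpath='{.status.containerStatuses[0].state.terminated.reason}{"\n"}'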
Additional info:
The cluster nodes are using cgroup v1 (a quick verification sketch follows).
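To confirm the cgroup version on a node (the node name is a placeholder and node debug access is assumed): the filesystem type of /sys/fs/cgroup is reported as "tmpfs" on cgroup v1 and "cgroup2fs" on cgroup v2.

oc debug node/<node-name> -- chroot /host stat -fc %T /sys/fs/cgroup/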