OpenShift Bugs / OCPBUGS-43867

Observing performance degradation for an application in OpenShift 4.14 compared to 4.12


    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Major
    • Affects Version/s: 4.14.z
    • Component/s: Node / CRI-O
    • Labels: Customer Escalated

      Description of problem:

      The customer reported severe performance degradation for an application on an OCP 4.14.11 cluster; the same application worked fine on a 4.12 cluster.

      Observations:

      • The containers were restarted on the 4.12 cluster as well, but the restart frequency has roughly tripled on the 4.14 cluster.
      • On 4.12 the containers were restarted due to OOM kills; on 4.14 they are restarted due to probe failures.
      • Even after reaching the memory limit, the containers are not OOM killed on 4.14; instead the limit causes probe failures that result in a container restart. Below is the container status collected from the customer's cluster:

       

       containerStatuses:
        - containerID: cri-o://b0e75ac47849a3fbf0cecaa5ab5571770b6cea822e9d986db9df44ad2e30903d
          image: image-registry.openshift-image-registry.svc:5000/finaclecorein/dbs-fincore_mod_ac_finlistval:7663878c92bb782b72792605f02016d26de689fa_sp29_31072023_D-FINACLAD-521
          imageID: image-registry.openshift-image-registry.svc:5000/finaclecorein/dbs-fincore_mod_ac_finlistval@sha256:3cf3a240be4ea4337bdffc0a7b66f92d1fbd1d2d83e20d2847dd6f162a61b788
          lastState:
            terminated:
              containerID: cri-o://17b89e0ecbe593330170ea8fed19abdcc2256ecffe1c26977bd3254f73451ff4
              exitCode: 137
              finishedAt: "2024-10-23T11:15:06Z"
              reason: Error
              startedAt: "2024-10-23T10:53:04Z"
      • Exit code 137 (SIGKILL) can also result from a failed health check, which is why the reason shows as Error; on 4.12, a restarted container showed OOMKilled as the reason instead.
      • There was no change on the application side; the only difference is the OCP version.

      • Looking at container utilization during the issue, it is evident that the containers need more resources, but the customer wants to know why they need more resources on the 4.14 cluster than they did on 4.12.
      • Clarity is therefore needed on the points below (a triage sketch follows this list):
      • Why are the containers not OOM killed even after reaching the memory limit, while probe failures and the resulting restarts are seen instead?
      • The OCP 4.12.9 cluster is based on RHEL 8.x, whereas the OCP 4.14.11 cluster is based on RHEL 9.x. This is one of the biggest differences between the two clusters. Does this have any impact on application performance, especially the CPU and memory utilization of the pods?
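      A minimal triage sketch for distinguishing the two restart causes, assuming a hypothetical pod/container named app-pod/app and standard oc access; everything not taken from the report (names, namespace, node) is a placeholder:

       # Last termination state: reason is OOMKilled for a kernel OOM kill,
       # Error for a SIGKILL sent by the runtime (e.g. after a failed
       # liveness probe); exit code 137 = 128 + 9 (SIGKILL) in both cases.
       oc get pod app-pod -o jsonpath='{.status.containerStatuses[?(@.name=="app")].lastState.terminated}'

       # Kubelet events: "Unhealthy" marks probe failures, "Killing" marks
       # the restart that follows them.
       oc get events --field-selector involvedObject.name=app-pod

       # Kernel log on the node: a real OOM kill leaves an
       # "Out of memory: Killed process ..." entry here.
       oc debug node/<node-name> -- chroot /host sh -c 'journalctl -k | grep -i "out of memory"'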

      Version-Release number of selected component (if applicable):

          4.14.11

      How reproducible:

          The customer has reproduced the issue on multiple clusters.

      Actual results:

          The container is not OOM killed once it reaches the memory limit.

      Expected results:

          The container should be OOM killed when it reaches the memory limit.
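          A minimal reproducer sketch for the expected behavior, assuming cluster access; the pod name oom-test and the UBI image are illustrative choices, not taken from the report:

           oc apply -f - <<'EOF'
           apiVersion: v1
           kind: Pod
           metadata:
             name: oom-test
           spec:
             restartPolicy: Never
             containers:
             - name: fill
               # Illustrative image; any small image providing sh and tail works.
               image: registry.access.redhat.com/ubi9/ubi-minimal
               # tail buffers /dev/zero without bound (the stream has no
               # newlines), so memory usage climbs past the 100Mi limit and
               # the kernel OOM killer should fire.
               command: ["sh", "-c", "tail /dev/zero"]
               resources:
                 limits:
                   memory: 100Mi
           EOF
           # Expected on a healthy cluster: reason OOMKilled, exit code 137.
           oc get pod oom-test -o jsonpath='{.status.containerStatuses[0].state.terminated.reason}'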

      Additional info:

          The cluster is using cgroup v1.
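          A verification sketch for the cgroup setup, assuming node debug access; <node-name> and <pod-slice> are placeholders, while the file names are the standard cgroup v1 memory-controller interface:

           # Filesystem type of /sys/fs/cgroup: tmpfs means cgroup v1,
           # cgroup2fs means cgroup v2.
           oc debug node/<node-name> -- chroot /host stat -fc %T /sys/fs/cgroup

           # Under cgroup v1, memory.failcnt counts how often the limit was
           # hit, and memory.oom_control reports oom_kill_disable, under_oom
           # and (on recent kernels) an oom_kill counter for the cgroup.
           oc debug node/<node-name> -- chroot /host sh -c \
             'cat /sys/fs/cgroup/memory/kubepods.slice/<pod-slice>/memory.failcnt; \
              cat /sys/fs/cgroup/memory/kubepods.slice/<pod-slice>/memory.oom_control'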

              Assignee: Sascha Grunert (sgrunert@redhat.com)
              Reporter: MUHAMMED ASLAM V K (rhn-support-amuhamme)
              QA Contact: Sunil Choudhary