OpenShift Bugs / OCPBUGS-48107

Deployment with OOMKilled Pod results in retriable failure loop creating thousands of ContainerStatusUnknown pods


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version/s: 4.16.z

      Description of problem:

      
      Deployment with OOMKilled Pod results in retriable failure loop creating thousands of ContainerStatusUnknown pods
      
      ```
            state:
              terminated:
                exitCode: 137
                finishedAt: null
                message: The container could not be located when the pod was terminated
                reason: ContainerStatusUnknown
                startedAt: null
      ```
      
      > The Kubernetes OOMKilled (Exit Code 137) is a signal sent by the Linux Kernel to terminate a process due to an Out Of Memory (OOM) condition. This event is usually an indication that a container in a pod has exceeded its memory limit and the system cannot allocate additional memory.
      
      It's considered a retriable failure in [KEP-3329: retriable and non-retriable Pod failures](https://github.com/kubernetes/enhancements/blob/62039f1b315a370210af4c8a19618855af9d70ae/keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md?plain=1#L600).
      
      It continues to retry and has created 12,000 Pods.
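      For context, the pod failure policy from KEP-3329 applies to Jobs, not Deployments: a Job can be configured to count or stop retries on exit code 137, while a Deployment's ReplicaSet controller has no equivalent and keeps replacing the failed Pod. A minimal sketch of the Job-side policy (name and image are illustrative, not from this cluster):
      
      ```
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: oom-example                   # illustrative name
      spec:
        backoffLimit: 6
        podFailurePolicy:
          rules:
          # Count an OOMKill (SIGKILL -> exit code 137) against backoffLimit
          # rather than retrying it indefinitely.
          - action: Count
            onExitCodes:
              operator: In
              values: [137]
        template:
          spec:
            restartPolicy: Never            # required for podFailurePolicy
            containers:
            - name: main
              image: registry.example.com/app:latest   # illustrative image
      ```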
      
      Note: 
      _System memory usage of 1.698G on blue3-w4.blue3.toropsp.com exceeds 95% of the reservation. Reserved memory ensures system processes can function even when the node is fully allocated and protects against workload out of memory events impacting the proper functioning of the node. The default reservation is expected to be sufficient for most configurations and should be increased (https://docs.openshift.com/container-platform/latest/nodes/nodes/nodes-nodes-managing.html) when running nodes with high numbers of pods (either due to rate of change or at steady state)._
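      The alert points at the system memory reservation. On OpenShift that reservation is typically raised with a KubeletConfig CR targeting the worker MachineConfigPool; a hedged sketch (the name and the 3Gi value are illustrative and should be sized for the node's pod density):
      
      ```
      apiVersion: machineconfiguration.openshift.io/v1
      kind: KubeletConfig
      metadata:
        name: increase-system-reserved      # illustrative name
      spec:
        machineConfigPoolSelector:
          matchLabels:
            pools.operator.machineconfiguration.openshift.io/worker: ""
        kubeletConfig:
          systemReserved:
            memory: 3Gi                     # illustrative value
      ```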
      
          

      Version-Release number of selected component (if applicable):

      4.16
          

      How reproducible:

      Once
      
          

      Steps to Reproduce:

      1. Setup a cluster with Topology Manager single-numa-node policy
      2. Create a Pod requesting 1 CPU
      3. Create a Deployment requesting enough CPU/memory that it can be scheduled, but exceeds its memory limit at runtime and triggers the OOM killer (i.e., it schedules, then is OOMKilled); see the sketch below
      4. Watch the Pods generate every few seconds/minutes
      
      You may want to artificially allocate 1.6 GiB of memory on the host to demonstrate the error.
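      A minimal reproducer sketch along these lines; image, names, and sizes are illustrative. Requests equal limits so the Pod is Guaranteed QoS, which the single-numa-node Topology Manager policy acts on; dd then allocates a single 512Mi buffer against a 128Mi limit and is OOMKilled:
      
      ```
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: oom-repro                     # illustrative name
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: oom-repro
        template:
          metadata:
            labels:
              app: oom-repro
          spec:
            containers:
            - name: eat-memory
              image: registry.example.com/busybox:latest   # illustrative image
              # dd allocates one 512Mi block buffer, exceeding the 128Mi
              # limit and triggering the kernel OOM killer (exit code 137).
              command: ["dd", "if=/dev/zero", "of=/dev/null", "bs=512M"]
              resources:
                requests:
                  cpu: "1"
                  memory: 128Mi
                limits:
                  cpu: "1"
                  memory: 128Mi
      ```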
      
          

      Actual results:

      Thousands of ContainerStatusUnknown Pods are created (unexpected)
      
          

      Expected results:

      A single failed Pod
      
          

      Additional info:

      I'll provide access to the logs on a per-person basis, as they contain extra details.
          

              Filip Krepinsky (fkrepins@redhat.com)
              Paul Bastide (pbastide_rh)
              Doug Slavens
              Votes: 0
              Watchers: 4