- Bug
- Resolution: Unresolved
- 4.16.z
Description of problem:
A Deployment whose Pod is OOMKilled enters a retriable failure loop, creating thousands of ContainerStatusUnknown pods:

```
state:
  terminated:
    exitCode: 137
    finishedAt: null
    message: The container could not be located when the pod was terminated
    reason: ContainerStatusUnknown
    startedAt: null
```

> The Kubernetes OOMKilled (exit code 137) is a signal sent by the Linux kernel to terminate a process due to an Out Of Memory (OOM) condition. This event usually indicates that a container in a pod has exceeded its memory limit and the system cannot allocate additional memory.

It is considered a retryable error: [3329-retriable-and-non-retriable-failures/README.md?plain=1#L600](https://github.com/kubernetes/enhancements/blob/62039f1b315a370210af4c8a19618855af9d70ae/keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md?plain=1#L600)

The Deployment continued to retry and created roughly 12,000 Pods.

Note: _System memory usage of 1.698G on blue3-w4.blue3.toropsp.com exceeds 95% of the reservation. Reserved memory ensures system processes can function even when the node is fully allocated and protects against workload out-of-memory events impacting the proper functioning of the node. The default reservation is expected to be sufficient for most configurations and should be increased (https://docs.openshift.com/container-platform/latest/nodes/nodes/nodes-nodes-managing.html) when running nodes with high numbers of pods (either due to rate of change or at steady state)._
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Once
Steps to Reproduce:
1. Set up a cluster with the Topology Manager single-numa-node policy.
2. Create a Pod requesting 1 CPU.
3. Create a Deployment requesting CPU/memory that can be allocated (so the Pod schedules) but that triggers the OOM killer once running.
4. Watch new Pods being generated every few seconds/minutes.

You may want to artificially allocate 1.6G on the host to demonstrate the error.
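A minimal Deployment of the kind described in step 3 might look like the following sketch. All names, the image, and the sizes are illustrative assumptions, not taken from the affected cluster; the idea is that the requests fit the node so scheduling succeeds, while the workload then exceeds its memory limit and is OOMKilled:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: oom-repro            # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: oom-repro
  template:
    metadata:
      labels:
        app: oom-repro
    spec:
      containers:
      - name: hog
        image: registry.access.redhat.com/ubi9/ubi   # any image with a shell
        # Grow memory without bound so the kernel OOM-kills the container
        # (exit code 137) shortly after it starts.
        command: ["/bin/sh", "-c", "tail /dev/zero"]
        resources:
          requests:
            cpu: "1"
            memory: 64Mi
          limits:
            cpu: "1"
            memory: 64Mi
```

With a manifest like this, the Deployment's ReplicaSet keeps replacing the failed Pod, which is what produces the flood of ContainerStatusUnknown pods described above.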
Actual results:
Lots of Pods created (unexpected)
Expected results:
A single failed Pod
Additional info:
I'll provide access to the logs on a per-person basis as it contains extra details.
- is related to:
  - OCPBUGS-44737 Loop while creating pods with deployment using nodeName for scheduling on a particular node that has a NoExecute taint (New)
- relates to:
  - OCPBUGS-5807 ReplicaSet controller continuously creating pods failing due to SysctlForbidden (New)
  - OCPBUGS-42257 DaemonSet is reporting incorrect number of ready pods, causing pod flooding on specific OpenShift Container Platform 4 - Node (New)
  - OCPBUGS-16379 When a pod template in a deployment is specified with a matching `nodename` and a never-matching `nodeSelector` an unlimited number of pods are created (ASSIGNED)