OpenShift Bugs / OCPBUGS-48107

Deployment with OOMKilled Pod results in retriable failure loop creating thousands of ContainerStatusUnknown pods


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version/s: 4.16.z

      Description of problem:

      
      Deployment with OOMKilled Pod results in retriable failure loop creating thousands of ContainerStatusUnknown pods
      
      ```
            state:
              terminated:
                exitCode: 137
                finishedAt: null
                message: The container could not be located when the pod was terminated
                reason: ContainerStatusUnknown
                startedAt: null
      ```
      
      > The Kubernetes OOMKilled (Exit Code 137) is a signal sent by the Linux Kernel to terminate a process due to an Out Of Memory (OOM) condition. This event is usually an indication that a container in a pod has exceeded its memory limit and the system cannot allocate additional memory.
      
      It's considered a retriable failure in [KEP-3329: retriable and non-retriable Pod failures](https://github.com/kubernetes/enhancements/blob/62039f1b315a370210af4c8a19618855af9d70ae/keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md?plain=1#L600).
      
      It continues to retry and has created 12,000 Pods.
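      For context, the pod failure policy from KEP-3329 applies to Jobs, not Deployments: a Job can be configured to count or stop retries on exit code 137, while a Deployment's ReplicaSet controller has no equivalent and keeps replacing the failed Pod. A minimal sketch of the Job-side policy (name and image are illustrative, not from this cluster):
      
      ```
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: oom-example                   # illustrative name
      spec:
        backoffLimit: 6
        podFailurePolicy:
          rules:
          # Count an OOMKill (SIGKILL -> exit code 137) against backoffLimit
          # rather than retrying it indefinitely.
          - action: Count
            onExitCodes:
              operator: In
              values: [137]
        template:
          spec:
            restartPolicy: Never            # required for podFailurePolicy
            containers:
            - name: main
              image: registry.example.com/app:latest   # illustrative image
      ```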
      
      Note: 
      _System memory usage of 1.698G on blue3-w4.blue3.toropsp.com exceeds 95% of the reservation. Reserved memory ensures system processes can function even when the node is fully allocated and protects against workload out of memory events impacting the proper functioning of the node. The default reservation is expected to be sufficient for most configurations and should be increased (https://docs.openshift.com/container-platform/latest/nodes/nodes/nodes-nodes-managing.html) when running nodes with high numbers of pods (either due to rate of change or at steady state)._
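      The alert points at the system memory reservation. On OpenShift that reservation is typically raised with a KubeletConfig CR targeting the worker MachineConfigPool; a hedged sketch (the name and the 3Gi value are illustrative and should be sized for the node's pod density):
      
      ```
      apiVersion: machineconfiguration.openshift.io/v1
      kind: KubeletConfig
      metadata:
        name: increase-system-reserved      # illustrative name
      spec:
        machineConfigPoolSelector:
          matchLabels:
            pools.operator.machineconfiguration.openshift.io/worker: ""
        kubeletConfig:
          systemReserved:
            memory: 3Gi                     # illustrative value
      ```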
      
          

      Version-Release number of selected component (if applicable):

      4.16
          

      How reproducible:

      Once
      
          

      Steps to Reproduce:

      1. Setup a cluster with Topology Manager single-numa-node policy
      2. Create a Pod requesting 1 CPU
      3. Create a Deployment requesting enough CPU/memory that it can be scheduled, but exceeds its memory limit at runtime and triggers the OOM killer (i.e., it schedules, then is OOMKilled); see the sketch below
      4. Watch the Pods generate every few seconds/minutes
      
      You may want to artificially allocate 1.6 GiB of memory on the host to demonstrate the error.
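      A minimal reproducer sketch along these lines; image, names, and sizes are illustrative. Requests equal limits so the Pod is Guaranteed QoS, which the single-numa-node Topology Manager policy acts on; dd then allocates a single 512Mi buffer against a 128Mi limit and is OOMKilled:
      
      ```
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: oom-repro                     # illustrative name
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: oom-repro
        template:
          metadata:
            labels:
              app: oom-repro
          spec:
            containers:
            - name: eat-memory
              image: registry.example.com/busybox:latest   # illustrative image
              # dd allocates one 512Mi block buffer, exceeding the 128Mi
              # limit and triggering the kernel OOM killer (exit code 137).
              command: ["dd", "if=/dev/zero", "of=/dev/null", "bs=512M"]
              resources:
                requests:
                  cpu: "1"
                  memory: 128Mi
                limits:
                  cpu: "1"
                  memory: 128Mi
      ```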
      
          

      Actual results:

      Thousands of ContainerStatusUnknown Pods are created (unexpected)
      
          

      Expected results:

      A single failed Pod
      
          

      Additional info:

      I'll provide access to the logs on a per-person basis, as they contain extra details.
          

              Filip Krepinsky (fkrepins@redhat.com)
              Paul Bastide (pbastide_rh)
              Doug Slavens
              Votes: 0
              Watchers: 4