Bug
Resolution: Not a Bug
Normal
4.18.z
Quality / Stability / Reliability
Moderate
Description of problem:
On nodes where an eviction is taking place, we sometimes see pods reporting ContainerStatusUnknown instead of a status indicating that they exceeded their GPU limit, which is the reason they were killed. Example container status from an affected pod:
  - image: quay.globalapp.mindray.com/public/vllm/vllm-openai:v0.11.0_2025-10-29_2
    imageID: ""
    lastState: {}
    name: vllm-qwen-vl-full-syq
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        exitCode: 137
        finishedAt: null
        message: The container could not be located when the pod was terminated
        reason: ContainerStatusUnknown
        startedAt: null
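As a sketch (not part of the original report), other pods currently showing this terminated reason can be listed with jq by filtering on the container status:

$ oc get pods -A -o json | jq -r '.items[]
    | select(any(.status.containerStatuses[]?; .state.terminated.reason == "ContainerStatusUnknown"))
    | .metadata.namespace + "/" + .metadata.name + "  " + .status.phase + "  " + (.status.reason // "")'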
$ oc get ev
2h22m Warning FailedCreate replicaset/vllm-qwen3-235b-a22b-fp8-7d89f69555 Error creating: pods "vllm-qwen3-235b-a22b-fp8-7d89f69555-m7jwr" is forbidden: exceeded quota: example, requested: pods=1, used: pods=4, limited: pods=4
2h22m Warning FailedCreate replicaset/vllm-qwen3-235b-a22b-fp8-7d89f69555 Error creating: pods "vllm-qwen3-235b-a22b-fp8-7d89f69555-jghmh" is forbidden: exceeded quota: example, requested: pods=1, used: pods=4, limited: pods=4
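If needed, the quota referenced in these events can be inspected as follows (the namespace is not included in the report, so it is shown as a placeholder):

$ oc describe resourcequota example -n <namespace>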
$ oc get po vllm-qwen-vl-full-syq-d98d55f9-zl46x -oyaml
    state:
      terminated:
        exitCode: 137
        finishedAt: null
        message: The container could not be located when the pod was terminated
        reason: ContainerStatusUnknown
        startedAt: null
  message: 'Pod was rejected: Allocate failed due to requested number of devices unavailable
    for nvidia.com/gpu. Requested: 4, Available: 0, which is unexpected'
  phase: Failed
  qosClass: BestEffort
  reason: UnexpectedAdmissionError
  startTime: "2025-10-30T05:40:04Z"
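For reference, a minimal pod spec of the shape involved here would look roughly like the sketch below; the pod name is hypothetical, and only the image and the GPU count of 4 are taken from the report:

$ cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-evict-repro   # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: vllm
    image: quay.globalapp.mindray.com/public/vllm/vllm-openai:v0.11.0_2025-10-29_2
    resources:
      limits:
        nvidia.com/gpu: "4"   # matches "Requested: 4" in the admission error above
EOF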
Additional info:
Project dump: https://attachments.access.redhat.com/hydra/rest/cases/04294472/attachments/39053386-2e1f-4681-b10a-a1de0d44aeba?usePresignedUrl=true