OpenShift Bugs / OCPBUGS-64662

Pods evicted with ContainerStatusUnknown status when the GPU resource limit is exceeded


    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Normal
    • 4.18.z
    • Node / Kubelet
    • Quality / Stability / Reliability
    • Moderate

      Description of problem:

      On nodes where an eviction is taking place, pods intermittently end up with the reason ContainerStatusUnknown, where we would have expected a status indicating that the pods had exceeded their GPU limit, which is why they were killed.

        - image: quay.globalapp.mindray.com/public/vllm/vllm-openai:v0.11.0_2025-10-29_2
          imageID: ""
          lastState: {}
          name: vllm-qwen-vl-full-syq
          ready: false
          restartCount: 0
          started: false
          state:
            terminated:
              exitCode: 137
              finishedAt: null
              message: The container could not be located when the pod was terminated
              reason: ContainerStatusUnknown
              startedAt: null  
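
      Affected pods can be listed by filtering on the Failed phase; the command below is a sketch, with the output columns chosen for illustration:

        # List failed pods cluster-wide together with their terminal status reason
        oc get pods -A --field-selector=status.phase=Failed \
          -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,REASON:.status.reason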
      
      
      $ oc get ev
      2h22m       Warning   FailedCreate               replicaset/vllm-qwen3-235b-a22b-fp8-7d89f69555   Error creating: pods "vllm-qwen3-235b-a22b-fp8-7d89f69555-m7jwr" is forbidden: exceeded quota: example, requested: pods=1, used: pods=4, limited: pods=4
      2h22m       Warning   FailedCreate               replicaset/vllm-qwen3-235b-a22b-fp8-7d89f69555   Error creating: pods "vllm-qwen3-235b-a22b-fp8-7d89f69555-jghmh" is forbidden: exceeded quota: example, requested: pods=1, used: pods=4, limited: pods=4
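
      The FailedCreate events are a related but separate symptom: the ResourceQuota named "example" caps the namespace at 4 pods, so the ReplicaSet cannot create replacement pods while that limit is reached. Current usage can be checked along these lines (the namespace name is a placeholder):

        # Show the quota's hard limits and current usage in the affected namespace
        oc describe resourcequota example -n <namespace>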
      
      $ oc get po vllm-qwen-vl-full-syq-d98d55f9-zl46x -oyaml
          state:
            terminated:
              exitCode: 137
              finishedAt: null
              message: The container could not be located when the pod was terminated
              reason: ContainerStatusUnknown
              startedAt: null
        message: 'Pod was rejected: Allocate failed due to requested number of devices unavailable
          for nvidia.com/gpu. Requested: 4, Available: 0, which is unexpected'
        phase: Failed
        qosClass: BestEffort
        reason: UnexpectedAdmissionError
        startTime: "2025-10-30T05:40:04Z"
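
      For context, the following is a minimal sketch of the kind of pod that runs into this admission failure: only an extended-resource limit for nvidia.com/gpu is set and no cpu/memory requests, which is consistent with the BestEffort QoS class shown above. The name and image are placeholders, not taken from the customer workload:

        apiVersion: v1
        kind: Pod
        metadata:
          name: gpu-consumer-example              # placeholder name
        spec:
          restartPolicy: Never
          containers:
          - name: worker
            image: registry.example.com/vllm/vllm-openai:latest   # placeholder image
            resources:
              limits:
                nvidia.com/gpu: "4"               # extended resources do not affect QoS, so the pod remains BestEffort

      If the kubelet's device manager cannot hand out the requested GPUs at admission time (typically after a kubelet or device-plugin restart, when its view of available devices lags behind the scheduler's), the pod is rejected with UnexpectedAdmissionError before any container is created, and the container statuses are later reported as ContainerStatusUnknown.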

      Additional info:

      Project dump: https://attachments.access.redhat.com/hydra/rest/cases/04294472/attachments/39053386-2e1f-4681-b10a-a1de0d44aeba?usePresignedUrl=true

              Kevin Hannon (rh-ee-kehannon)
              Qihuan Liu (rhn-support-danliu)
              Mallapadi Niranjan