OpenShift Bugs / OCPBUGS-64662

Pods evicted with ContainerStatusUnknown status when the GPU resource limit is exceeded


    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Normal
    • 4.18.z
    • Node / Kubelet
    • Quality / Stability / Reliability
    • Moderate

      Description of problem:

      On nodes where an eviction is taking place, pods intermittently end up with the reason ContainerStatusUnknown, where we would have expected a status indicating that the pods had exceeded their GPU limit, which is why they were killed.

        - image: quay.globalapp.mindray.com/public/vllm/vllm-openai:v0.11.0_2025-10-29_2
          imageID: ""
          lastState: {}
          name: vllm-qwen-vl-full-syq
          ready: false
          restartCount: 0
          started: false
          state:
            terminated:
              exitCode: 137
              finishedAt: null
              message: The container could not be located when the pod was terminated
              reason: ContainerStatusUnknown
              startedAt: null  
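
      Affected pods can be listed by filtering on the Failed phase; the command below is a sketch, with the output columns chosen for illustration:

        # List failed pods cluster-wide together with their terminal status reason
        oc get pods -A --field-selector=status.phase=Failed \
          -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,REASON:.status.reason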
      
      
      $ oc get ev
      2h22m       Warning   FailedCreate               replicaset/vllm-qwen3-235b-a22b-fp8-7d89f69555   Error creating: pods "vllm-qwen3-235b-a22b-fp8-7d89f69555-m7jwr" is forbidden: exceeded quota: example, requested: pods=1, used: pods=4, limited: pods=4
      2h22m       Warning   FailedCreate               replicaset/vllm-qwen3-235b-a22b-fp8-7d89f69555   Error creating: pods "vllm-qwen3-235b-a22b-fp8-7d89f69555-jghmh" is forbidden: exceeded quota: example, requested: pods=1, used: pods=4, limited: pods=4
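
      The FailedCreate events are a related but separate symptom: the ResourceQuota named "example" caps the namespace at 4 pods, so the ReplicaSet cannot create replacement pods while that limit is reached. Current usage can be checked along these lines (the namespace name is a placeholder):

        # Show the quota's hard limits and current usage in the affected namespace
        oc describe resourcequota example -n <namespace>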
      
      $ oc get po vllm-qwen-vl-full-syq-d98d55f9-zl46x -oyaml
          state:
            terminated:
              exitCode: 137
              finishedAt: null
              message: The container could not be located when the pod was terminated
              reason: ContainerStatusUnknown
              startedAt: null
        message: 'Pod was rejected: Allocate failed due to requested number of devices unavailable
          for nvidia.com/gpu. Requested: 4, Available: 0, which is unexpected'
        phase: Failed
        qosClass: BestEffort
        reason: UnexpectedAdmissionError
        startTime: "2025-10-30T05:40:04Z"
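
      For context, the following is a minimal sketch of the kind of pod that runs into this admission failure: only an extended-resource limit for nvidia.com/gpu is set and no cpu/memory requests, which is consistent with the BestEffort QoS class shown above. The name and image are placeholders, not taken from the customer workload:

        apiVersion: v1
        kind: Pod
        metadata:
          name: gpu-consumer-example              # placeholder name
        spec:
          restartPolicy: Never
          containers:
          - name: worker
            image: registry.example.com/vllm/vllm-openai:latest   # placeholder image
            resources:
              limits:
                nvidia.com/gpu: "4"               # extended resources do not affect QoS, so the pod remains BestEffort

      If the kubelet's device manager cannot hand out the requested GPUs at admission time (typically after a kubelet or device-plugin restart, when its view of available devices lags behind the scheduler's), the pod is rejected with UnexpectedAdmissionError before any container is created, and the container statuses are later reported as ContainerStatusUnknown.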

      Additional info:

      Project dump: https://attachments.access.redhat.com/hydra/rest/cases/04294472/attachments/39053386-2e1f-4681-b10a-a1de0d44aeba?usePresignedUrl=true

              Kevin Hannon (rh-ee-kehannon)
              Qihuan Liu (rhn-support-danliu)
              Mallapadi Niranjan