OpenShift Bugs / OCPBUGS-64625

Pod stuck in ContainerCreating should be debuggable

    • Bug
    • Resolution: Unresolved
    • Minor
    • 4.21
    • Node / Kubelet
    • Quality / Stability / Reliability
    • Low

      Description of problem

      Seen in 4.21 CI:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.21-ocp-e2e-upgrade-aws-ovn-arm64/1983414271299555328/artifacts/ocp-e2e-upgrade-aws-ovn-arm64/gather-extra/artifacts/pods.json | jq '.items[] | select(.metadata.name == "pod-network-to-pod-network-disruption-poller-564f6885f5-q9qjp") | {metadata: (.metadata | {creationTimestamp, deletionTimestamp}), status}'
      {
        "metadata": {
          "creationTimestamp": "2025-10-29T07:18:28Z",
          "deletionTimestamp": "2025-10-29T08:19:53Z"
        },
        "status": {
          "conditions": [
            {
              "lastProbeTime": null,
              "lastTransitionTime": "2025-10-29T08:18:42Z",
              "message": "Eviction API: evicting",
              "reason": "EvictionByEvictionAPI",
              "status": "True",
              "type": "DisruptionTarget"
            },
            {
              "lastProbeTime": null,
              "lastTransitionTime": "2025-10-29T07:18:28Z",
              "observedGeneration": 1,
              "status": "False",
              "type": "PodReadyToStartContainers"
            },
            {
              "lastProbeTime": null,
              "lastTransitionTime": "2025-10-29T07:18:28Z",
              "observedGeneration": 1,
              "status": "True",
              "type": "Initialized"
            },
            {
              "lastProbeTime": null,
              "lastTransitionTime": "2025-10-29T07:18:28Z",
              "message": "containers with unready status: [disruption-poller]",
              "observedGeneration": 1,
              "reason": "ContainersNotReady",
              "status": "False",
              "type": "Ready"
            },
            {
              "lastProbeTime": null,
              "lastTransitionTime": "2025-10-29T07:18:28Z",
              "message": "containers with unready status: [disruption-poller]",
              "observedGeneration": 1,
              "reason": "ContainersNotReady",
              "status": "False",
              "type": "ContainersReady"
            },
            {
              "lastProbeTime": null,
              "lastTransitionTime": "2025-10-29T07:18:28Z",
              "observedGeneration": 1,
              "status": "True",
              "type": "PodScheduled"
            }
          ],
          "containerStatuses": [
            {
              "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2e65bfb548f2a436a57b61ab5485ce145e88ee7c97c0834b48b4733843295fca",
              "imageID": "",
              "lastState": {},
              "name": "disruption-poller",
              "ready": false,
              "restartCount": 0,
              "started": false,
              "state": {
                "waiting": {
                  "reason": "ContainerCreating"
                }
              },
              "volumeMounts": [
                {
                  "mountPath": "/var/log/persistent-logs",
                  "name": "persistent-log-dir"
                },
                {
                  "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount",
                  "name": "kube-api-access-x7tqq",
                  "readOnly": true,
                  "recursiveReadOnly": "Disabled"
                }
              ]
            }
          ],
          "hostIP": "10.0.73.59",
          "hostIPs": [
            {
              "ip": "10.0.73.59"
            }
          ],
          "observedGeneration": 1,
          "phase": "Pending",
          "qosClass": "BestEffort",
          "startTime": "2025-10-29T07:18:28Z"
        }
      }
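
The timestamps in this output show how long the Pod sat without progress. A minimal Python sketch of that arithmetic, using only the three timestamps printed above (the helper name `parse_k8s_time` is made up for illustration):

```python
from datetime import datetime, timezone


def parse_k8s_time(stamp: str) -> datetime:
    """Parse an RFC 3339 UTC timestamp as printed in the Pod JSON."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


# Values copied from the jq output above.
created = parse_k8s_time("2025-10-29T07:18:28Z")   # metadata.creationTimestamp
eviction = parse_k8s_time("2025-10-29T08:18:42Z")  # DisruptionTarget lastTransitionTime
deletion = parse_k8s_time("2025-10-29T08:19:53Z")  # metadata.deletionTimestamp

print("creation to eviction:", eviction - created)
print("creation to deletionTimestamp:", deletion - created)
```

Every condition other than DisruptionTarget last transitioned at the creation time, so for roughly an hour the status carried no new information about what the Pod was waiting on.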
      

      Version-Release number of selected component

      Seen in 4.21. Unclear how this issue presents in 4.20 and earlier.

      How reproducible

      Unclear.

      Steps to Reproduce

      1. Run lots of CI.
      2. Have some Pods get stuck in ContainerCreating.
      3. Have non-experts like me try to use Pod status to understand where in the process the Pods got stuck.

      Actual results

      None of the conditions clearly states what the next step is on the way to Ready=True.

      Expected results

      Clear messaging about what we're waiting for, and what we're seeing instead. Having a message on PodReadyToStartContainers might be a good next step.
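
To make the gap concrete, here is a rough Python sketch that scans a Pod status for conditions that are not True and carry no message, and for waiting containers and whatever explanation they offer. The function names are hypothetical, and the sample status is trimmed from the output in the description. Treating every condition with a status other than "True" as "not yet satisfied" is a simplifying assumption, and it does not fit condition types such as DisruptionTarget.

```python
from typing import Any, Dict, List, Tuple


def silent_conditions(status: Dict[str, Any]) -> List[str]:
    """Return types of conditions that are not True and have no message."""
    return [
        cond["type"]
        for cond in status.get("conditions", [])
        if cond.get("status") != "True" and not cond.get("message")
    ]


def waiting_containers(status: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (name, reason, message) for each container in a waiting state."""
    found = []
    for cs in status.get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting")
        if waiting is not None:
            found.append(
                (cs["name"], waiting.get("reason", ""), waiting.get("message", ""))
            )
    return found


# Trimmed copy of the status shown in the description.
sample_status = {
    "conditions": [
        {"type": "DisruptionTarget", "status": "True",
         "reason": "EvictionByEvictionAPI", "message": "Eviction API: evicting"},
        {"type": "PodReadyToStartContainers", "status": "False"},
        {"type": "Initialized", "status": "True"},
        {"type": "Ready", "status": "False", "reason": "ContainersNotReady",
         "message": "containers with unready status: [disruption-poller]"},
        {"type": "ContainersReady", "status": "False", "reason": "ContainersNotReady",
         "message": "containers with unready status: [disruption-poller]"},
        {"type": "PodScheduled", "status": "True"},
    ],
    "containerStatuses": [
        {"name": "disruption-poller",
         "state": {"waiting": {"reason": "ContainerCreating"}}},
    ],
}

print(silent_conditions(sample_status))
print(waiting_containers(sample_status))
```

On this sample, the only unexplained condition is PodReadyToStartContainers, and the waiting container reports a reason but no message.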

      Additional info

      The KubeContainerWaiting alert is limited to OCP-core namespaces, so it doesn't cover e2e namespaces like e2e-pod-network-disruption-test-sc25z. But riffing on that alert's metric in PromeCIeus:

      max by (namespace, pod, container, reason) (
        kube_pod_container_status_waiting_reason{reason!="CrashLoopBackOff", job="kube-state-metrics"} > 0
        *
        (kube_pod_container_status_waiting_reason{reason!="CrashLoopBackOff", job="kube-state-metrics"} offset 10m > 0)
      )
      

      turns up two Pods that were stuck this way for at least 15m (one eventually recovered, or was successfully deleted):

      {container="disruption-poller", namespace="e2e-pod-network-disruption-test-sc25z", pod="pod-network-to-host-network-disruption-poller-565bbc6fc-w8swj", reason="ContainerCreating"}
      {container="disruption-poller", namespace="e2e-pod-network-disruption-test-sc25z", pod="pod-network-to-pod-network-disruption-poller-564f6885f5-q9qjp", reason="ContainerCreating"}
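
The query above keeps a series only when the same (namespace, pod, container, reason) is waiting both at evaluation time and 10 minutes earlier. A minimal Python sketch of that join, assuming each instant is represented as a plain dict from label tuple to sample value; the function name and the sample data are invented for illustration and are not taken from the CI run:

```python
from typing import Dict, Set, Tuple

# (namespace, pod, container, reason)
Labels = Tuple[str, str, str, str]


def stuck_waiting(now: Dict[Labels, float],
                  earlier: Dict[Labels, float]) -> Set[Labels]:
    """Label sets waiting (value > 0) at both instants, excluding CrashLoopBackOff."""
    return {
        labels
        for labels, value in now.items()
        if labels[3] != "CrashLoopBackOff"
        and value > 0
        and earlier.get(labels, 0) > 0
    }


# Hypothetical samples: pod-a is waiting at both instants, pod-b only now,
# and pod-c is crash-looping and therefore filtered out.
now = {
    ("ns", "pod-a", "app", "ContainerCreating"): 1,
    ("ns", "pod-b", "app", "ContainerCreating"): 1,
    ("ns", "pod-c", "app", "CrashLoopBackOff"): 1,
}
ten_minutes_ago = {
    ("ns", "pod-a", "app", "ContainerCreating"): 1,
    ("ns", "pod-b", "app", "ContainerCreating"): 0,
    ("ns", "pod-c", "app", "CrashLoopBackOff"): 1,
}

print(stuck_waiting(now, ten_minutes_ago))
```

The comparison covers only two instants, so it does not by itself prove the container was waiting continuously in between.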
      

      Shipping a KubeContainerWaiting runbook might be another way to make this kind of issue more debuggable.

              harpatil@redhat.com Harshal Patil
              trking W. Trevor King
              Min Li Min Li