Type: Bug
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: 4.21
Component/s: Node / Kubelet
Labels: None
Activity Type: Quality / Stability / Reliability
Blocked: False
Blocked Reason: None
Story Points: None
Severity: Low
Regression: None
Target Backport Versions: None
Target Version: None
Release Blocker: None
Sprint: None
SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:
PX Impact Score:
Release Note Status: None
Release Note Type: None
Release Note Text: None
Escape Reason: None
Escape Impact: None
Corrective Measures: None
SDLC stage when should've been found: None

Description of problem

Seen in 4.21 CI:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.21-ocp-e2e-upgrade-aws-ovn-arm64/1983414271299555328/artifacts/ocp-e2e-upgrade-aws-ovn-arm64/gather-extra/artifacts/pods.json | jq '.items[] | select(.metadata.name == "pod-network-to-pod-network-disruption-poller-564f6885f5-q9qjp") | {metadata: (.metadata | {creationTimestamp, deletionTimestamp}), status}'
{
  "metadata": {
    "creationTimestamp": "2025-10-29T07:18:28Z",
    "deletionTimestamp": "2025-10-29T08:19:53Z"
  },
  "status": {
    "conditions": [
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2025-10-29T08:18:42Z",
        "message": "Eviction API: evicting",
        "reason": "EvictionByEvictionAPI",
        "status": "True",
        "type": "DisruptionTarget"
      },
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2025-10-29T07:18:28Z",
        "observedGeneration": 1,
        "status": "False",
        "type": "PodReadyToStartContainers"
      },
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2025-10-29T07:18:28Z",
        "observedGeneration": 1,
        "status": "True",
        "type": "Initialized"
      },
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2025-10-29T07:18:28Z",
        "message": "containers with unready status: [disruption-poller]",
        "observedGeneration": 1,
        "reason": "ContainersNotReady",
        "status": "False",
        "type": "Ready"
      },
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2025-10-29T07:18:28Z",
        "message": "containers with unready status: [disruption-poller]",
        "observedGeneration": 1,
        "reason": "ContainersNotReady",
        "status": "False",
        "type": "ContainersReady"
      },
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2025-10-29T07:18:28Z",
        "observedGeneration": 1,
        "status": "True",
        "type": "PodScheduled"
      }
    ],
    "containerStatuses": [
      {
        "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2e65bfb548f2a436a57b61ab5485ce145e88ee7c97c0834b48b4733843295fca",
        "imageID": "",
        "lastState": {},
        "name": "disruption-poller",
        "ready": false,
        "restartCount": 0,
        "started": false,
        "state": {
          "waiting": {
            "reason": "ContainerCreating"
          }
        },
        "volumeMounts": [
          {
            "mountPath": "/var/log/persistent-logs",
            "name": "persistent-log-dir"
          },
          {
            "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount",
            "name": "kube-api-access-x7tqq",
            "readOnly": true,
            "recursiveReadOnly": "Disabled"
          }
        ]
      }
    ],
    "hostIP": "10.0.73.59",
    "hostIPs": [
      {
        "ip": "10.0.73.59"
      }
    ],
    "observedGeneration": 1,
    "phase": "Pending",
    "qosClass": "BestEffort",
    "startTime": "2025-10-29T07:18:28Z"
  }
}
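
The jq filter above inspects one Pod by name. As a rough sketch for finding every container in a waiting state across a whole pods.json dump, something like the following Python could work. The function name waiting_containers and the default of skipping CrashLoopBackOff (mirroring the query under Additional info) are assumptions for illustration, not part of any OpenShift tooling.

```python
import json


def waiting_containers(pod_list, skip_reasons=("CrashLoopBackOff",)):
    """Yield (namespace, pod, container, reason) for containers in a waiting state.

    pod_list is the parsed content of a pods.json dump, i.e. a dict with an
    "items" list of Pod objects. skip_reasons is a hypothetical knob for
    ignoring waiting reasons that are already covered elsewhere.
    """
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        for cs in pod.get("status", {}).get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting")
            if waiting is None:
                continue  # container is running or terminated
            reason = waiting.get("reason", "")
            if reason in skip_reasons:
                continue
            yield (meta.get("namespace"), meta.get("name"), cs.get("name"), reason)


# Usage, assuming pods.json has been downloaded locally:
#   with open("pods.json") as f:
#       for row in waiting_containers(json.load(f)):
#           print(*row)
```

This only reports the waiting reason the kubelet already publishes, so for the Pod above it would still say no more than ContainerCreating.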

Version-Release number of selected component

Seen in 4.21. Unclear how this issue presents in 4.20 and earlier.

How reproducible

Unclear.

Steps to Reproduce

1. Run lots of CI.
2. Have some Pods get stuck in ContainerCreating.
3. Have non-experts like me try to understand where in the process they got stuck using Pod status.

Actual results

None of the conditions clearly says what the next step is on the way to Ready=True. In the status above, PodReadyToStartContainers is False with no reason or message at all.

Expected results

Clear messaging about what we're waiting for, and what we're seeing instead. Having a message on PodReadyToStartContainers might be a good next step.
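
To make the gap concrete, here is a minimal sketch that flags conditions which are False but carry neither a reason nor a message. The helper name silent_false_conditions is hypothetical, and the input is trimmed from the Pod status quoted in this report.

```python
def silent_false_conditions(status):
    """Return the types of conditions that are False with no reason and no message."""
    return [
        c["type"]
        for c in status.get("conditions", [])
        if c.get("status") == "False"
        and not c.get("reason")
        and not c.get("message")
    ]


# Conditions trimmed from the Pod status quoted in this report.
status = {
    "conditions": [
        {"type": "DisruptionTarget", "status": "True",
         "reason": "EvictionByEvictionAPI", "message": "Eviction API: evicting"},
        {"type": "PodReadyToStartContainers", "status": "False"},
        {"type": "Initialized", "status": "True"},
        {"type": "Ready", "status": "False", "reason": "ContainersNotReady",
         "message": "containers with unready status: [disruption-poller]"},
        {"type": "ContainersReady", "status": "False", "reason": "ContainersNotReady",
         "message": "containers with unready status: [disruption-poller]"},
        {"type": "PodScheduled", "status": "True"},
    ]
}

print(silent_false_conditions(status))  # prints ['PodReadyToStartContainers']
```

Ready and ContainersReady do carry a message, but it only restates that the container is unready, so a check like this understates the problem rather than overstating it.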

Additional info

KubeContainerWaiting is limited to OCP-core namespaces, so it doesn't cover e2e namespaces like e2e-pod-network-disruption-test-sc25z. But riffing on that metric in PromeCIeus:

max by (namespace, pod, container, reason) (
  kube_pod_container_status_waiting_reason{reason!="CrashLoopBackOff", job="kube-state-metrics"} > 0
  *
  (kube_pod_container_status_waiting_reason{reason!="CrashLoopBackOff", job="kube-state-metrics"} offset 10m > 0)
)

turns up two Pods that were stuck this way for at least 15m (one eventually recovered, or was successfully deleted):

{container="disruption-poller", namespace="e2e-pod-network-disruption-test-sc25z", pod="pod-network-to-host-network-disruption-poller-565bbc6fc-w8swj", reason="ContainerCreating"}
{container="disruption-poller", namespace="e2e-pod-network-disruption-test-sc25z", pod="pod-network-to-pod-network-disruption-poller-564f6885f5-q9qjp", reason="ContainerCreating"}
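
For anyone who wants to script the same check, the following is a sketch of running the query through the standard Prometheus HTTP API instant-query endpoint and flattening the result. It assumes the Prometheus instance exposes /api/v1/query at some base URL; the base URL and the helper names instant_query_url and waiting_series are placeholders, not details taken from this report.

```python
from urllib.parse import urlencode

# The query from this report, unchanged.
QUERY = """max by (namespace, pod, container, reason) (
  kube_pod_container_status_waiting_reason{reason!="CrashLoopBackOff", job="kube-state-metrics"} > 0
  *
  (kube_pod_container_status_waiting_reason{reason!="CrashLoopBackOff", job="kube-state-metrics"} offset 10m > 0)
)"""


def instant_query_url(base_url, query):
    """Build a URL for the Prometheus instant-query endpoint (/api/v1/query)."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": query})


def waiting_series(response):
    """Extract sorted (namespace, pod, container, reason) tuples from a parsed
    instant-query JSON response."""
    if response.get("status") != "success":
        raise ValueError("query failed: %r" % response.get("error"))
    rows = []
    for series in response["data"]["result"]:
        m = series["metric"]
        rows.append((m.get("namespace", ""), m.get("pod", ""),
                     m.get("container", ""), m.get("reason", "")))
    return sorted(rows)


# Usage, assuming a reachable Prometheus at a hypothetical base URL:
#   import json, urllib.request
#   url = instant_query_url("http://localhost:9090", QUERY)
#   with urllib.request.urlopen(url) as resp:
#       for row in waiting_series(json.load(resp)):
#           print(*row)
```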

Shipping a KubeContainerWaiting runbook might be another way to make this kind of issue more debuggable.

Assignee: Harshal Patil

Reporter: W. Trevor King

Need Info From: None

Contributors: None

QA Contact: Min Li

Doc Contact: None

Votes: 0

Watchers: 5

Created: 2025/11/04 5:46 AM

Updated: 2025/11/14 2:54 AM
