OpenShift Bugs / OCPBUGS-55635

Job Pod lost when 'SyncLoop DELETE' arrived during container creation

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version: 4.19
    • Component: Node / Kubelet
    • Quality / Stability / Reliability

      Description of problem

      Likely a longstanding race window, but recent Quay 502s seem to cause image-pull delays that trip some kind of Pod-tracking race, which makes the Job controller mad. For example, in this run, KubeJobFailingSRE was firing because the Job had 1 successful Pod and 1 failed Pod:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-rosa-sts-ovn/1916402286188302336/artifacts/e2e-rosa-sts-ovn/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-deployment-validation-operator" and .involvedObject.kind == "Job") | (.firstTimestamp // .eventTime) + " " + (.count | tostring) + " " + (.involvedObject | .kind + " " + .name) + " " + .reason + ": " + .message] | sort[]'
      2025-04-27T09:00:40Z 1 Job 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b922866 SuccessfulCreate: Created pod: 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc
      2025-04-27T09:01:06Z 1 Job 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b922866 SuccessfulCreate: Created pod: 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b9flcl4
      2025-04-27T09:01:51Z 1 Job 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b922866 Completed: Job completed
      

      The successful Pod was still there at gather-time, but the failed Pod was not:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-rosa-sts-ovn/1916402286188302336/artifacts/e2e-rosa-sts-ovn/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.namespace == "openshift-deployment-validation-operator" and .metadata.ownerReferences[0].kind == "Job") | .status.phase + " " + .metadata.name'
      Succeeded 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b9flcl4
      
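      Nothing in the gathered Pod data says who issued that DELETE. If a run also captures kube-apiserver audit logs (the gather-audit-logs step; the local path and file layout below are assumptions, not verified against this run), a filter along these lines would show which client deleted the Pod:

      $ # hypothetical local copy of the audit-log artifacts; adjust the glob to the real layout
      $ zcat -f audit-logs/*kube-apiserver*.log* | jq -r 'select(.objectRef.resource == "pods" and .objectRef.name == "48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc" and .verb == "delete") | .requestReceivedTimestamp + " " + .verb + " by " + .user.username'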

      Checking the journal from the Node hosting the failed Pod:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-rosa-sts-ovn/1916402286188302336/artifacts/e2e-rosa-sts-ovn/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-deployment-validation-operator" and (tostring | contains("48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc"))) | (.firstTimestamp // .eventTime) + " " + (.count | tostring) + " " + (.involvedObject | .kind + " " + .name) + " " + .reason + ": " + .message] | sort[0]'
      2025-04-27T09:00:40.293676Z null Pod 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc Scheduled: Successfully assigned openshift-deployment-validation-operator/48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc to ip-10-0-14-59.us-west-2.compute.internal
      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-rosa-sts-ovn/1916402286188302336/artifacts/e2e-rosa-sts-ovn/gather-extra/artifacts/nodes/ip-10-0-14-59.us-west-2.compute.internal/journal | zgrep 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc
      Apr 27 09:00:40.254884 ip-10-0-14-59 kubenswrapper[2489]: I0427 09:00:40.254852    2489 kubelet.go:2483] "SyncLoop ADD" source="api" pods=["openshift-deployment-validation-operator/48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc"]
      ...
      Apr 27 09:00:42.993105 ip-10-0-14-59 kubenswrapper[2489]: E0427 09:00:42.993072    2489 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"pull\" with ErrImagePull: \"reading blob sha256:f586d2432afc674cb6429f8c674b6ceaeaa6e57b4dbb0d7e8141a23429d1ccd6: fetching blob: received unexpected HTTP status: 502 Bad Gateway\"" pod="openshift-deployment-validation-operator/48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc" podUID="ab3e4971-3427-4efd-ba2b-5337d251904c"
      ...
      Apr 27 09:01:05.143057 ip-10-0-14-59 kubenswrapper[2489]: I0427 09:01:05.143023    2489 kubelet.go:2499] "SyncLoop DELETE" source="api" pods=["openshift-deployment-validation-operator/48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc"]
      Apr 27 09:01:12.410898 ip-10-0-14-59 crio[2456]: time="2025-04-27T09:01:12.410818936Z" level=info msg="Creating container: openshift-deployment-validation-operator/48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc/extract" id=106de6a5-39c9-458e-8190-00f349111731 name=/runtime.v1.RuntimeService/CreateContainer
      ...
      
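      That CRI-O "Creating container" at 09:01:12, seven seconds after the "SyncLoop DELETE", is the race in action: CRI-O is still creating a container for a Pod the kubelet has already been told to delete. To hunt for the same signature elsewhere, a rough pass over locally downloaded journals might look like this (a sketch; it assumes the artifacts/nodes/*/journal layout used above and the log formats shown in the excerpts):

      $ for journal in artifacts/nodes/*/journal; do
      >   zcat -f "${journal}" | awk -v src="${journal}" '
      >     # Remember pods the kubelet was told to delete.
      >     /"SyncLoop DELETE"/ && match($0, /pods=\["[^"]+"\]/) {
      >       deleted[substr($0, RSTART + 7, RLENGTH - 9)] = 1
      >     }
      >     # Flag CRI-O container creations that show up after that DELETE.
      >     /Creating container:/ && match($0, /Creating container: [^"]+/) {
      >       pod = substr($0, RSTART + 20, RLENGTH - 20)
      >       sub("/[^/]*$", "", pod)  # drop the trailing container name, keep namespace/pod
      >       if (pod in deleted) print src ": create-after-delete " pod
      >     }'
      > done

      A hit only means the two messages appeared in that order on one node; confirming it is this race still takes a look at the surrounding kubelet output.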

      So the quay.io 502 seems to have caused the Kubelet to lose track of the sad Pod, and then the Job complains about the loss (I have release#64441 up hoping to collect the Job JSON), and then the alert notices the Job complaint and fires. Or something like that.
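
      Until that lands, the Job controller's view is easy to pull from a live cluster that hits this; a sketch using the standard batch/v1 Job status fields:

      $ oc -n openshift-deployment-validation-operator get jobs -o json \
      >   | jq -r '.items[] | .metadata.name
      >       + " succeeded=" + ((.status.succeeded // 0) | tostring)
      >       + " failed=" + ((.status.failed // 0) | tostring)
      >       + " conditions=" + ([.status.conditions[]?.type] | join(","))'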

      Version-Release number of selected component

      Seen in 4.19 ROSA CI.

      How reproducible

      Seems like a rare race:

      $ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=48h&type=junit&search=KubeJobFailingSRE' | grep 'failures match'
      periodic-ci-openshift-release-master-nightly-4.19-e2e-rosa-sts-ovn (all) - 9 runs, 67% failed, 17% of failures match = 11% impact
      

      Steps to Reproduce

      1. Run lots of CI.
      2. Have stability issues on Quay.
      3. Hope you trip into whatever race is going on here (or try to force it directly, as sketched below).
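
      A more targeted attempt (not verified to actually reproduce this; the namespace, Job name, and image below are made up for illustration) is to force a slow or failing pull and then delete the Pod while the kubelet is still retrying, which mimics the "SyncLoop DELETE" arriving mid-creation:

      $ oc create namespace race-repro
      $ oc -n race-repro create job race-repro --image=quay.io/openshift/does-not-exist:latest -- /bin/true
      $ # wait for the Pod to sit in ErrImagePull/ImagePullBackOff, then delete it mid-retry
      $ oc -n race-repro delete pod -l job-name=race-repro --wait=false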

      Actual results

      Job failures like:

      : [sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
      Run #0: Failed	58s
      {  fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:760]: Unexpected error:
          <errors.aggregate | len:1, cap:1>: 
          promQL query returned unexpected results:
          ALERTS{alertname!~"Watchdog|...",alertstate="firing",severity!="info",namespace!="openshift-e2e-loki"} >= 1
          [
            {
              "metric": {
                "__name__": "ALERTS",
                "alertname": "KubeJobFailingSRE",
                "alertstate": "firing",
                "job_name": "48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b922866",
                "namespace": "openshift-deployment-validation-operator",
                "owner_name": "48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b922866",
                "prometheus": "openshift-monitoring/k8s",
                "severity": "warning"
              },
              "value": [
                1745747190.593,
                "1"
              ]
            }
          ]
      
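      To check the alert directly on a live cluster instead of going through the e2e test, the same ALERTS series can be queried through the monitoring stack (a sketch using the openshift-monitoring thanos-querier route and a bearer token; needs a user allowed to query cluster monitoring):

      $ TOKEN="$(oc whoami -t)"
      $ HOST="$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')"
      $ curl -skG "https://${HOST}/api/v1/query" \
      >     -H "Authorization: Bearer ${TOKEN}" \
      >     --data-urlencode 'query=ALERTS{alertname="KubeJobFailingSRE",alertstate="firing"}'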

      Expected results

      Robust, quiet recovery from short registry hiccups.

        Assignee: Harshal Patil (harpatil@redhat.com)
        Reporter: W. Trevor King (trking)
        QA Contact: Min Li