Bug
Resolution: Unresolved
Normal
4.19
Quality / Stability / Reliability
False
Description of problem
This is likely a longstanding race window, but recent Quay 502s seem to be causing image-pull delays that trip some kind of Pod-tracking race, which ends with the Job controller recording a Pod failure. For example, in this run, KubeJobFailingSRE was firing because the Job had 1 successful Pod and 1 failed Pod:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-rosa-sts-ovn/1916402286188302336/artifacts/e2e-rosa-sts-ovn/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-deployment-validation-operator" and .involvedObject.kind == "Job") | (.firstTimestamp // .eventTime) + " " + (.count | tostring) + " " + (.involvedObject | .kind + " " + .name) + " " + .reason + ": " + .message] | sort[]'
2025-04-27T09:00:40Z 1 Job 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b922866 SuccessfulCreate: Created pod: 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc
2025-04-27T09:01:06Z 1 Job 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b922866 SuccessfulCreate: Created pod: 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b9flcl4
2025-04-27T09:01:51Z 1 Job 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b922866 Completed: Job completed
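The same events.json can also be filtered down to the Pod-level events in the namespace; this is just a variant of the query above (output omitted here), and it may surface the ErrImagePull/BackOff events for the failed Pod directly:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-rosa-sts-ovn/1916402286188302336/artifacts/e2e-rosa-sts-ovn/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-deployment-validation-operator" and .involvedObject.kind == "Pod") | (.firstTimestamp // .eventTime) + " " + .involvedObject.name + " " + .reason + ": " + .message] | sort[]'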
The successful Pod was still there at gather-time, but the failed Pod was not:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-rosa-sts-ovn/1916402286188302336/artifacts/e2e-rosa-sts-ovn/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.namespace == "openshift-deployment-validation-operator" and .metadata.ownerReferences[0].kind == "Job") | .status.phase + " " + .metadata.name'
Succeeded 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b9flcl4
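Dropping the ownerReferences filter (again, just a variant of the query above; output omitted here) would show whether the failed Pod is gone from pods.json entirely or merely re-parented:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-rosa-sts-ovn/1916402286188302336/artifacts/e2e-rosa-sts-ovn/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.namespace == "openshift-deployment-validation-operator") | .status.phase + " " + .metadata.name'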
Checking the journal from the Node hosting the failed Pod:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-rosa-sts-ovn/1916402286188302336/artifacts/e2e-rosa-sts-ovn/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-deployment-validation-operator" and (tostring | contains("48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc"))) | (.firstTimestamp // .eventTime) + " " + (.count | tostring) + " " + (.involvedObject | .kind + " " + .name) + " " + .reason + ": " + .message] | sort[0]'
2025-04-27T09:00:40.293676Z null Pod 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc Scheduled: Successfully assigned openshift-deployment-validation-operator/48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc to ip-10-0-14-59.us-west-2.compute.internal
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-rosa-sts-ovn/1916402286188302336/artifacts/e2e-rosa-sts-ovn/gather-extra/artifacts/nodes/ip-10-0-14-59.us-west-2.compute.internal/journal | zgrep 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc
Apr 27 09:00:40.254884 ip-10-0-14-59 kubenswrapper[2489]: I0427 09:00:40.254852 2489 kubelet.go:2483] "SyncLoop ADD" source="api" pods=["openshift-deployment-validation-operator/48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc"]
...
Apr 27 09:00:42.993105 ip-10-0-14-59 kubenswrapper[2489]: E0427 09:00:42.993072 2489 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"pull\" with ErrImagePull: \"reading blob sha256:f586d2432afc674cb6429f8c674b6ceaeaa6e57b4dbb0d7e8141a23429d1ccd6: fetching blob: received unexpected HTTP status: 502 Bad Gateway\"" pod="openshift-deployment-validation-operator/48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc" podUID="ab3e4971-3427-4efd-ba2b-5337d251904c"
...
Apr 27 09:01:05.143057 ip-10-0-14-59 kubenswrapper[2489]: I0427 09:01:05.143023 2489 kubelet.go:2499] "SyncLoop DELETE" source="api" pods=["openshift-deployment-validation-operator/48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc"]
Apr 27 09:01:12.410898 ip-10-0-14-59 crio[2456]: time="2025-04-27T09:01:12.410818936Z" level=info msg="Creating container: openshift-deployment-validation-operator/48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc/extract" id=106de6a5-39c9-458e-8190-00f349111731 name=/runtime.v1.RuntimeService/CreateContainer
...
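A narrower filter over the same journal makes the ordering easier to see: the kubelet ADDs the Pod, hits ErrImagePull on the 502, gets the API-side DELETE, and CRI-O is still creating a container for it seven seconds later. This is just a grep over the lines already shown above:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-rosa-sts-ovn/1916402286188302336/artifacts/e2e-rosa-sts-ovn/gather-extra/artifacts/nodes/ip-10-0-14-59.us-west-2.compute.internal/journal | zgrep 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc | grep -E 'SyncLoop (ADD|DELETE)|ErrImagePull|Creating container'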
So the quay.io 502 seems to have caused the Kubelet to lose track of the failed Pod, the Job controller then recorded that loss as a Pod failure (I have release#64441 up hoping to collect the Job JSON), and the alert noticed the Job's failure count and fired. Or something like that.
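Until that gather lands, the counters the alert presumably keys off of can be read straight from the Job's status on a live cluster; this is a sketch (default handling of unset fields is an assumption), not output from this run:
$ oc -n openshift-deployment-validation-operator get jobs -o json | jq -r '.items[] | .metadata.name + " succeeded=" + ((.status.succeeded // 0) | tostring) + " failed=" + ((.status.failed // 0) | tostring)'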
Version-Release number of selected component
Seen in 4.19 ROSA CI.
How reproducible
Seems like a rare race:
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=48h&type=junit&search=KubeJobFailingSRE' | grep 'failures match'
periodic-ci-openshift-release-master-nightly-4.19-e2e-rosa-sts-ovn (all) - 9 runs, 67% failed, 17% of failures match = 11% impact
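A similar search for the registry error signature itself could show how often the quay.io 502s turn up across CI; this is a hypothetical variant of the query above (the search regex is an assumption), not something run for this report:
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=48h&type=junit&search=502+Bad+Gateway' | grep 'failures match'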
Steps to Reproduce
1. Run lots of CI.
2. Have stability issues on Quay.
3. Hope you trip into whatever race is going on here (the query sketched after this list can confirm whether the alert ends up firing).
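To check whether the alert is actually firing on a live cluster, something like the following against the Thanos querier route should work; this is a sketch (route, token handling, and the exact label matchers are assumptions), mirroring the label set in the result below:
$ TOKEN="$(oc whoami -t)"
$ HOST="$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')"
$ curl -sk -H "Authorization: Bearer ${TOKEN}" --data-urlencode 'query=ALERTS{alertname="KubeJobFailingSRE",alertstate="firing"}' "https://${HOST}/api/v1/query"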
Actual results
[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
Run #0: Failed (58s)
{ fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:760]: Unexpected error:
  <errors.aggregate | len:1, cap:1>:
  promQL query returned unexpected results:
  ALERTS{alertname!~"Watchdog|...",alertstate="firing",severity!="info",namespace!="openshift-e2e-loki"} >= 1
  [
    {
      "metric": {
        "__name__": "ALERTS",
        "alertname": "KubeJobFailingSRE",
        "alertstate": "firing",
        "job_name": "48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b922866",
        "namespace": "openshift-deployment-validation-operator",
        "owner_name": "48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b922866",
        "prometheus": "openshift-monitoring/k8s",
        "severity": "warning"
      },
      "value": [
        1745747190.593,
        "1"
      ]
    }
  ]
Expected results
Robust, quiet recovery from short registry hiccups.
is related to: OCPBUGS-55902 Add KubeJobFailingSRE to allowed alerts as a temporary CI exception (MODIFIED)