OpenShift Bugs / OCPBUGS-55635

Job Pod lost when 'SyncLoop DELETE' arrived during container creation

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version: 4.19
    • Component: Node / Kubelet
    • Quality / Stability / Reliability

      Description of problem

      Likely a longstanding race window, but recent Quay 502s seem to cause image-pull delays that trip some kind of Pod-tracking race, which makes the Job controller mad. For example, in this run, KubeJobFailingSRE was firing because the Job had 1 successful Pod and 1 failed Pod:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-rosa-sts-ovn/1916402286188302336/artifacts/e2e-rosa-sts-ovn/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-deployment-validation-operator" and .involvedObject.kind == "Job") | (.firstTimestamp // .eventTime) + " " + (.count | tostring) + " " + (.involvedObject | .kind + " " + .name) + " " + .reason + ": " + .message] | sort[]'
      2025-04-27T09:00:40Z 1 Job 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b922866 SuccessfulCreate: Created pod: 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc
      2025-04-27T09:01:06Z 1 Job 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b922866 SuccessfulCreate: Created pod: 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b9flcl4
      2025-04-27T09:01:51Z 1 Job 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b922866 Completed: Job completed
      

      The successful Pod was still there at gather-time, but the failed Pod was not:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-rosa-sts-ovn/1916402286188302336/artifacts/e2e-rosa-sts-ovn/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.namespace == "openshift-deployment-validation-operator" and .metadata.ownerReferences[0].kind == "Job") | .status.phase + " " + .metadata.name'
      Succeeded 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b9flcl4
      
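      Nothing in the gathered Pod data says who issued that DELETE. If a run also captures kube-apiserver audit logs (the gather-audit-logs step; the local path and file layout below are assumptions, not verified against this run), a filter along these lines would show which client deleted the Pod:

      $ # hypothetical local copy of the audit-log artifacts; adjust the glob to the real layout
      $ zcat -f audit-logs/*kube-apiserver*.log* | jq -r 'select(.objectRef.resource == "pods" and .objectRef.name == "48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc" and .verb == "delete") | .requestReceivedTimestamp + " " + .verb + " by " + .user.username'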

      Checking the journal from the Node hosting the failed Pod:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-rosa-sts-ovn/1916402286188302336/artifacts/e2e-rosa-sts-ovn/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-deployment-validation-operator" and (tostring | contains("48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc"))) | (.firstTimestamp // .eventTime) + " " + (.count | tostring) + " " + (.involvedObject | .kind + " " + .name) + " " + .reason + ": " + .message] | sort[0]'
      2025-04-27T09:00:40.293676Z null Pod 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc Scheduled: Successfully assigned openshift-deployment-validation-operator/48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc to ip-10-0-14-59.us-west-2.compute.internal
      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-rosa-sts-ovn/1916402286188302336/artifacts/e2e-rosa-sts-ovn/gather-extra/artifacts/nodes/ip-10-0-14-59.us-west-2.compute.internal/journal | zgrep 48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc
      Apr 27 09:00:40.254884 ip-10-0-14-59 kubenswrapper[2489]: I0427 09:00:40.254852    2489 kubelet.go:2483] "SyncLoop ADD" source="api" pods=["openshift-deployment-validation-operator/48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc"]
      ...
      Apr 27 09:00:42.993105 ip-10-0-14-59 kubenswrapper[2489]: E0427 09:00:42.993072    2489 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"pull\" with ErrImagePull: \"reading blob sha256:f586d2432afc674cb6429f8c674b6ceaeaa6e57b4dbb0d7e8141a23429d1ccd6: fetching blob: received unexpected HTTP status: 502 Bad Gateway\"" pod="openshift-deployment-validation-operator/48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc" podUID="ab3e4971-3427-4efd-ba2b-5337d251904c"
      ...
      Apr 27 09:01:05.143057 ip-10-0-14-59 kubenswrapper[2489]: I0427 09:01:05.143023    2489 kubelet.go:2499] "SyncLoop DELETE" source="api" pods=["openshift-deployment-validation-operator/48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc"]
      Apr 27 09:01:12.410898 ip-10-0-14-59 crio[2456]: time="2025-04-27T09:01:12.410818936Z" level=info msg="Creating container: openshift-deployment-validation-operator/48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b95gsjc/extract" id=106de6a5-39c9-458e-8190-00f349111731 name=/runtime.v1.RuntimeService/CreateContainer
      ...
      
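      That CRI-O "Creating container" at 09:01:12, seven seconds after the "SyncLoop DELETE", is the race in action: CRI-O is still creating a container for a Pod the kubelet has already been told to delete. To hunt for the same signature elsewhere, a rough pass over locally downloaded journals might look like this (a sketch; it assumes the artifacts/nodes/*/journal layout used above and the log formats shown in the excerpts):

      $ for journal in artifacts/nodes/*/journal; do
      >   zcat -f "${journal}" | awk -v src="${journal}" '
      >     # Remember pods the kubelet was told to delete.
      >     /"SyncLoop DELETE"/ && match($0, /pods=\["[^"]+"\]/) {
      >       deleted[substr($0, RSTART + 7, RLENGTH - 9)] = 1
      >     }
      >     # Flag CRI-O container creations that show up after that DELETE.
      >     /Creating container:/ && match($0, /Creating container: [^"]+/) {
      >       pod = substr($0, RSTART + 20, RLENGTH - 20)
      >       sub("/[^/]*$", "", pod)  # drop the trailing container name, keep namespace/pod
      >       if (pod in deleted) print src ": create-after-delete " pod
      >     }'
      > done

      A hit only means the two messages appeared in that order on one node; confirming it is this race still takes a look at the surrounding kubelet output.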

      So the quay.io 502 seems to have caused the Kubelet to lose track of the sad Pod, and then the Job complains about the loss (I have release#64441 up hoping to collect the Job JSON), and then the alert notices the Job complaint and fires. Or something like that.
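
      Until that lands, the Job controller's view is easy to pull from a live cluster that hits this; a sketch using the standard batch/v1 Job status fields:

      $ oc -n openshift-deployment-validation-operator get jobs -o json \
      >   | jq -r '.items[] | .metadata.name
      >       + " succeeded=" + ((.status.succeeded // 0) | tostring)
      >       + " failed=" + ((.status.failed // 0) | tostring)
      >       + " conditions=" + ([.status.conditions[]?.type] | join(","))'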

      Version-Release number of selected component

      Seen in 4.19 ROSA CI.

      How reproducible

      Seems like a rare race:

      $ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=48h&type=junit&search=KubeJobFailingSRE' | grep 'failures match'
      periodic-ci-openshift-release-master-nightly-4.19-e2e-rosa-sts-ovn (all) - 9 runs, 67% failed, 17% of failures match = 11% impact
      

      Steps to Reproduce

      1. Run lots of CI.
      2. Have stability issues on Quay.
      3. Hope you trip into whatever race is going on here (or try to force it directly, as sketched below).
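
      A more targeted attempt (not verified to actually reproduce this; the namespace, Job name, and image below are made up for illustration) is to force a slow or failing pull and then delete the Pod while the kubelet is still retrying, which mimics the "SyncLoop DELETE" arriving mid-creation:

      $ oc create namespace race-repro
      $ oc -n race-repro create job race-repro --image=quay.io/openshift/does-not-exist:latest -- /bin/true
      $ # wait for the Pod to sit in ErrImagePull/ImagePullBackOff, then delete it mid-retry
      $ oc -n race-repro delete pod -l job-name=race-repro --wait=false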

      Actual results

      Job failures like:

      : [sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
      Run #0: Failed	58s
      {  fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:760]: Unexpected error:
          <errors.aggregate | len:1, cap:1>: 
          promQL query returned unexpected results:
          ALERTS{alertname!~"Watchdog|...",alertstate="firing",severity!="info",namespace!="openshift-e2e-loki"} >= 1
          [
            {
              "metric": {
                "__name__": "ALERTS",
                "alertname": "KubeJobFailingSRE",
                "alertstate": "firing",
                "job_name": "48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b922866",
                "namespace": "openshift-deployment-validation-operator",
                "owner_name": "48156a76b4769311756fea32da4971a9c05e03c9b16d03e933b69845b922866",
                "prometheus": "openshift-monitoring/k8s",
                "severity": "warning"
              },
              "value": [
                1745747190.593,
                "1"
              ]
            }
          ]
      
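      To check the alert directly on a live cluster instead of going through the e2e test, the same ALERTS series can be queried through the monitoring stack (a sketch using the openshift-monitoring thanos-querier route and a bearer token; needs a user allowed to query cluster monitoring):

      $ TOKEN="$(oc whoami -t)"
      $ HOST="$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')"
      $ curl -skG "https://${HOST}/api/v1/query" \
      >     -H "Authorization: Bearer ${TOKEN}" \
      >     --data-urlencode 'query=ALERTS{alertname="KubeJobFailingSRE",alertstate="firing"}'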

      Expected results

      Robust, quiet recovery from short registry hiccups.

        Assignee: Harshal Patil (harpatil@redhat.com)
        Reporter: W. Trevor King (trking)
        QA Contact: Min Li