OpenShift Bugs / OCPBUGS-76420

Mass failures in install and testing due to SignatureValidationFailed


      (Feel free to update this bug's summary to be more specific.)
      Component Readiness has found a potential regression in the following test:

      [sig-devex][Feature:Templates] templateinstance readiness test should report ready soon after all annotated objects are ready [apigroup:template.openshift.io][apigroup:build.openshift.io] [Suite:openshift/conformance/parallel]

      Extreme regression detected.
      Fisher's Exact probability of a regression: 100.00%.
      Test pass rate dropped from 100.00% to 80.00%.

      Sample (being evaluated) Release: 4.22
      Start Time: 2026-02-02T00:00:00Z
      End Time: 2026-02-09T16:00:00Z
      Success Rate: 80.00%
      Successes: 20
      Failures: 5
      Flakes: 0
      Base (historical) Release: 4.21
      Start Time: 2026-01-04T00:00:00Z
      End Time: 2026-02-03T23:59:59Z
      Success Rate: 100.00%
      Successes: 64
      Failures: 0
      Flakes: 0
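For context on the regression figure, the sample and base counts above can be run through a one-sided Fisher's exact test. This is a minimal sketch using only the Python standard library. The function name `fisher_one_sided` is hypothetical, and Component Readiness's exact formula and rounding are not shown in this report, so the sketch is not expected to reproduce the 100.00% figure digit for digit.

```python
from math import comb


def fisher_one_sided(sample_pass, sample_fail, base_pass, base_fail):
    """Probability that the sample holds at least `sample_fail` of all
    observed failures if failures were spread at random (hypergeometric)."""
    total = sample_pass + sample_fail + base_pass + base_fail
    sample_n = sample_pass + sample_fail
    fails = sample_fail + base_fail
    denom = comb(total, fails)
    p = 0.0
    # Sum the tail: every table with as many or more failures in the sample.
    for k in range(sample_fail, min(sample_n, fails) + 1):
        p += comb(sample_n, k) * comb(total - sample_n, fails - k) / denom
    return p


# Counts taken from the report: 4.22 sample 20/5, 4.21 base 64/0.
p_value = fisher_one_sided(20, 5, 64, 0)
print(f"one-sided p-value: {p_value:.5f}")
print(f"confidence of regression: {(1 - p_value) * 100:.2f}%")
```

A small p-value here means that five failures landing entirely in the 25-run sample is very unlikely to be chance, which is consistent with the "extreme regression" label.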

      View the test details report for additional context.

      Initial suspect for the root cause:

      [Monitor:legacy-test-framework-invariants-pathological][sig-arch] events should not repeat pathologically
      {  2 events happened too frequently
      
      event happened 21 times, something is wrong: namespace/openshift-must-gather-nwnfb node/master-2 pod/must-gather-h2twt hmsg/802e69c41f - reason/BackOff Back-off pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1d38119f000dff5ea7f725e17d366bab5be2280d13afa432b1228e1fdc389740" (06:35:09Z) result=reject 
      event happened 21 times, something is wrong: namespace/openshift-must-gather-nwnfb node/master-2 pod/must-gather-h2twt hmsg/d5bf9afefc - reason/Failed Error: ImagePullBackOff (06:35:09Z) result=reject }
      

      This points to an infrastructure problem. I am triaging all known occurrences to this Jira.

      It looks like this resolved roughly a day and a half ago, so we'll have to keep an eye on it.

      Filed by: dgoodwin@redhat.com

      This goes beyond metal. AI analysis:

      Signature Verification Failures in Payload 4.22.0-0.nightly-2026-02-08-005048

      Two jobs using the same payload experienced image pull failures due to sigstore signature verification. The error in both cases is identical:

      SignatureValidationFailed: unable to pull image or OCI artifact:
        pull image err: Source image rejected: A signature was required, but no signature exists;
        artifact err: provided artifact is a container image
      

      All affected images are from quay.io/openshift-release-dev/ocp-v4.0-art-dev.

      Job 1: Azure Install Failure

      Job: periodic-ci-openshift-cluster-control-plane-machine-set-operator-release-4.22-periodics-e2e-azure
      Platform: Azure | Cluster: build01
      Impact: Complete install failure — cluster never came up

      Root cause chain:

      1. machine-config-daemon-pull.service failed to pull the MCD image (sha256:4f5f9954...) on all 3 master nodes
      2. kubelet could not start (depends on MCD pull)
      3. No control plane pods scheduled
      4. bootkube timed out waiting for etcd/kube-apiserver
      5. Installer exited with code 5 (bootstrap timeout)

      Evidence location: Node journal logs (control-plane/10.0.0.{4,5,6}/journals/journal.log) in the log bundle at ipi-install-install/artifacts/.

      Scale: ~5,784 total image pull errors across 3 master nodes (~1,845 on master-0, ~1,966 on master-1, ~1,973 on master-2). The pull retried every ~1.5 seconds from 01:11 to 01:57 UTC and never succeeded.
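The per-node counts above could be reproduced with something like the following sketch. It assumes each journal has already been read into a string and that every failed pull attempt logs one line containing the error text quoted earlier. The helper name `count_pull_errors` and the sample journal lines are hypothetical, not taken from the actual log bundle.

```python
from collections import Counter

# Error text quoted in the report; assumed to appear once per failed pull.
MARKER = "A signature was required, but no signature exists"


def count_pull_errors(journals):
    """journals maps a node name to its journal text; returns hits per node."""
    counts = Counter()
    for node, text in journals.items():
        counts[node] = sum(1 for line in text.splitlines() if MARKER in line)
    return counts


# Hypothetical sample input standing in for journal.log contents.
sample = {
    "master-0": "pull failed: Source image rejected: " + MARKER + "\nkubelet waiting\n",
    "master-1": ("pull failed: Source image rejected: " + MARKER + "\n") * 2,
}
counts = count_pull_errors(sample)
print(dict(counts), "total:", sum(counts.values()))
```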

      Job 2: Metal IPI Upgrade — Test Failures

      Job: periodic-ci-openshift-release-master-nightly-4.22-upgrade-from-stable-4.21-e2e-metal-ipi-ovn-upgrade
      Platform: Metal IPI | Cluster: build09
      Upgrade: 4.21.0-0.nightly-2026-02-05-184824 → 4.22.0-0.nightly-2026-02-08-005048
      Impact: Install and upgrade succeeded, but post-upgrade tests failed en masse

      Existing running pods were unaffected (images already cached), but any new pod creation after upgrade — e2e test pods, debug pods, must-gather pods, build pods — failed to pull 4.22 images.

      Scale:

      Metric                   Count
      Affected image digests   3 (sha256:5d52dda6..., sha256:1d38119f..., sha256:295c2081...)
      Affected nodes           5 of 6 (master-0, master-2, worker-0, worker-1, worker-2)
      Affected namespaces      78
      Affected pods            62
      E2E test failures        104 of 3,863
      Monitor test failures    112 of 2,223

      Evidence locations:

      Cluster events (gather-extra/artifacts/events.json):

      Failed to pull image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5d52dda6d305a81830d9ddab3045adfa019241ca016e68eb9736a40206a6ebf5":
      SignatureValidationFailed: ...Source image rejected: A signature was required, but no signature exists
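One way to pull the affected nodes and pods out of events.json is sketched below. It assumes the file is a standard Kubernetes event list (`items[]` with `message`, `involvedObject`, and `source.host` fields). The function name `signature_failures` and the sample event are hypothetical and abbreviated.

```python
import json
from collections import defaultdict


def signature_failures(events_json):
    """Group pods hit by SignatureValidationFailed events by reporting node."""
    doc = json.loads(events_json)
    by_node = defaultdict(set)
    for ev in doc.get("items", []):
        if "SignatureValidationFailed" not in ev.get("message", ""):
            continue
        obj = ev.get("involvedObject", {})
        node = ev.get("source", {}).get("host", "unknown")
        by_node[node].add((obj.get("namespace"), obj.get("name")))
    return by_node


# Hypothetical, abbreviated sample in the assumed event-list layout.
sample_events = json.dumps({"items": [
    {"message": "Failed to pull image: SignatureValidationFailed: Source image rejected",
     "involvedObject": {"namespace": "default", "name": "master-0-debug-p5c2p"},
     "source": {"host": "master-0"}},
    {"message": "Started container", "involvedObject": {"name": "other"},
     "source": {"host": "master-1"}},
]})
result = signature_failures(sample_events)
for node, pods in sorted(result.items()):
    print(node, len(pods), "pod(s)")
```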
      

      E2E monitor tests (baremetalds-e2e-test/artifacts/junit/e2e-monitor-tests__20260208-040623.xml):

      • platform pods in ns/openshift-infra should not fail to start — 20 container start failures, recycler-for-nfs-b8npw on worker-1, cause SignatureValidationFailed starting 04:48 UTC
      • platform pods in ns/default should not fail to start — 44 container start failures, master-0-debug-p5c2p, cause SignatureValidationFailed starting 06:14 UTC
      • events should not repeat pathologically in e2e namespaces — 64 pathological events, ImagePullBackOff across e2e test pods, service-network-monitor repeated 684 times
      • should not encounter ErrImagePull in non-openshift namespace pods — 67 ErrImagePull intervals

      E2E tests (baremetalds-e2e-test/artifacts/junit/junit_e2e__20260208-040623.xml):

      • 8 oc adm must-gather tests failed directly with SignatureValidationFailed pulling sha256:1d38119f...
      • 88 tests timed out or were interrupted — cascade failures from pods stuck in ImagePullBackOff (builds, deployments, oauth test servers, etc.)
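The split between direct and cascade failures could be approximated from the JUnit file roughly as follows. This sketch assumes the conventional JUnit layout (`testcase` elements with an optional `failure` child). The function name `split_failures` and the sample XML are hypothetical, and a retried test may appear more than once in real output, so counts from this sketch need not match the figures above.

```python
import xml.etree.ElementTree as ET


def split_failures(junit_xml, marker="SignatureValidationFailed"):
    """Return (direct, other): failed test names whose failure text does
    or does not mention the marker string."""
    root = ET.fromstring(junit_xml)
    direct, other = [], []
    for case in root.iter("testcase"):
        failure = case.find("failure")
        if failure is None:
            continue  # passed or skipped
        text = (failure.get("message") or "") + (failure.text or "")
        (direct if marker in text else other).append(case.get("name"))
    return direct, other


# Hypothetical sample standing in for the real JUnit file.
sample_xml = """<testsuite>
  <testcase name="must-gather runs"><failure>SignatureValidationFailed: no signature exists</failure></testcase>
  <testcase name="build completes"><failure message="timed out waiting for pod"/></testcase>
  <testcase name="passes"/>
</testsuite>"""
direct, other = split_failures(sample_xml)
print("direct:", direct, "cascade or unrelated:", other)
```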

      Conclusion

      This is not a registry outage — quay.io was reachable and serving image manifests. The failure is at the signature verification layer: the ClusterImagePolicy required a sigstore signature but none could be found for these image digests. This points to either a sigstore/signature store outage or a signing pipeline failure for payload 4.22.0-0.nightly-2026-02-08-005048 around 2026-02-08 01:00–07:00 UTC.

              rhn-engineering-dgoodwin Devan Goodwin
              openshift-trt OpenShift Technical Release Team