OCPBUGS-75921

OLMv0 catalog recovery after troublesome image tagging

    • Bug
    • Resolution: Unresolved
    • Major
    • 4.15, 4.16
    • OLM
    • Moderate
    • Rejected
    • Xatu Sprint 284

      Description of problem

      OpenShift retrieves OLM-catalog images by tag. Rarely, corrupted or otherwise problematic content is pushed to that tag, either in the canonical registry.redhat.io/redhat/... locations or in local user mirror registries. Some versions of OLM seem to be unable to recover from that, and will stick with:

      • One healthy catalog Pod running the version from before the corruption.
      • One unhealthy catalog Pod continually struggling with the corrupted image.

      Ideally, the operator should notice this situation, and when the tag changes again (whether with a roll-forward fix, or by being reverted to an earlier working version), the operator should remove the stuck, struggling Pod and launch a new one to test out the new tag target.
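The stuck state the operator would need to detect can be sketched as a check over Pod container status. This is a minimal illustration only: the sample status JSON is hypothetical (modeled on the must-gather output later in this report), and the restart-count threshold is an arbitrary assumption, not anything OLM actually uses.

```shell
#!/usr/bin/env bash
# Hypothetical detection sketch: flag a catalog Pod as "stuck" when a
# container is waiting in CrashLoopBackOff with many restarts.  The JSON
# below is sample data modeled on the must-gather output in this report.
pod_status='{
  "containerStatuses": [
    {
      "name": "registry-server",
      "restartCount": 240,
      "state": {"waiting": {"reason": "CrashLoopBackOff"}}
    }
  ]
}'

# Count containers crash-looping past an arbitrary restart threshold.
stuck=$(printf '%s' "$pod_status" | jq '[.containerStatuses[]
  | select(.state.waiting.reason == "CrashLoopBackOff" and .restartCount > 5)]
  | length')

if [ "$stuck" -gt 0 ]; then
  echo "pod is stuck; replace it when the tag changes"
fi
```

A real implementation would live in the operator's reconcile loop rather than a script, but the condition it needs to recognize is the same.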

      Version-Release number of selected component

      Seen in a 4.16 cluster, but likely affects other versions too.

      How reproducible

      Unclear, but my guess is that it will be reproducible.

      Steps to Reproduce

      1. Configure an ImageTagMirrorSet to back registry.redhat.io/redhat/redhat-operator-index with a local mirror.
      2. Push some garbage to the v4.y tag (whichever 4.y your cluster is currently running). For example, a 4.12 catalog might confuse a 4.22 cluster.
      3. See the Pod testing out the new catalog struggle, with 0/1 Ready containers.
      4. Push a corrected catalog to that v4.y tag.
      5. See if that new catalog is running in-cluster.
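Step 1 might be configured with an ImageTagMirrorSet along these lines (the resource name and mirror hostname are placeholders, not from the original report):

```yaml
apiVersion: config.openshift.io/v1
kind: ImageTagMirrorSet
metadata:
  name: redhat-operator-index-mirror   # illustrative name
spec:
  imageTagMirrors:
  - source: registry.redhat.io/redhat/redhat-operator-index
    mirrors:
    - mirror.example.com/redhat/redhat-operator-index   # placeholder mirror
```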

      Actual results

      New catalog is not running. E.g. from an exposed must-gather:

      $ for X in pods/redhat-operators-*/*.yaml; do yaml2json < "${X}" | jq '{name: .metadata.name, "olm.catalogSource": .metadata.labels["olm.catalogSource"], contentImage: [.status.initContainerStatuses[] | select(.name == "extract-content").imageID][0], container: (.status.containerStatuses[0] | {restartCount, state})}'; done
      {
        "name": "redhat-operators-8fl64",
        "olm.catalogSource": "redhat-operators",
        "contentImage": "registry.redhat.io/redhat/redhat-operator-index@sha256:021cbbfd9b3da554eaff9fb14f25bd4d2ed79629df31f7fea23e3f0eb326b2b5",
        "container": {
          "restartCount": 0,
          "state": {
            "running": {
              "startedAt": "2026-02-03T18:05:30Z"
            }
          }
        }
      }
      {
        "name": "redhat-operators-d4cmg",
        "olm.catalogSource": "",
        "contentImage": "registry.redhat.io/redhat/redhat-operator-index@sha256:210fa4ca556f36688bcab0bf949b698618f44380dc8f1e8a214ee62474664b7d",
        "container": {
          "restartCount": 240,
          "state": {
            "waiting": {
              "message": "back-off 5m0s restarting failed container=registry-server pod=redhat-operators-d4cmg_openshift-marketplace(d28f9bdb-b578-4a2e-a2ce-0cfc3d0351e7)",
              "reason": "CrashLoopBackOff"
            }
          }
        }
      }
      

      Despite newer images being available under the v4.16 tag, that cluster was still hammering away on the busted 210fa4c....

      Expected results

      The cluster automatically notices the recovered catalog image within a registryPoll interval and replaces the struggling Pod. For the 4.16 cluster that was must-gathered, that would be 10m:

      $ grep -rA1 ' registryPoll' operators.coreos.com
      operators.coreos.com/catalogsources/community-operators.yaml:    registryPoll:
      operators.coreos.com/catalogsources/community-operators.yaml-      interval: 10m
      --
      operators.coreos.com/catalogsources/redhat-marketplace.yaml:    registryPoll:
      operators.coreos.com/catalogsources/redhat-marketplace.yaml-      interval: 10m
      --
      operators.coreos.com/catalogsources/redhat-operators.yaml:    registryPoll:
      operators.coreos.com/catalogsources/redhat-operators.yaml-      interval: 10m
      --
      operators.coreos.com/catalogsources/certified-operators.yaml:    registryPoll:
      operators.coreos.com/catalogsources/certified-operators.yaml-      interval: 10m
      
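For reference, that interval lives under the CatalogSource's updateStrategy. A minimal sketch, assuming the v4.16 image tag from the version section above; the 10m interval matches the grep output:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operators
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: registry.redhat.io/redhat/redhat-operator-index:v4.16
  updateStrategy:
    registryPoll:
      interval: 10m
```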

              Jordan Keister
              W. Trevor King
              Xia Zhao
              Votes: 0
              Watchers: 9