Description of problem
OpenShift retrieves OLM-catalog images by tag. Rarely, corrupted or otherwise problematic content is pushed to that tag, either in the canonical registry.redhat.io/redhat/... locations or in local user mirror registries. Some versions of OLM seem unable to recover from that, and will get stuck with:
- One healthy catalog Pod running the version from before the corruption.
- One unhealthy catalog Pod continually struggling with the corrupted image.
Ideally, the operator should notice this situation, and when the tag changes again (whether with a roll-forward fix, or by being reverted to an earlier working version), the operator should remove the stuck, struggling Pod and launch a new one to test out the new tag target.
Version-Release number of selected component
Seen in a 4.16 cluster, but likely affects other versions too.
How reproducible
Unclear, but my guess is that the steps below will reproduce it.
Steps to Reproduce
1. Configure an ImageTagMirrorSet to back registry.redhat.io/redhat/redhat-operator-index with a local mirror.
2. Push some garbage to the v4.y tag (whichever 4.y your cluster is currently running). For example, a 4.12 catalog might confuse a 4.22 cluster.
3. See the Pod testing out the new catalog struggle, with 0/1 Ready containers.
4. Push a corrected catalog to that v4.y tag.
5. See if that new catalog is running in-cluster.
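For step 1, an ImageTagMirrorSet along these lines should work (the mirror registry hostname is illustrative):

```yaml
apiVersion: config.openshift.io/v1
kind: ImageTagMirrorSet
metadata:
  name: redhat-operator-index-mirror
spec:
  imageTagMirrors:
  - source: registry.redhat.io/redhat/redhat-operator-index
    mirrors:
    - mirror.example.com/redhat/redhat-operator-index  # illustrative mirror host
```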
Actual results
New catalog is not running. E.g. from an extracted must-gather:
$ for X in pods/redhat-operators-*/*.yaml; do yaml2json < "${X}" | jq '{name: .metadata.name, "olm.catalogSource": .metadata.labels["olm.catalogSource"], contentImage: [.status.initContainerStatuses[] | select(.name == "extract-content").imageID][0], container: (.status.containerStatuses[0] | {restartCount, state})}'; done
{
  "name": "redhat-operators-8fl64",
  "olm.catalogSource": "redhat-operators",
  "contentImage": "registry.redhat.io/redhat/redhat-operator-index@sha256:021cbbfd9b3da554eaff9fb14f25bd4d2ed79629df31f7fea23e3f0eb326b2b5",
  "container": {
    "restartCount": 0,
    "state": {
      "running": {
        "startedAt": "2026-02-03T18:05:30Z"
      }
    }
  }
}
{
  "name": "redhat-operators-d4cmg",
  "olm.catalogSource": "",
  "contentImage": "registry.redhat.io/redhat/redhat-operator-index@sha256:210fa4ca556f36688bcab0bf949b698618f44380dc8f1e8a214ee62474664b7d",
  "container": {
    "restartCount": 240,
    "state": {
      "waiting": {
        "message": "back-off 5m0s restarting failed container=registry-server pod=redhat-operators-d4cmg_openshift-marketplace(d28f9bdb-b578-4a2e-a2ce-0cfc3d0351e7)",
        "reason": "CrashLoopBackOff"
      }
    }
  }
}
Despite newer content being pushed to the v4.16 tag, that cluster was still hammering away on the busted 210fa4c....
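As a manual check (and, cautiously, a workaround: deleting the stuck Pod so a replacement pulls the current tag, though the related OCPBUGS-31440 suggests removal itself can misbehave), crash-looping catalog Pods can be picked out of the status JSON with jq. A minimal, self-contained sketch against sample data shaped like the output above (pod name and file path are illustrative):

```shell
# Sample pod status shaped like the must-gather output above (illustrative values).
cat <<'EOF' > /tmp/pod.json
{
  "metadata": {"name": "redhat-operators-d4cmg"},
  "status": {
    "containerStatuses": [
      {"name": "registry-server", "restartCount": 240,
       "state": {"waiting": {"reason": "CrashLoopBackOff"}}}
    ]
  }
}
EOF

# Emit the pod's name only when its registry-server container is crash-looping.
jq -r 'select(.status.containerStatuses[]
              | select(.name == "registry-server")
              | .state.waiting.reason == "CrashLoopBackOff")
       | .metadata.name' /tmp/pod.json
# -> redhat-operators-d4cmg
```

In a live cluster, the same filter could drive e.g. `oc -n openshift-marketplace delete pod <name>` to force a fresh attempt at the current tag target.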
Expected results
The cluster should automatically notice the repaired tag within a registryPoll interval and replace the stuck Pod. For the 4.16 cluster that was must-gathered, that interval is 10m:
$ grep -rA1 ' registryPoll' operators.coreos.com
operators.coreos.com/catalogsources/community-operators.yaml:  registryPoll:
operators.coreos.com/catalogsources/community-operators.yaml-    interval: 10m
--
operators.coreos.com/catalogsources/redhat-marketplace.yaml:  registryPoll:
operators.coreos.com/catalogsources/redhat-marketplace.yaml-    interval: 10m
--
operators.coreos.com/catalogsources/redhat-operators.yaml:  registryPoll:
operators.coreos.com/catalogsources/redhat-operators.yaml-    interval: 10m
--
operators.coreos.com/catalogsources/certified-operators.yaml:  registryPoll:
operators.coreos.com/catalogsources/certified-operators.yaml-    interval: 10m
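That interval comes from spec.updateStrategy.registryPoll.interval on the CatalogSource; the relevant fields look roughly like this (abbreviated sketch, not the full object):

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operators
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: registry.redhat.io/redhat/redhat-operator-index:v4.16
  updateStrategy:
    registryPoll:
      interval: 10m  # how often OLM re-checks the tag for new content
```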
is related to: OCPBUGS-31440 crashed catalog source pods cannot be remove (Closed)