Description of problem
OpenShift retrieves OLM-catalog images by tag. Rarely, corrupted or otherwise problematic content is pushed to that tag, either in the canonical registry.redhat.io/redhat/... locations or in local user mirror registries. Some versions of OLM seem unable to recover from that, and will get stuck with:
- One healthy catalog Pod running the version from before the corruption.
- One unhealthy catalog Pod continually struggling with the corrupted image.
Ideally, the operator should notice this situation, and when the tag changes again (whether with a roll-forward fix, or by being reverted to an earlier working version), the operator should remove the stuck, struggling Pod and launch a new one to test out the new tag target.
Version-Release number of selected component
Seen in a 4.16 cluster, but likely affects other versions too.
How reproducible
Unclear, but my guess is that the steps below will reproduce it.
Steps to Reproduce
1. Configure an ImageTagMirrorSet to back registry.redhat.io/redhat/redhat-operator-index with a local mirror.
2. Push some garbage to the v4.y tag (whichever 4.y your cluster is currently running). For example, a 4.12 catalog might confuse a 4.22 cluster.
3. See the Pod testing out the new catalog struggle, with 0/1 Ready containers.
4. Push a corrected catalog to that v4.y tag.
5. See if that new catalog is running in-cluster.
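For step 1, an ImageTagMirrorSet along these lines should work (the mirror registry hostname is illustrative):

```yaml
apiVersion: config.openshift.io/v1
kind: ImageTagMirrorSet
metadata:
  name: redhat-operator-index-mirror
spec:
  imageTagMirrors:
  - source: registry.redhat.io/redhat/redhat-operator-index
    mirrors:
    - mirror.example.com/redhat/redhat-operator-index  # illustrative mirror host
```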
Actual results
New catalog is not running. E.g. from an extracted must-gather:
$ for X in pods/redhat-operators-*/*.yaml; do yaml2json < "${X}" | jq '{name: .metadata.name, "olm.catalogSource": .metadata.labels["olm.catalogSource"], contentImage: [.status.initContainerStatuses[] | select(.name == "extract-content").imageID][0], container: (.status.containerStatuses[0] | {restartCount, state})}'; done
{
  "name": "redhat-operators-8fl64",
  "olm.catalogSource": "redhat-operators",
  "contentImage": "registry.redhat.io/redhat/redhat-operator-index@sha256:021cbbfd9b3da554eaff9fb14f25bd4d2ed79629df31f7fea23e3f0eb326b2b5",
  "container": {
    "restartCount": 0,
    "state": {
      "running": {
        "startedAt": "2026-02-03T18:05:30Z"
      }
    }
  }
}
{
  "name": "redhat-operators-d4cmg",
  "olm.catalogSource": "",
  "contentImage": "registry.redhat.io/redhat/redhat-operator-index@sha256:210fa4ca556f36688bcab0bf949b698618f44380dc8f1e8a214ee62474664b7d",
  "container": {
    "restartCount": 240,
    "state": {
      "waiting": {
        "message": "back-off 5m0s restarting failed container=registry-server pod=redhat-operators-d4cmg_openshift-marketplace(d28f9bdb-b578-4a2e-a2ce-0cfc3d0351e7)",
        "reason": "CrashLoopBackOff"
      }
    }
  }
}
Despite newer content being pushed to the v4.16 tag, that cluster was still hammering away on the busted 210fa4c....
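As a manual check (and, cautiously, a workaround: deleting the stuck Pod so a replacement pulls the current tag, though the related OCPBUGS-31440 suggests removal itself can misbehave), crash-looping catalog Pods can be picked out of the status JSON with jq. A minimal, self-contained sketch against sample data shaped like the output above (pod name and file path are illustrative):

```shell
# Sample pod status shaped like the must-gather output above (illustrative values).
cat <<'EOF' > /tmp/pod.json
{
  "metadata": {"name": "redhat-operators-d4cmg"},
  "status": {
    "containerStatuses": [
      {"name": "registry-server", "restartCount": 240,
       "state": {"waiting": {"reason": "CrashLoopBackOff"}}}
    ]
  }
}
EOF

# Emit the pod's name only when its registry-server container is crash-looping.
jq -r 'select(.status.containerStatuses[]
              | select(.name == "registry-server")
              | .state.waiting.reason == "CrashLoopBackOff")
       | .metadata.name' /tmp/pod.json
# -> redhat-operators-d4cmg
```

In a live cluster, the same filter could drive e.g. `oc -n openshift-marketplace delete pod <name>` to force a fresh attempt at the current tag target.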
Expected results
The cluster should automatically notice the repaired tag within a registryPoll interval and replace the stuck Pod. For the 4.16 cluster that was must-gathered, that interval is 10m:
$ grep -rA1 ' registryPoll' operators.coreos.com
operators.coreos.com/catalogsources/community-operators.yaml:  registryPoll:
operators.coreos.com/catalogsources/community-operators.yaml-    interval: 10m
--
operators.coreos.com/catalogsources/redhat-marketplace.yaml:  registryPoll:
operators.coreos.com/catalogsources/redhat-marketplace.yaml-    interval: 10m
--
operators.coreos.com/catalogsources/redhat-operators.yaml:  registryPoll:
operators.coreos.com/catalogsources/redhat-operators.yaml-    interval: 10m
--
operators.coreos.com/catalogsources/certified-operators.yaml:  registryPoll:
operators.coreos.com/catalogsources/certified-operators.yaml-    interval: 10m
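That interval comes from spec.updateStrategy.registryPoll.interval on the CatalogSource; the relevant fields look roughly like this (abbreviated sketch, not the full object):

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operators
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: registry.redhat.io/redhat/redhat-operator-index:v4.16
  updateStrategy:
    registryPoll:
      interval: 10m  # how often OLM re-checks the tag for new content
```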
is related to: OCPBUGS-31440 crashed catalog source pods cannot be remove (Closed)