Observed behavior: The default OpenShift OLM catalog pods do not survive an outage of the node they are running on. The pods remain in the Terminating state, despite the tolerations that should move them off unresponsive nodes after 5 minutes at the latest.
Impact: Operators can no longer be installed or updated from catalogs whose pods were running on a node that has gone down.
Expected behavior: The catalog pods are automatically rescheduled onto the remaining nodes, and their gRPC API endpoints recover as a result.
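The 5-minute figure above matches the usual Kubernetes default of a 300-second toleration for the `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable` NoExecute taints. The sketch below is a simplified model of how that deadline can be worked out. It is not OLM or kubelet code. The helper name `eviction_deadline` is hypothetical, and the toleration list is an assumption, since this report does not show the actual pod spec.

```python
from datetime import datetime, timedelta, timezone

# Assumed tolerations: the usual Kubernetes defaults (300 s each).
# The real catalog pod spec is not shown in this report.
ASSUMED_TOLERATIONS = [
    {"key": "node.kubernetes.io/not-ready", "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 300},
    {"key": "node.kubernetes.io/unreachable", "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 300},
]


def eviction_deadline(taint_key, taint_added_at, tolerations):
    """Simplified model: when should a pod be evicted for a NoExecute taint?

    Returns taint_added_at if nothing tolerates the taint (evict at once),
    None if the taint is tolerated without a time limit, otherwise
    taint_added_at plus the shortest matching tolerationSeconds.
    Only key-based matching is modelled here.
    """
    matching = [
        t for t in tolerations
        if t.get("key") == taint_key and t.get("effect") == "NoExecute"
    ]
    if not matching:
        return taint_added_at
    limits = [t["tolerationSeconds"] for t in matching if "tolerationSeconds" in t]
    if not limits:
        return None
    return taint_added_at + timedelta(seconds=min(limits))


if __name__ == "__main__":
    node_lost = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    deadline = eviction_deadline(
        "node.kubernetes.io/unreachable", node_lost, ASSUMED_TOLERATIONS
    )
    print("pod should leave the node by:", deadline.isoformat())
```

Under these assumptions, a catalog pod that is still bound to the failed node well after this deadline is showing the behavior reported here.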
- incorporates: RFE-1371 [RHOCP4.5] Catalog pod gets stuck at Terminating after node down (Accepted)
- is cloned by: OCPBUGS-35305 [release-4.15] OLM catalog pods do not recover from node failure (Closed)
- is depended on by: OCPBUGS-35305 [release-4.15] OLM catalog pods do not recover from node failure (Closed)
- relates to: RFE-2737 Catalogsource pods should be backed by a built-in controller (Accepted)
- split to: OCPBUGS-36661 OLM catalogsource pods do not recover from node failure when registryPoll is none (Verified)
- links to: RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update