Observed behavior: Default OpenShift OLM catalog pods do not survive an outage of the node they are running on. The pods remain in the Terminating state, despite tolerations that should evict them from unresponsive nodes after 5 minutes at the latest.
Impact: Operators can no longer be installed or updated from catalogs whose pods were running on a node that has gone down.
Expected behavior: The catalog pods are automatically rescheduled onto the remaining nodes, and their gRPC API endpoints recover as a result.
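For triage, the observed state can be checked against pod JSON such as the output of `oc get pod -o json`. The following is a minimal sketch, not part of the original report: it assumes the 5-minute window comes from a `NoExecute` toleration for the `node.kubernetes.io/unreachable` taint with `tolerationSeconds: 300`, and treats a pod as stuck when its `deletionTimestamp` has passed and the pod still exists. The helper names, the one-minute slack, the pod name, and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Taint key Kubernetes places on nodes the control plane cannot reach.
UNREACHABLE = "node.kubernetes.io/unreachable"


def parse_ts(value):
    """Parse a Kubernetes RFC 3339 UTC timestamp such as 2024-04-10T12:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def unreachable_toleration_seconds(pod):
    """Return tolerationSeconds for the unreachable NoExecute taint, or None."""
    for tol in pod.get("spec", {}).get("tolerations", []):
        if tol.get("key") == UNREACHABLE and tol.get("effect") == "NoExecute":
            return tol.get("tolerationSeconds")
    return None


def is_stuck_terminating(pod, now, slack=timedelta(minutes=1)):
    """True if the pod is marked for deletion and still present past its deadline.

    The slack is a hypothetical margin to avoid flagging pods that are
    about to be removed normally.
    """
    deletion = pod.get("metadata", {}).get("deletionTimestamp")
    if deletion is None:
        return False  # pod is not being deleted at all
    return now > parse_ts(deletion) + slack


# Made-up pod object for illustration only; field names follow the Pod API.
example_pod = {
    "metadata": {
        "name": "example-catalog",
        "deletionTimestamp": "2024-04-10T12:00:00Z",
    },
    "spec": {
        "tolerations": [
            {
                "key": UNREACHABLE,
                "operator": "Exists",
                "effect": "NoExecute",
                "tolerationSeconds": 300,
            }
        ]
    },
}

if __name__ == "__main__":
    now = datetime(2024, 4, 10, 12, 10, tzinfo=timezone.utc)
    print(unreachable_toleration_seconds(example_pod))
    print(is_stuck_terminating(example_pod, now))
```

A pod that reports a 300-second toleration but is still flagged as stuck well after the node went down matches the observed behavior described above.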
- clones: OCPBUGS-32183 OLM catalog pods do not recover from node failure (Closed)
- depends on: OCPBUGS-32183 OLM catalog pods do not recover from node failure (Closed)
- links to: RHBA-2024:4151 OpenShift Container Platform 4.15.z bug fix update