Description of problem:
For various reasons, Pods may get evicted. Once they are evicted, the owner of the Pod should recreate the Pod so it is scheduled again.
With OLM, we can see that evicted Pods owned by CatalogSources are not rescheduled. As a result, all Subscriptions have a "ResolutionFailed=True" condition, which hinders an upgrade of the operator. Specifically, the customer sees this with the affected CatalogSource "multicluster-engine-CENSORED_NAME-redhat-operator-index" in the openshift-marketplace namespace (Pod name: "multicluster-engine-CENSORED_NAME-redhat-operator-index-5ng9j").
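To confirm the symptom on an affected cluster, the Subscriptions can be checked for the condition mentioned above. The following is a minimal sketch, not part of the original report: it assumes the output of `oc get subscriptions.operators.coreos.com -A -o json` has already been parsed into a dict, and the function name `subscriptions_with_resolution_failed` is hypothetical.

```python
def subscriptions_with_resolution_failed(sub_list):
    """Return (namespace, name) for every Subscription whose
    ResolutionFailed condition has status "True".

    sub_list is assumed to be the parsed JSON of a Kubernetes list
    object, i.e. a dict with an "items" key.
    """
    failing = []
    for sub in sub_list.get("items", []):
        conditions = sub.get("status", {}).get("conditions", [])
        for cond in conditions:
            # Condition status values are strings, not booleans.
            if cond.get("type") == "ResolutionFailed" and cond.get("status") == "True":
                meta = sub.get("metadata", {})
                failing.append((meta.get("namespace"), meta.get("name")))
                break
    return failing
```

If the list returned here covers every Subscription on the cluster, that matches the behavior described in this report.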
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.16.21
How reproducible:
Sometimes, when Pods are evicted on the cluster
Steps to Reproduce:
1. Set up an OpenShift Container Platform 4.16 cluster, install various Operators
2. Create a condition under which a Node evicts Pods (for example, by causing DiskPressure on the Node)
3. Observe whether any Pods owned by CatalogSources are evicted
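The observation in step 3 can be scripted. The sketch below is an illustration only, under these assumptions: the Pod list comes from `oc get pods -n openshift-marketplace -o json` parsed into a dict, evicted Pods are reported with `status.phase` "Failed" and `status.reason` "Evicted", and CatalogSource Pods carry an ownerReference of kind "CatalogSource". The function name `evicted_catalogsource_pods` is hypothetical.

```python
def evicted_catalogsource_pods(pod_list):
    """Return the names of evicted Pods that are owned by a CatalogSource.

    pod_list is assumed to be the parsed JSON of a Kubernetes Pod list,
    i.e. a dict with an "items" key.
    """
    evicted = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        # Assumption: an evicted Pod is left behind as phase=Failed, reason=Evicted.
        if status.get("phase") != "Failed" or status.get("reason") != "Evicted":
            continue
        owners = pod.get("metadata", {}).get("ownerReferences", [])
        # Assumption: the owning CatalogSource appears in ownerReferences.
        if any(owner.get("kind") == "CatalogSource" for owner in owners):
            evicted.append(pod["metadata"]["name"])
    return evicted
```

Any name returned here that has no running replacement Pod for the same CatalogSource would correspond to the problem described under "Actual results".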
Actual results:
If Pods owned by CatalogSources are being evicted, they are not recreated / rescheduled.
Expected results:
When Pods owned by CatalogSources are evicted, they are recreated / rescheduled.
Additional info:
- Discussion: https://redhat-internal.slack.com/archives/C3VS0LV41/p1726170881413389?thread_ts=1726126461.479019&cid=C3VS0LV41
- Support Case with "must-gather": 04003784
- is depended on by: OCPBUGS-46474 Evicted Pods owned by Catalogsource are not rescheduled (Verified)
- is duplicated by: OCPBUGS-41217 OLM catalogsource pods do not recover from node failure when registryPoll is none (Closed)
- is related to: OCPBUGS-45946 fix missing Pod disruption reasons (New)
- links to