Feature Request
Resolution: Done
1. Proposed title of this feature request
- [RHOCP4.5] Catalog pod gets stuck in Terminating after node goes down
2. What is the nature and description of the request?
- The catalog pod eviction problem was fixed in Bug 1862340.
- However, the catalog pod still cannot be evacuated after a node goes down.
- The reproduction steps are the same as in Bug 1862340.
- I brought rhocp410-worker1 down and waited for a few minutes.
- The catalog pod that was running on the node got stuck in the 'Terminating' state and was never evacuated.
$ oc get pods -o wide -n openshift-marketplace
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
certified-operators-78f86c48fd-wf8ch 1/1 Running 1 8h 10.131.2.7 rhocp410-worker5.rhocp410.cluster.sub.nec.test <none> <none>
community-operators-5dc55cd79c-h87mm 1/1 Running 1 8h 10.131.2.11 rhocp410-worker5.rhocp410.cluster.sub.nec.test <none> <none>
marketplace-operator-5c84994668-ltp77 1/1 Running 0 8h 10.131.0.24 rhocp410-master0.rhocp410.cluster.sub.nec.test <none> <none>
my-redhat-operators-lhpvl 1/1 Terminating 0 8h 10.128.0.21 rhocp410-worker1.rhocp410.cluster.sub.nec.test <none> <none>
redhat-marketplace-77bdbd866-rfmjf 1/1 Running 0 8h 10.131.2.9 rhocp410-worker5.rhocp410.cluster.sub.nec.test <none> <none>
redhat-operators-5885464945-krb2w 1/1 Running 0 8h 10.131.2.10 rhocp410-worker5.rhocp410.cluster.sub.nec.test <none> <none>
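For reference, the commands below are a sketch (using the node and pod names from the output above; exact output will vary per cluster) of how to confirm that the node is NotReady and which controller owns the stuck pod:
$ oc get node rhocp410-worker1.rhocp410.cluster.sub.nec.test
$ oc get pod my-redhat-operators-lhpvl -n openshift-marketplace -o jsonpath='{.metadata.ownerReferences[*].kind}'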
- The catalog pod is created and managed by OLM, so OLM should delete the 'Terminating' pod and recreate it on another node.
- Alternatively, a ReplicationController or a similar controller should be used to monitor it.
- The catalog pod does not recover until the user removes the 'Terminating' pod manually (see the example below).
- This reduces availability. Please fix it.
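The manual workaround is roughly the following force deletion (a sketch; the pod name comes from the output above, and it assumes the node has already been confirmed as down). After the stuck pod is removed, OLM recreates the catalog pod on another node:
$ oc delete pod my-redhat-operators-lhpvl -n openshift-marketplace --grace-period=0 --force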
Version information:
$ oc version
Client Version: 4.5.0-202007132037.p0-592b165
Server Version: 4.5.8
Kubernetes Version: v1.18.3+6c42de8
3. Why does the customer need this? (List the business requirements here)
- NEC can understand that a pod with a PV attached is not recreated automatically without 'kubectl drain --force' (see the sketch below).
- However, the catalog pod has no PV, so there is no reason to prevent it from being recreated.
- This behavior reduces the high availability of OCP. Why is this not considered a bug?
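For completeness, draining the failed node would look roughly like the following (a sketch; flag names may differ slightly between versions, and it should only be run once the node is confirmed unrecoverable):
$ oc adm drain rhocp410-worker1.rhocp410.cluster.sub.nec.test --ignore-daemonsets --delete-local-data --force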
4. List any affected packages or components.
- OLM
is incorporated by:
- OCPBUGS-32183: OLM catalog pods do not recover from node failure (Closed)