  1. OpenShift Request For Enhancement
  2. RFE-1371

[RHOCP4.5] Catalog pod gets stuck at Terminating after node down



      1. Proposed title of this feature request

      • [RHOCP4.5] Catalog pod gets stuck at Terminating after node down

      2. What is the nature and description of the request?

      • The catalog pod eviction problem was fixed in Bug 1862340.
      • However, the catalog pod still cannot be evacuated after a node goes down.
      • The reproduction steps are the same as in Bug 1862340.
      • I took rhocp410-worker1 down and waited a few minutes.
      • The catalog pod that had been running on that node got stuck in the 'Terminating' state and was never evacuated:
        $ oc get pods -o wide -n openshift-marketplace
        NAME                                    READY   STATUS        RESTARTS   AGE   IP            NODE                                             NOMINATED NODE   READINESS GATES
        certified-operators-78f86c48fd-wf8ch    1/1     Running       1          8h                  rhocp410-worker5.rhocp410.cluster.sub.nec.test   <none>           <none>
        community-operators-5dc55cd79c-h87mm    1/1     Running       1          8h                  rhocp410-worker5.rhocp410.cluster.sub.nec.test   <none>           <none>
        marketplace-operator-5c84994668-ltp77   1/1     Running       0          8h                  rhocp410-master0.rhocp410.cluster.sub.nec.test   <none>           <none>
        my-redhat-operators-lhpvl               1/1     Terminating   0          8h                  rhocp410-worker1.rhocp410.cluster.sub.nec.test   <none>           <none>
        redhat-marketplace-77bdbd866-rfmjf      1/1     Running       0          8h                  rhocp410-worker5.rhocp410.cluster.sub.nec.test   <none>           <none>
        redhat-operators-5885464945-krb2w       1/1     Running       0          8h                  rhocp410-worker5.rhocp410.cluster.sub.nec.test   <none>           <none>
      • OLM is the component that recreates the catalog pod, so OLM should delete the 'Terminating' pod and recreate it on another node.
      • Alternatively, a ReplicationController or a similar controller should be used to manage it.
      • The catalog pod does not recover until a user manually deletes the 'Terminating' pod.
      • This reduces availability. Please fix it.
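
      The manual cleanup described above can be scripted. The following is a minimal sketch, not an official procedure: the function name stuck_terminating_pods is hypothetical, and it assumes the default column order of 'oc get pods' (NAME, READY, STATUS, ...), so STATUS is the third field with or without '-o wide'.

```shell
#!/bin/sh
# Hypothetical helper: reads `oc get pods` output on stdin and prints the
# names of pods whose STATUS column is "Terminating".
# Assumes the default column order NAME, READY, STATUS, ...
stuck_terminating_pods() {
    # NR > 1 skips the header row; $3 is the STATUS column.
    awk 'NR > 1 && $3 == "Terminating" { print $1 }'
}

# Example use against a live cluster (left commented out so that sourcing
# this file has no side effects):
#
#   oc get pods -n openshift-marketplace | stuck_terminating_pods |
#   while read -r pod; do
#       oc delete pod "$pod" -n openshift-marketplace --grace-period=0 --force
#   done
```

      A force delete only removes the pod object from the API server without waiting for the kubelet to confirm termination, so it should only be used when the node is known to be down. It is a workaround for the behavior reported here, not the requested fix.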

      Version information:

        $ oc version
        Client Version: 4.5.0-202007132037.p0-592b165
        Server Version: 4.5.8
        Kubernetes Version: v1.18.3+6c42de8

      3. Why does the customer need this? (List the business requirements here)

      • NEC can accept that a pod with a PV is not recreated automatically without 'kubectl drain --force'.
      • But the catalog pod has no PV, so there is no reason for it to block recreation.
      • This blocked recreation reduces the high availability of OCP. Why is this not a bug?

      4. List any affected packages or components.


              DanielMesser Daniel Messer
              rhn-support-mfuruta Masaki Furuta
