Type: Feature Request
Resolution: Done
Priority: Undefined
Fix Version/s: openshift-4.9
Affects Version/s: None
Component/s: OLM
Labels:
- rfe-approved-to-closed-done

Target Version:
None
Activity Type:
Product / Portfolio Work
Status Summary:
None
Blocked:
False
Blocked Reason:
None
Products:
None
Hierarchy Progress Bar:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete:
None
PX Impact Score:
PX Impact Range:
None
PX Priority Data:
None
PX Technical Impact:
None
PX Technical Impact Notes:
None
PX Scheduling Request:
None

1. Proposed title of this feature request

[RHOCP4.5] Catalog pod gets stuck at Terminating after node down

2. What is the nature and description of the request?

The catalog pod eviction problem was fixed in Bug 1862340.
However, the catalog pod still cannot be evacuated after a node goes down.

The reproduction steps are the same as in Bug 1862340.

I took rhocp410-worker1 down and waited a few minutes.
The catalog pod that had been running on that node got stuck in the 'Terminating' state and was never evacuated.

  $ oc get pods -o wide -n openshift-marketplace
  NAME                                    READY   STATUS        RESTARTS   AGE   IP            NODE                                             NOMINATED NODE   READINESS GATES
  certified-operators-78f86c48fd-wf8ch    1/1     Running       1          8h    10.131.2.7    rhocp410-worker5.rhocp410.cluster.sub.nec.test   <none>           <none>
  community-operators-5dc55cd79c-h87mm    1/1     Running       1          8h    10.131.2.11   rhocp410-worker5.rhocp410.cluster.sub.nec.test   <none>           <none>
  marketplace-operator-5c84994668-ltp77   1/1     Running       0          8h    10.131.0.24   rhocp410-master0.rhocp410.cluster.sub.nec.test   <none>           <none>
  my-redhat-operators-lhpvl               1/1     Terminating   0          8h    10.128.0.21   rhocp410-worker1.rhocp410.cluster.sub.nec.test   <none>           <none>
  redhat-marketplace-77bdbd866-rfmjf      1/1     Running       0          8h    10.131.2.9    rhocp410-worker5.rhocp410.cluster.sub.nec.test   <none>           <none>
  redhat-operators-5885464945-krb2w       1/1     Running       0          8h    10.131.2.10   rhocp410-worker5.rhocp410.cluster.sub.nec.test   <none>           <none>

The catalog pod is created by OLM, so OLM should delete the 'Terminating' pod and recreate it on another node.
Alternatively, a ReplicationController or a similar controller should be used to monitor it.

The catalog pod does not recover until a user manually removes the 'Terminating' pod.
This reduces availability. Please fix it.
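
For reference, the manual workaround mentioned above can be sketched roughly as follows. This is only an illustration, not part of the original report: the helper name `stuck_pods` is hypothetical, and it assumes the default `oc get pods` table layout shown earlier, where the third column is STATUS.

```shell
# Hypothetical helper: print the names of pods whose STATUS column
# is "Terminating". Reads `oc get pods` table output on stdin and
# skips the header row.
stuck_pods() {
  awk 'NR > 1 && $3 == "Terminating" { print $1 }'
}

# Possible use against a live cluster (requires a logged-in `oc` session).
# Force deletion only removes the pod object from the API server; it is
# an assumption here that the node is really down and will not restart
# the container.
#
#   oc get pods -n openshift-marketplace | stuck_pods |
#   while read -r pod; do
#     oc delete pod "$pod" -n openshift-marketplace --force --grace-period=0
#   done
```

Once the stuck pod object is gone, OLM is expected to recreate the catalog pod on a healthy node, which is the behavior this request asks OLM to perform on its own.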

Version information:

  $ oc version
  Client Version: 4.5.0-202007132037.p0-592b165
  Server Version: 4.5.8
  Kubernetes Version: v1.18.3+6c42de8

3. Why does the customer need this? (List the business requirements here)

NEC can accept that a pod with a PV is not recreated automatically without 'kubectl drain --force'.
But the catalog pod has no PV, so there is no reason for it to block recreation.
Blocking recreation reduces the high availability of OCP. Why is this not a bug?

4. List any affected packages or components.

OLM

is incorporated by

OCPBUGS-32183 OLM catalog pods do not recover from node failure

Closed

Assignee: Daniel Messer

Reporter: Masaki Furuta

Need Info From: None

Votes: 1

Watchers: 5

Created: 2020/11/09 2:33 AM

Updated: 2025/09/13 1:59 PM

Resolved: 2021/03/24 10:01 AM

Target start: None

Target end: None
