Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.16.0
Affects Version/s: 4.14, 4.15
Component/s: OLM
Labels:
- triaged

Activity Type: Quality / Stability / Reliability
Blocked: False
Blocked Reason: None
Story Points: None
Severity: None
Regression: Yes
Target Backport Versions: 4.15
Target Version: 4.16.0
Release Blocker: Rejected
Sprint: Rasputin OLM Sprint 252
sprint_count: 1
SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:
PX Impact Score:
Release Note Status: Done
Release Note Type: Bug Fix
Release Note Text:

* Previously, default {olm} catalog pods backed by a `CatalogSource` object would not survive an outage of the node that they were being run on. The pods remained in termination state, despite the tolerations that should move them. This caused Operators to no longer be able to be installed or updated from related catalogs. This bug fix updates {olm} so catalog pods that get stuck in this state are deleted. As a result, catalog pods now correctly recover from planned or unplanned node maintenance. (link:https://issues.redhat.com/browse/OCPBUGS-32183[*OCPBUGS-32183*])


Escape Reason: None
Escape Impact: None
Corrective Measures: None
SDLC stage when should've been found: None

Observed behavior: Default OpenShift OLM catalog pods do not survive an outage of the node that they are running on. The pods remain in the Terminating state, despite the tolerations that should move them away from unresponsive nodes after 5 minutes at the latest.
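
The 5 minutes mentioned above matches the `tolerationSeconds: 300` that Kubernetes adds by default for the `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable` taints. As a rough illustration only, the Python sketch below computes when eviction would be due for a pod whose tolerations are given in the usual Pod-spec shape. The helper name `eviction_deadline` and the sample data are hypothetical, and the matching logic is simplified (it looks at `key` and `effect` only, not at `operator` or `value`).

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Taint placed on a node that the control plane can no longer reach.
UNREACHABLE = "node.kubernetes.io/unreachable"


def eviction_deadline(tolerations: list, taint_key: str,
                      tainted_at: datetime) -> Optional[datetime]:
    """Return when a pod becomes due for eviction under a NoExecute taint.

    Simplified sketch (assumption): a toleration matches if its key is
    empty or equal to taint_key and its effect is empty or NoExecute.
    Returns tainted_at when nothing tolerates the taint, and None when the
    taint is tolerated without any time limit.
    """
    matching = [
        t for t in tolerations
        if t.get("key") in (None, "", taint_key)
        and t.get("effect") in (None, "", "NoExecute")
    ]
    if not matching:
        return tainted_at  # not tolerated: eviction is due immediately
    seconds = [t["tolerationSeconds"] for t in matching
               if t.get("tolerationSeconds") is not None]
    if not seconds:
        return None  # tolerated indefinitely
    return tainted_at + timedelta(seconds=max(0, min(seconds)))


# Hypothetical sample: the default toleration shape with a 300-second window.
default_tolerations = [{
    "key": UNREACHABLE,
    "operator": "Exists",
    "effect": "NoExecute",
    "tolerationSeconds": 300,
}]
tainted_at = datetime(2021, 3, 24, 10, 0, tzinfo=timezone.utc)
deadline = eviction_deadline(default_tolerations, UNREACHABLE, tainted_at)
```

In the behavior reported here, the pod is marked for deletion once that deadline passes, but it never finishes terminating because the kubelet on the failed node cannot confirm the shutdown.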

Impact: Operators can no longer be installed or updated from catalogs whose pods were running on a node that has gone down.

Expected behavior: The catalog pods are automatically rescheduled on the remaining nodes, and their gRPC API endpoints recover as a result.
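
One way to reason about recovery is to identify pods that are marked for deletion but never actually go away. The Python sketch below is a minimal illustration, not OLM's implementation: it works on plain dicts shaped like Kubernetes Pod JSON and treats a pod as stuck when its `metadata.deletionTimestamp` (the time by which Kubernetes expects the pod to be gone, grace period included) lies further in the past than a slack interval. The function names and the one-minute slack are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone


def parse_time(value: str) -> datetime:
    """Parse an RFC 3339 UTC timestamp such as 2024-06-27T11:50:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_stuck_terminating(pod: dict, now: datetime,
                         slack: timedelta = timedelta(minutes=1)) -> bool:
    """Return True if the pod should already be gone but still exists.

    The slack value is an arbitrary assumption for this sketch.
    """
    deletion = pod.get("metadata", {}).get("deletionTimestamp")
    if deletion is None:
        return False  # the pod is not being deleted at all
    return now > parse_time(deletion) + slack


def pods_to_force_delete(pods: list, now: datetime) -> list:
    """Return the names of pods that look stuck in the Terminating state."""
    return [p["metadata"]["name"] for p in pods
            if is_stuck_terminating(p, now)]


# Hypothetical sample data.
now = datetime(2024, 6, 27, 12, 0, tzinfo=timezone.utc)
stuck = {"metadata": {"name": "catalog-a",
                      "deletionTimestamp": "2024-06-27T11:50:00Z"}}
healthy = {"metadata": {"name": "catalog-b"}}
recent = {"metadata": {"name": "catalog-c",
                       "deletionTimestamp": "2024-06-27T11:59:50Z"}}
names = pods_to_force_delete([stuck, healthy, recent], now)
```

A controller following this idea would then delete the listed pods with a zero grace period so that a replacement can be scheduled on a healthy node.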

incorporates

RFE-1371 [RHOCP4.5] Catalog pod gets stuck at Terminating after node down

Closed

is cloned by

OCPBUGS-35305 [release-4.15] OLM catalog pods do not recover from node failure

Closed

is depended on by

OCPBUGS-35305 [release-4.15] OLM catalog pods do not recover from node failure

Closed

relates to

RFE-2737 Catalogsource pods should be backed by a built-in controller

Approved

split to

OCPBUGS-36661 OLM catalogsource pods do not recover from node failure when registryPoll is none

Closed

links to

[bugzilla 1878025]

operator-framework/operator-lifecycle-manager#3201

RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update

Synchronize From Upstream Repositories

Test case OCP-73201


Assignee: Joe Lanford

Reporter: Daniel Messer

Need Info From: None

Contributors: None

QA Contact: Jian Zhang

Doc Contact: None

Votes: 1

Watchers: 13

Created: 2021/03/24 10:09 AM

Updated: 2025/09/13 1:59 PM

Resolved: 2024/06/27 11:44 AM
