[OCPBUGS-29453] catalogd crash loops after etcd restore - Red Hat Issue Tracker

Type: Bug
Resolution: Done-Errata
Priority: Normal
Fix Version/s: 4.16.0
Affects Version/s: 4.15.0, 4.16.0
Component/s: OLM / Registry
Labels:
- triaged

Regression:
No
Sprint:
OPECO 249
sprint_count:
1
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:

None
Release Note Text:

* Previously, the catalogd component could crash loop after an etcd restore. This was due to the garbage collection process causing a looping failure state when the API server was unreachable. This bug fix updates catalogd to add a retry loop, and as a result catalogd no longer crashes in this scenario. (link:https://issues.redhat.com/browse/OCPBUGS-29453[*OCPBUGS-29453*])

Release Note Type:
Bug Fix
Release Note Status:
Done
Target Version:

4.16.0


Description of problem:

The etcd team has introduced an e2e test that exercises a full etcd backup and restore cycle in OCP [1].

We run those tests as part of our PR builds, and since 4.15 [3] (and also on 4.16 [2]) we have had failed runs with catalogd-controller-manager crash looping:

1 events happened too frequently
event [namespace/openshift-catalogd node/ip-10-0-25-29.us-west-2.compute.internal pod/catalogd-controller-manager-768bb57cdb-nwbhr hmsg/47b381d71b - Back-off restarting failed container manager in pod catalogd-controller-manager-768bb57cdb-nwbhr_openshift-catalogd(aa38d084-ecb7-4588-bd75-f95adb4f5636)] happened 44 times}


I assume that controller does not gracefully handle the etcd restore process, or the apiserver being down for some time.


[1] https://github.com/openshift/origin/blob/master/test/extended/dr/recovery.go#L97

[2] https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1205/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-etcd-recovery/1757443629380538368

[3] https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1191/pull-ci-openshift-cluster-etcd-operator-release-4.15-e2e-aws-etcd-recovery/1752293248543494144

Version-Release number of selected component (if applicable):

4.15 and later

How reproducible:

Always, by running the test.

Steps to Reproduce:

Run the test:

[sig-etcd][Feature:DisasterRecovery][Suite:openshift/etcd/recovery][Timeout:2h] [Feature:EtcdRecovery][Disruptive] Recover with snapshot with two unhealthy nodes and lost quorum [Serial]

and observe the event invariant failing because catalogd-controller-manager is crash looping.

Actual results:

catalogd-controller-manager crash loops and causes our CI jobs to fail

Expected results:

our e2e job is green again and catalogd-controller-manager doesn't crash loop

Additional info:

blocks:
OCPBUGS-29796 catalogd crash loops after etcd restore (Closed)

is cloned by:
OCPBUGS-29796 catalogd crash loops after etcd restore (Closed)

links to:
openshift/operator-framework-catalogd#42: OCPBUGS-29453: UPSTREAM: 231: make garbage collection a runnable
RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update

Assignee: Bryce Palmer

Reporter: Thomas Jungblut

QA Contact: Jia Fan

Votes: 0

Watchers: 8

Created: 2024/02/14 8:54 AM

Updated: 2024/06/27 11:37 AM

Resolved: 2024/06/27 11:37 AM
