[OCPBUGS-29796] catalogd crash loops after etcd restore - Red Hat Issue Tracker

Type: Bug
Resolution: Done-Errata
Priority: Normal
Fix Version/s: 4.15.z
Affects Version/s: 4.15.0, 4.16.0
Component/s: OLM / Registry
Labels:
- triaged

Regression:
No
Sprint:
OPECO 249, Phlogiston 250
sprint_count:
2
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:

Hide

None

Show
None
Target Version:

4.15.0

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

This is a clone of issue ~~OCPBUGS-29453~~. The following is the description of the original issue:
—
Description of problem:

The etcd team has introduced an e2e test that exercises a full etcd backup and restore cycle in OCP [1].

We run those tests as part of our PR builds and since 4.15 [2] (also 4.16 [3]), we have failed runs with the catalogd-controller-manager crash looping:

1 events happened too frequently
event [namespace/openshift-catalogd node/ip-10-0-25-29.us-west-2.compute.internal pod/catalogd-controller-manager-768bb57cdb-nwbhr hmsg/47b381d71b - Back-off restarting failed container manager in pod catalogd-controller-manager-768bb57cdb-nwbhr_openshift-catalogd(aa38d084-ecb7-4588-bd75-f95adb4f5636)] happened 44 times}


I assume something in that controller doesn't really deal gracefully with the restoration process of etcd, or the apiserver being down for some time.


[1] https://github.com/openshift/origin/blob/master/test/extended/dr/recovery.go#L97

[2] https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1205/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-etcd-recovery/1757443629380538368

[3] https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1191/pull-ci-openshift-cluster-etcd-operator-release-4.15-e2e-aws-etcd-recovery/1752293248543494144

Version-Release number of selected component (if applicable):

> 4.15

How reproducible:

always by running the test

Steps to Reproduce:

Run the test:

[sig-etcd][Feature:DisasterRecovery][Suite:openshift/etcd/recovery][Timeout:2h] [Feature:EtcdRecovery][Disruptive] Recover with snapshot with two unhealthy nodes and lost quorum [Serial]     

and observe the event invariant failing on it crash looping

Actual results:

catalogd-controller-manager crash loops and causes our CI jobs to fail

Expected results:

our e2e job is green again and catalogd-controller-manager doesn't crash loop

Additional info:

clones

OCPBUGS-29453 catalogd crash loops after etcd restore

Closed

is blocked by

OCPBUGS-29453 catalogd crash loops after etcd restore

Closed

links to

openshift/operator-framework-catalogd#44: [release-4.15] OCPBUGS-29796: UPSTREAM: 231: make garbage collection a runnable

RHSA-2024:1210 OpenShift Container Platform 4.15.z security update

Assignee:: Bryce Palmer

Reporter:: OpenShift Prow Bot

QA Contact:: Jia Fan

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Created:: 2024/02/21 10:03 PM

Updated:: 2024/04/15 6:20 PM

Resolved:: 2024/03/13 3:32 PM

Details

Description

Attachments

Issue Links

Easy Agile Planning Poker

Activity

People

Dates

Hide