- Bug
- Resolution: Done-Errata
- Normal
- 4.15.0, 4.16.0
- No
- OPECO 249
- 1
- Rejected
- False
- Bug Fix
- Done
Description of problem:
The etcd team has introduced an e2e test that exercises a full etcd backup and restore cycle in OCP [1]. We run this test as part of our PR builds, and since 4.15 [3] (and also on 4.16 [2]) runs have been failing because catalogd-controller-manager crash loops:

  1 events happened too frequently
  event [namespace/openshift-catalogd node/ip-10-0-25-29.us-west-2.compute.internal pod/catalogd-controller-manager-768bb57cdb-nwbhr hmsg/47b381d71b - Back-off restarting failed container manager in pod catalogd-controller-manager-768bb57cdb-nwbhr_openshift-catalogd(aa38d084-ecb7-4588-bd75-f95adb4f5636)] happened 44 times

I assume something in that controller does not gracefully handle the etcd restoration process, or the apiserver being down for some time.

[1] https://github.com/openshift/origin/blob/master/test/extended/dr/recovery.go#L97
[2] https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1205/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-etcd-recovery/1757443629380538368
[3] https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1191/pull-ci-openshift-cluster-etcd-operator-release-4.15-e2e-aws-etcd-recovery/1752293248543494144
Version-Release number of selected component (if applicable):
4.15 and later
How reproducible:
Always; running the test reproduces it.
Steps to Reproduce:
Run the test:
  [sig-etcd][Feature:DisasterRecovery][Suite:openshift/etcd/recovery][Timeout:2h] [Feature:EtcdRecovery][Disruptive] Recover with snapshot with two unhealthy nodes and lost quorum [Serial]
Then observe that the event invariant fails because catalogd-controller-manager is crash looping.
Actual results:
catalogd-controller-manager crash loops, which causes our CI jobs to fail.
Expected results:
catalogd-controller-manager does not crash loop after the etcd restore, and our e2e job is green again.
Additional info:
- blocks: OCPBUGS-29796 catalogd crash loops after etcd restore (Closed)
- is cloned by: OCPBUGS-29796 catalogd crash loops after etcd restore (Closed)
- links to: RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update