OpenShift Etcd / ETCD-672

CEO cluster status is not set properly after quorum restore


    • Type: Story
    • Resolution: Unresolved
    • Priority: Undefined
    • OCPSTRAT-539 - Enhance recovery procedure for full control plane failure
    • ETCD Sprint 260, ETCD Sprint 261

      After running the quorum-restore script, the operator should degrade because:

      • two out of three etcd pods are crash looping
    • a single remaining member cannot form a safe quorum (see the quorum note below)
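      (For reference: etcd only accepts writes with a majority of members healthy, so with three members quorum is floor(3/2) + 1 = 2; a single surviving member out of three cannot provide that on its own, which is why the restored control plane should be reported as Degraded until the other two members are rebuilt.)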
         

      In reality, these failures do show up as errors in many controllers, yet the operator does not go Degraded.
       
      There are many errors like this in the logs:
       

      I0917 13:27:21.233106       1 status_controller.go:218] clusteroperator/etcd diff {"status":{"conditions":[{"lastTransitionTime":"2024-09-17T13:16:03Z","message":"StaticPodsDegraded: pod/etcd-ci-ln-c5kxl2b-72292-s2pfb-master-0 container \"etcd\" is waiting: CrashLoopBackOff: back-off 20s restarting failed container=etcd pod=etcd-ci-ln-c5kxl2b-72292-s2pfb-master-0_openshift-etcd(75d87c5fa76162bcec308be46724eeac)\nNodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2024-09-17T13:19:57Z","message":"NodeInstallerProgressing: 3 nodes are at revision 13\nEtcdMembersProgressing: No unstarted etcd members found","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2024-09-17T12:21:04Z","message":"StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 13\nEtcdMembersAvailable: 1 members are available","reason":"AsExpected","status":"True","type":"...
      
      E0917 13:27:21.247115       1 base_controller.go:268] StatusSyncer_etcd *reconciliation failed*: Operation cannot be fulfilled on clusteroperators.config.openshift.io "etcd": the object has been modified; please apply your changes to the latest version and try again
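      For context, the second error is Kubernetes' standard optimistic-concurrency conflict: an update was submitted with a stale resourceVersion. The usual client-side answer is to re-read the object and retry, roughly as in the sketch below (a hypothetical, standalone example using client-go and openshift/client-go, not the operator's actual code; the helper name and the Reason/Message values are made up). The key point is that retrying only helps if the read actually returns a fresh object, which fits the stale-cache theory further down.

      package statushelpers

      import (
          "context"

          configv1 "github.com/openshift/api/config/v1"
          configclient "github.com/openshift/client-go/config/clientset/versioned"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/rest"
          "k8s.io/client-go/util/retry"
      )

      // setEtcdDegraded is a hypothetical helper that marks the etcd ClusterOperator
      // Degraded, using the standard refetch-and-retry pattern for update conflicts.
      func setEtcdDegraded(cfg *rest.Config, message string) error {
          client, err := configclient.NewForConfig(cfg)
          if err != nil {
              return err
          }
          return retry.RetryOnConflict(retry.DefaultRetry, func() error {
              // Re-read on every attempt so the update carries the latest
              // resourceVersion; a stale read-side cache defeats exactly this step.
              co, err := client.ConfigV1().ClusterOperators().Get(context.TODO(), "etcd", metav1.GetOptions{})
              if err != nil {
                  return err
              }
              degraded := configv1.ClusterOperatorStatusCondition{
                  Type:               configv1.OperatorDegraded,
                  Status:             configv1.ConditionTrue,
                  Reason:             "QuorumRestored",
                  Message:            message,
                  LastTransitionTime: metav1.Now(),
              }
              // Replace the existing Degraded condition or append a new one.
              replaced := false
              for i := range co.Status.Conditions {
                  if co.Status.Conditions[i].Type == degraded.Type {
                      co.Status.Conditions[i] = degraded
                      replaced = true
                  }
              }
              if !replaced {
                  co.Status.Conditions = append(co.Status.Conditions, degraded)
              }
              _, err = client.ConfigV1().ClusterOperators().UpdateStatus(context.TODO(), co, metav1.UpdateOptions{})
              return err
          })
      }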
      
      

       
      It seems there is a stale CRD object or watch cache; restarting the CEO by deleting its pod resolves the situation immediately.
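      For anyone who needs that workaround in the meantime, the restart amounts to something like the following (assuming the default openshift-etcd-operator namespace and the app=etcd-operator pod label; both are assumptions, adjust to your cluster):

      oc delete pod -n openshift-etcd-operator -l app=etcd-operator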
       
      AC: 

    • the operator should be able to report its status again in all cases (see the suggested check below)
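      One possible way to verify this after a quorum restore, without restarting the CEO, would be to wait for the operator to actually report the Degraded condition, e.g.:

      oc wait clusteroperator/etcd --for=condition=Degraded=True --timeout=5m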
         

            alray@redhat.com Allen Ray
            tjungblu@redhat.com Thomas Jungblut