OpenShift Etcd / ETCD-672

CEO cluster status is not set properly after quorum restore


    • Type: Story
    • Resolution: Unresolved
    • Priority: Undefined
    • OCPSTRAT-539 - Enhance recovery procedure for full control plane failure
    • ETCD Sprint 260, ETCD Sprint 261

      After running the quorum-restore script, the operator should degrade because:

      • two out of three etcd pods are crash looping
    • a single remaining member cannot form a safe quorum (see the quorum note below)
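      (For reference: etcd only accepts writes with a majority of members healthy, so with three members quorum is floor(3/2) + 1 = 2; a single surviving member out of three cannot provide that on its own, which is why the restored control plane should be reported as Degraded until the other two members are rebuilt.)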
         

      In reality, these failures do show up as errors in many controllers, yet the operator does not go Degraded.
       
      There are many errors like this in the logs:
       

      I0917 13:27:21.233106       1 status_controller.go:218] clusteroperator/etcd diff {"status":{"conditions":[{"lastTransitionTime":"2024-09-17T13:16:03Z","message":"StaticPodsDegraded: pod/etcd-ci-ln-c5kxl2b-72292-s2pfb-master-0 container \"etcd\" is waiting: CrashLoopBackOff: back-off 20s restarting failed container=etcd pod=etcd-ci-ln-c5kxl2b-72292-s2pfb-master-0_openshift-etcd(75d87c5fa76162bcec308be46724eeac)\nNodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2024-09-17T13:19:57Z","message":"NodeInstallerProgressing: 3 nodes are at revision 13\nEtcdMembersProgressing: No unstarted etcd members found","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2024-09-17T12:21:04Z","message":"StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 13\nEtcdMembersAvailable: 1 members are available","reason":"AsExpected","status":"True","type":"...
      
      E0917 13:27:21.247115       1 base_controller.go:268] StatusSyncer_etcd *reconciliation failed*: Operation cannot be fulfilled on clusteroperators.config.openshift.io "etcd": the object has been modified; please apply your changes to the latest version and try again
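      For context, the second error is Kubernetes' standard optimistic-concurrency conflict: an update was submitted with a stale resourceVersion. The usual client-side answer is to re-read the object and retry, roughly as in the sketch below (a hypothetical, standalone example using client-go and openshift/client-go, not the operator's actual code; the helper name and the Reason/Message values are made up). The key point is that retrying only helps if the read actually returns a fresh object, which fits the stale-cache theory further down.

      package statushelpers

      import (
          "context"

          configv1 "github.com/openshift/api/config/v1"
          configclient "github.com/openshift/client-go/config/clientset/versioned"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/rest"
          "k8s.io/client-go/util/retry"
      )

      // setEtcdDegraded is a hypothetical helper that marks the etcd ClusterOperator
      // Degraded, using the standard refetch-and-retry pattern for update conflicts.
      func setEtcdDegraded(cfg *rest.Config, message string) error {
          client, err := configclient.NewForConfig(cfg)
          if err != nil {
              return err
          }
          return retry.RetryOnConflict(retry.DefaultRetry, func() error {
              // Re-read on every attempt so the update carries the latest
              // resourceVersion; a stale read-side cache defeats exactly this step.
              co, err := client.ConfigV1().ClusterOperators().Get(context.TODO(), "etcd", metav1.GetOptions{})
              if err != nil {
                  return err
              }
              degraded := configv1.ClusterOperatorStatusCondition{
                  Type:               configv1.OperatorDegraded,
                  Status:             configv1.ConditionTrue,
                  Reason:             "QuorumRestored",
                  Message:            message,
                  LastTransitionTime: metav1.Now(),
              }
              // Replace the existing Degraded condition or append a new one.
              replaced := false
              for i := range co.Status.Conditions {
                  if co.Status.Conditions[i].Type == degraded.Type {
                      co.Status.Conditions[i] = degraded
                      replaced = true
                  }
              }
              if !replaced {
                  co.Status.Conditions = append(co.Status.Conditions, degraded)
              }
              _, err = client.ConfigV1().ClusterOperators().UpdateStatus(context.TODO(), co, metav1.UpdateOptions{})
              return err
          })
      }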
      
      

       
      It seems there is a stale CRD object or watch cache; restarting the CEO by deleting its pod resolves the situation immediately.
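      For anyone who needs that workaround in the meantime, the restart amounts to something like the following (assuming the default openshift-etcd-operator namespace and the app=etcd-operator pod label; both are assumptions, adjust to your cluster):

      oc delete pod -n openshift-etcd-operator -l app=etcd-operator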
       
      AC: 

    • the operator should be able to report its status again in all cases (see the suggested check below)
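      One possible way to verify this after a quorum restore, without restarting the CEO, would be to wait for the operator to actually report the Degraded condition, e.g.:

      oc wait clusteroperator/etcd --for=condition=Degraded=True --timeout=5m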
         

            alray@redhat.com Allen Ray
            tjungblu@redhat.com Thomas Jungblut