OpenShift Bugs / OCPBUGS-31849

cert signer controller race condition with quorum checker


    • Bug
    • Resolution: Duplicate
    • Normal
    • 4.16.0
    • 4.13, 4.12, 4.14, 4.15, 4.16
    • Etcd
    • None
    • No
    • Rejected
    • False
    • In very rare conditions, the cluster-etcd-operator may break quorum by rolling out a revision that removes certificates during a node removal operation.
    • Bug Fix
    • In Progress

      This has been reported by lance5890 upstream: https://github.com/openshift/cluster-etcd-operator/issues/1237

      Description of problem:

      During removal of a master node (one of three), the etcd cert signer controller may still roll out a revision, even though the rollout is certain to break quorum.
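
      To spell out why the rollout breaks quorum here, a minimal sketch of the majority arithmetic follows. It assumes the usual etcd majority rule (quorum = members/2 + 1) and that a revision rollout restarts one healthy member at a time; the function names are hypothetical and are not taken from the operator's code.

```go
package main

import "fmt"

// quorumSize returns the minimum number of healthy voting members an etcd
// cluster of the given size needs in order to keep quorum (simple majority).
func quorumSize(members int) int {
	return members/2 + 1
}

// survivesRestart reports whether quorum is kept while one of the currently
// healthy members is restarted. Hypothetical helper, for illustration only.
func survivesRestart(members, healthy int) bool {
	return healthy-1 >= quorumSize(members)
}

func main() {
	fmt.Println(quorumSize(3))         // 2
	fmt.Println(survivesRestart(3, 3)) // true: 2 of 3 stay up
	fmt.Println(survivesRestart(3, 2)) // false: only 1 of 3 stays up
}
```

      With three members and one node already removed, only two are healthy, so restarting either of them for a new revision leaves a single member, which is below the quorum of two.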
      
      Important events:
      
      08:06:26.674067       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MasterNodeRemoved' Observed removal of master node node3
      
      08:06:26.909780       1 base_controller.go:272] EtcdEndpointsController reconciliation failed: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
      
      08:06:27.005308       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'SecretUpdated' Updated Secret/etcd-all-certs -n openshift-etcd because it changed
      
      08:06:27.149860       1 base_controller.go:272] EtcdCertSignerController reconciliation failed: EtcdCertSignerController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available

      Version-Release number of selected component (if applicable):

      All versions that include the quorum guard (> 4.12 currently applicable).

      How reproducible:

      Depends on the timing of the node removal relative to the controller runs, but somewhat frequent.

      Steps to Reproduce:

          1. Remove a master node.
          2. Wait for quorum loss / downtime caused by the revision rollout.
          

      Actual results:

      Quorum is lost and there is brief API downtime while the revision is rolled out.

      Expected results:

      The revisioned secret should not be updated when quorum is about to be lost.
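
      The expected behavior can be sketched as a check-before-write ordering. This is only an illustration under stated assumptions: it reads the events above as the secret being written before the quorum check reports an error, and every type and function name (fakeCluster, syncRacy, syncGuarded) is hypothetical rather than the operator's actual code.

```go
package main

import "fmt"

// fakeCluster is a hypothetical stand-in for the cluster state a controller
// sync would observe; it is not a type from cluster-etcd-operator.
type fakeCluster struct {
	required, available int
	secretWrites        int // counts updates to the revisioned secret
}

// checkSafeToScale fails when fewer nodes are available than required.
func (c *fakeCluster) checkSafeToScale() error {
	if c.available < c.required {
		return fmt.Errorf("%d nodes are required, but only %d are available",
			c.required, c.available)
	}
	return nil
}

func (c *fakeCluster) writeRevisionedSecret() { c.secretWrites++ }

// syncRacy writes first and checks afterwards, so the secret changes (and a
// revision can roll out) even though the check then reports an error.
func syncRacy(c *fakeCluster) error {
	c.writeRevisionedSecret()
	return c.checkSafeToScale()
}

// syncGuarded checks first and returns before any write when it is unsafe.
func syncGuarded(c *fakeCluster) error {
	if err := c.checkSafeToScale(); err != nil {
		return fmt.Errorf("skipping secret update: %w", err)
	}
	c.writeRevisionedSecret()
	return nil
}

func main() {
	racy := &fakeCluster{required: 3, available: 2}
	fmt.Println(syncRacy(racy), racy.secretWrites)

	guarded := &fakeCluster{required: 3, available: 2}
	fmt.Println(syncGuarded(guarded), guarded.secretWrites)
}
```

      In the guarded variant the sync still fails with the same underlying error, but the secret is left untouched, so no new revision is triggered while a node is missing.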

      Additional info:

          

            dwest@redhat.com Dean West
            tjungblu@redhat.com Thomas Jungblut
            Ge Liu
