OpenShift Bugs / OCPBUGS-31849

cert signer controller race condition with quorum checker


Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • 4.13, 4.12, 4.14, 4.15, 4.16
    • Etcd
    • None
    • No
    • Rejected
    • False

    Description

      This has been reported by lance5890 upstream: https://github.com/openshift/cluster-etcd-operator/issues/1237

      Description of problem:

      When one of the three master nodes is removed, the etcd cert signer controller may still roll out a new revision, even though doing so is clearly going to break quorum.
      
      Important events:
      
      08:06:26.674067       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MasterNodeRemoved' Observed removal of master node node3
      
      08:06:26.909780       1 base_controller.go:272] EtcdEndpointsController reconciliation failed: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
      
      08:06:27.005308       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'SecretUpdated' Updated Secret/etcd-all-certs -n openshift-etcd because it changed
      
      08:06:27.149860       1 base_controller.go:272] EtcdCertSignerController reconciliation failed: EtcdCertSignerController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
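
To make the quorum arithmetic behind these events concrete, here is a minimal sketch in Go. It assumes the usual etcd majority rule (quorum = floor(n/2) + 1) and assumes that a revision rollout restarts one healthy member at a time. The function names `quorumSize` and `survivesRestart` are hypothetical and are not taken from cluster-etcd-operator.

```go
package main

import "fmt"

// quorumSize returns how many voting members must be available for a
// cluster of the given size to make progress (a strict majority).
func quorumSize(members int) int {
	return members/2 + 1
}

// survivesRestart reports whether quorum still holds if one of the
// currently healthy members is restarted. Assumption: a revision rollout
// takes down exactly one healthy member at a time.
func survivesRestart(members, healthy int) bool {
	return healthy-1 >= quorumSize(members)
}

func main() {
	fmt.Println(quorumSize(3))         // majority needed in a 3-member cluster
	fmt.Println(survivesRestart(3, 3)) // all three masters healthy
	fmt.Println(survivesRestart(3, 2)) // one master already removed
}
```

Under these assumptions, a three-member cluster with only two healthy members cannot absorb the restart of another member, which matches the "3 nodes are required, but only 2 are available" message above.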

      Version-Release number of selected component (if applicable):

      All versions in which the quorum guard was introduced (> 4.12 currently applicable).

      How reproducible:

      Depends on the timing of the node removal relative to the controller runs, but it occurs fairly frequently.

      Steps to Reproduce:

          1. Remove a master node.
          2. Wait for quorum loss / downtime caused by the revision rollout.
          

      Actual results:

      Quorum is lost and there is a brief API downtime while the revision is rolled out.

      Expected results:

      The revisioned secret should not be updated when quorum is about to be lost.
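
One way to read this expectation is that the quorum check has to gate the secret write in the same code path, so that the write cannot happen once the check has failed. The following Go sketch illustrates only that ordering. It is not the operator's actual implementation, and `quorumChecker`, `secretWriter`, `syncCerts`, and `errQuorumUnsafe` are hypothetical names introduced for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// quorumChecker is a hypothetical stand-in for the operator's quorum guard.
type quorumChecker func() error

// secretWriter is a hypothetical stand-in for whatever applies the
// revisioned secret.
type secretWriter func() error

// errQuorumUnsafe mimics the message from the events above; it is an
// illustrative value, not an error defined by the operator.
var errQuorumUnsafe = errors.New("3 nodes are required, but only 2 are available")

// syncCerts runs the quorum check first and calls write only if it passes.
func syncCerts(check quorumChecker, write secretWriter) error {
	if err := check(); err != nil {
		return fmt.Errorf("skipping secret update, quorum is not safe: %w", err)
	}
	return write()
}

func main() {
	written := false
	err := syncCerts(
		func() error { return errQuorumUnsafe },
		func() error { written = true; return nil },
	)
	fmt.Println("error returned:", err != nil)
	fmt.Println("secret written:", written)
}
```

A check placed before the write does not by itself rule out a node being removed between the check and the write, so this sketch narrows the window rather than closing it.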

      Additional info:

          

          People

            dwest@redhat.com Dean West
            tjungblu@redhat.com Thomas Jungblut
            ge liu