OpenShift Bugs / OCPBUGS-66334

etcd loses quorum during upgrades with MCO control plane customization


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: 4.18, 4.19, 4.20, 4.21
    • Component/s: Etcd

      Description of problem:

      In ARO Classic, an upgrade from 4.18.22 to 4.19.x caused prolonged API/etcd downtime.
      
      The sequence of the upgrade went like this:
      1. The upgrade starts.
      2. etcd on master-0 upgrades fine and comes back running.
      3. In the meantime, MCO detects a change that requires a node reboot (e.g. kernel args) and drains master-0.
      4. Due to the drain, kubelet shuts down etcd and the other containers on master-0.
      5. During that drain, CEO installs the new revision on master-1 and kubelet restarts that container.
      
      This results in quorum loss: etcd on master-2 is the only remaining member, with no other etcd running.
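      
      To make the quorum-loss step concrete: etcd needs floor(n/2)+1 voting members up to accept writes, so a 3-member control plane tolerates only one member being down at a time. A minimal sketch of that arithmetic (the function names are illustrative, not from any OpenShift codebase):
      
      package quorum
      
      // quorumSize returns how many voting members must be up for an etcd
      // cluster of n members to keep accepting writes: floor(n/2) + 1.
      func quorumSize(n int) int {
      	return n/2 + 1
      }
      
      // hasQuorum reports whether the cluster can still make progress.
      func hasQuorum(total, running int) bool {
      	return running >= quorumSize(total)
      }
      
      // In the sequence above, etcd is down on master-0 (drained) and on
      // master-1 (revision restart), leaving 1 of 3 members running:
      //   hasQuorum(3, 1) == false  -> quorum lost, API/etcd downtime
      //   hasQuorum(3, 2) == true   -> the state the guard pods are meant to preserve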
      
          

      Version-Release number of selected component (if applicable):

      The customer hit this on 4.18, but the relevant changes in cluster-etcd-operator were introduced in 4.11, so all currently supported versions are potentially impacted.

      How reproducible:

      Not always; so far only in ARO, because the machine config changes when going from 4.18 to 4.19. See the first comment below.

      Steps to Reproduce:

          1. Create an ARO Classic cluster with 4.18.22.
          2. Trigger an upgrade to 4.19.15 or later.
      
      Alternatively, with OCP only:
          1. Create a new cluster and trigger an upgrade.
          2. After the first etcd rollout finishes, apply a machine config (e.g. with kernel arguments) to the master config pool, as sketched below.
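      
      For reference, a minimal sketch of such a MachineConfig expressed with the Go types from openshift/api (the object name and the specific kernel argument are illustrative assumptions, not taken from this report):
      
      package repro
      
      import (
      	mcfgv1 "github.com/openshift/api/machineconfiguration/v1"
      	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      )
      
      // masterKernelArgConfig builds a MachineConfig targeting the "master" pool
      // that adds a kernel argument. Any such change makes the MCO drain and
      // reboot the masters one by one, overlapping with the CEO revision rollout.
      func masterKernelArgConfig() *mcfgv1.MachineConfig {
      	return &mcfgv1.MachineConfig{
      		TypeMeta: metav1.TypeMeta{
      			APIVersion: "machineconfiguration.openshift.io/v1",
      			Kind:       "MachineConfig",
      		},
      		ObjectMeta: metav1.ObjectMeta{
      			// Hypothetical name, chosen for this sketch.
      			Name: "99-master-example-kargs",
      			Labels: map[string]string{
      				"machineconfiguration.openshift.io/role": "master",
      			},
      		},
      		Spec: mcfgv1.MachineConfigSpec{
      			// Illustrative kernel argument; any change here triggers a reboot.
      			KernelArguments: []string{"loglevel=7"},
      		},
      	}
      }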
      
      

      Actual results:

      etcd loses quorum and causes API/etcd downtime during the upgrade

      Expected results:

      etcd should keep quorum and the API server should remain responsive

      Additional info:

      We assume this is related to a fairly old change in the library-go quorum guard controller:
      
      if operatorVersion != expectedOperatorVersion {
      	klog.V(2).Infof("clusterOperator/etcd's operator version (%s) and expected operator version (%s) do not match. Will not create guard pods until operator reaches desired version.", operatorVersion, expectedOperatorVersion)
      	return false, true, nil
      }
      
      https://github.com/openshift/cluster-etcd-operator/blame/main/pkg/operator/starter.go#L324-L327
      
      Returning false deletes all guard pods during an upgrade, which means the Pod Disruption Budget cannot be leveraged during that period.
      
      https://github.com/openshift/library-go/blob/release-4.18/pkg/operator/staticpod/controller/guard/guard_controller.go#L190-L212
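      
      For illustration, the protection that goes away with the guard pods is the PodDisruptionBudget over them: a drain cannot evict a guard pod if that would drop the ready guards below minAvailable, so the drain stalls while another etcd member is down. A minimal sketch of such a PDB using the policy/v1 Go types (the name, namespace, and label selector are assumptions for this sketch, not the exact objects the operator creates):
      
      package guardpdb
      
      import (
      	policyv1 "k8s.io/api/policy/v1"
      	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      	"k8s.io/apimachinery/pkg/util/intstr"
      )
      
      // etcdGuardPDB sketches a PDB that keeps at least 2 of 3 guard pods
      // available on a 3-member control plane. With the guard pods deleted
      // during the upgrade, this budget has nothing to count, so nothing
      // holds back the drain on master-0 while master-1's etcd is restarting.
      func etcdGuardPDB() *policyv1.PodDisruptionBudget {
      	minAvailable := intstr.FromInt32(2)
      	return &policyv1.PodDisruptionBudget{
      		ObjectMeta: metav1.ObjectMeta{
      			// Hypothetical name/namespace for this sketch.
      			Name:      "etcd-guard-pdb",
      			Namespace: "openshift-etcd",
      		},
      		Spec: policyv1.PodDisruptionBudgetSpec{
      			MinAvailable: &minAvailable,
      			Selector: &metav1.LabelSelector{
      				// Hypothetical label; the real guard pods carry their own labels.
      				MatchLabels: map[string]string{"app": "guard"},
      			},
      		},
      	}
      }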
      
      

              Dean West (dwest@redhat.com), Thomas Jungblut (tjungblu@redhat.com), Ge Liu