Bug
Resolution: Unresolved
Priority: Major
Affects Versions: 4.18, 4.19, 4.20, 4.21
Description of problem:
In ARO Classic we had an upgrade from 4.18.22 to 4.19.x that caused long API/etcd downtime.
The upgrade sequence went like this:
1. upgrade starts
2. etcd on master-0 upgrades fine and comes back running
3. in the meantime, the MCO detects a change that requires a node reboot (e.g. kernel args) and drains master-0
4. the kubelet shuts down etcd and the other containers on master-0 due to the drain
5. during that drain, the CEO installs the new revision on master-1 and the kubelet restarts the etcd container
This results in quorum loss: etcd on master-2 is the only member left, with no other etcd running (see the sketch below).
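For a three-node control plane the arithmetic behind the quorum loss is simple; the following sketch is purely illustrative (not taken from any operator code) and just spells it out:

package main

import "fmt"

// quorum returns the minimum number of voting members an etcd cluster of the
// given size needs in order to stay available: floor(n/2) + 1.
func quorum(members int) int {
	return members/2 + 1
}

func main() {
	members := 3 // master-0, master-1, master-2

	// Step 4 stops etcd on master-0 (drain), step 5 restarts etcd on master-1
	// (new revision), so for a short window only master-2 is serving.
	up := 1

	fmt.Printf("need %d of %d members, %d up -> quorum lost: %v\n",
		quorum(members), members, up, up < quorum(members))
}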
Version-Release number of selected component (if applicable):
The customer hit this on 4.18, but the relevant change in cluster-etcd-operator was introduced in 4.11, so all currently supported versions are potentially impacted.
How reproducible:
Not always; so far only in ARO, because the machine config changes when going from 4.18 to 4.19. See the first comment below.
Steps to Reproduce:
1. create an ARO Classic cluster on 4.18.22
2. trigger an upgrade to 4.19.15 or later
Alternatively, with OCP only:
1. create a new cluster and trigger an upgrade
2. after the first etcd rollout finishes, apply a MachineConfig (e.g. one that sets kernel arguments) to the master config pool; see the example below
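For reference, a MachineConfig of roughly this shape is enough to make the MCO reboot, and therefore drain, each master; the name and kernel argument below are illustrative, not the exact config ARO applies:

package main

import "fmt"

// exampleMachineConfig is an illustrative MachineConfig that adds a kernel
// argument to the master pool; applying it makes the MCO schedule a reboot,
// and therefore a drain, of every master node.
const exampleMachineConfig = `apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-master-example-kernel-arg
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  kernelArguments:
    - loglevel=7
`

func main() {
	// Save the manifest and apply it with `oc apply -f <file>` after the first
	// etcd rollout of the upgrade has finished.
	fmt.Print(exampleMachineConfig)
}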
Actual results:
etcd loses quorum and causes downtime during an upgrade
Expected results:
etcd should keep quorum and the apiserver should remain responsive
Additional info:
We assume this is related to a fairly old change in the library-go quorum guard controller:
if operatorVersion != expectedOperatorVersion {
	klog.V(2).Infof("clusterOperator/etcd's operator version (%s) and expected operator version (%s) do not match. Will not create guard pods until operator reaches desired version.", operatorVersion, expectedOperatorVersion)
	return false, true, nil
}
https://github.com/openshift/cluster-etcd-operator/blame/main/pkg/operator/starter.go#L324-L327
Returning false deletes all guard pods for the duration of the upgrade, which means we cannot leverage the Pod Disruption Budget during that period (see the sketch below).
https://github.com/openshift/library-go/blob/release-4.18/pkg/operator/staticpod/controller/guard/guard_controller.go#L190-L212
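To make the consequence concrete, here is a minimal sketch (our own illustration, not the actual library-go code) of how that (false, true, nil) result during the version-skew window maps to the loss of PDB protection:

package main

import "fmt"

// shouldCreateGuards mirrors the shape of the check quoted above: it returns
// (create, precheckSucceeded, err). While the upgrade is in progress the
// reported operator version lags the expected one, so create is false.
func shouldCreateGuards(operatorVersion, expectedOperatorVersion string) (bool, bool, error) {
	if operatorVersion != expectedOperatorVersion {
		return false, true, nil
	}
	return true, true, nil
}

func main() {
	// During the 4.18 -> 4.19 rollout the two versions differ for the whole
	// upgrade window, not just for a single node.
	create, _, _ := shouldCreateGuards("4.18.22", "4.19.15")
	if !create {
		// In the real controller this is where the existing guard pods get deleted,
		// leaving the etcd quorum guard PDB with nothing to protect, so it can no
		// longer block the MCO-initiated drain of a master.
		fmt.Println("guard pods removed: the PDB cannot prevent the master drain")
	}
}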