OpenShift Bugs / OCPBUGS-66334

etcd loses quorum during upgrades with MCO control plane customization


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: 4.18, 4.19, 4.20, 4.21
    • Component/s: Etcd

      Description of problem:

      In ARO Classic, an upgrade from 4.18.22 to 4.19.x caused prolonged API/etcd downtime.
      
      The sequence of the upgrade went like this:
      1. The upgrade starts.
      2. etcd on master-0 upgrades fine and comes back running.
      3. In the meantime, MCO detects a change that requires a node reboot (e.g. kernel args) and drains master-0.
      4. Due to the drain, kubelet shuts down etcd and the other containers on master-0.
      5. During that drain, CEO installs the new revision on master-1 and kubelet restarts that container.
      
      This results in quorum loss: etcd on master-2 is the only remaining member, with no other etcd running.
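      
      To make the quorum-loss step concrete: etcd needs floor(n/2)+1 voting members up to accept writes, so a 3-member control plane tolerates only one member being down at a time. A minimal sketch of that arithmetic (the function names are illustrative, not from any OpenShift codebase):
      
      package quorum
      
      // quorumSize returns how many voting members must be up for an etcd
      // cluster of n members to keep accepting writes: floor(n/2) + 1.
      func quorumSize(n int) int {
      	return n/2 + 1
      }
      
      // hasQuorum reports whether the cluster can still make progress.
      func hasQuorum(total, running int) bool {
      	return running >= quorumSize(total)
      }
      
      // In the sequence above, etcd is down on master-0 (drained) and on
      // master-1 (revision restart), leaving 1 of 3 members running:
      //   hasQuorum(3, 1) == false  -> quorum lost, API/etcd downtime
      //   hasQuorum(3, 2) == true   -> the state the guard pods are meant to preserve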
      
          

      Version-Release number of selected component (if applicable):

      The customer hit this on 4.18, but the relevant changes in cluster-etcd-operator were introduced in 4.11, so all currently supported versions are potentially impacted.

      How reproducible:

      Not always; so far only in ARO, because the machine config changes when going from 4.18 to 4.19. See the first comment below.

      Steps to Reproduce:

          1. Create an ARO Classic cluster with 4.18.22.
          2. Trigger an upgrade to 4.19.15 or later.
      
      Alternatively, with OCP only:
          1. Create a new cluster and trigger an upgrade.
          2. After the first etcd rollout finishes, apply a machine config (e.g. with kernel arguments) to the master config pool, as sketched below.
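      
      For reference, a minimal sketch of such a MachineConfig expressed with the Go types from openshift/api (the object name and the specific kernel argument are illustrative assumptions, not taken from this report):
      
      package repro
      
      import (
      	mcfgv1 "github.com/openshift/api/machineconfiguration/v1"
      	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      )
      
      // masterKernelArgConfig builds a MachineConfig targeting the "master" pool
      // that adds a kernel argument. Any such change makes the MCO drain and
      // reboot the masters one by one, overlapping with the CEO revision rollout.
      func masterKernelArgConfig() *mcfgv1.MachineConfig {
      	return &mcfgv1.MachineConfig{
      		TypeMeta: metav1.TypeMeta{
      			APIVersion: "machineconfiguration.openshift.io/v1",
      			Kind:       "MachineConfig",
      		},
      		ObjectMeta: metav1.ObjectMeta{
      			// Hypothetical name, chosen for this sketch.
      			Name: "99-master-example-kargs",
      			Labels: map[string]string{
      				"machineconfiguration.openshift.io/role": "master",
      			},
      		},
      		Spec: mcfgv1.MachineConfigSpec{
      			// Illustrative kernel argument; any change here triggers a reboot.
      			KernelArguments: []string{"loglevel=7"},
      		},
      	}
      }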
      
      

      Actual results:

      etcd loses quorum and causes API/etcd downtime during the upgrade

      Expected results:

      etcd should keep quorum and the API server should remain responsive

      Additional info:

      We assume this is related to a fairly old change in the library-go quorum guard controller:
      
      if operatorVersion != expectedOperatorVersion {
      	klog.V(2).Infof("clusterOperator/etcd's operator version (%s) and expected operator version (%s) do not match. Will not create guard pods until operator reaches desired version.", operatorVersion, expectedOperatorVersion)
      	return false, true, nil
      }
      
      https://github.com/openshift/cluster-etcd-operator/blame/main/pkg/operator/starter.go#L324-L327
      
      Returning false deletes all guard pods during an upgrade, which means the Pod Disruption Budget cannot be leveraged during that period.
      
      https://github.com/openshift/library-go/blob/release-4.18/pkg/operator/staticpod/controller/guard/guard_controller.go#L190-L212
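      
      For illustration, the protection that goes away with the guard pods is the PodDisruptionBudget over them: a drain cannot evict a guard pod if that would drop the ready guards below minAvailable, so the drain stalls while another etcd member is down. A minimal sketch of such a PDB using the policy/v1 Go types (the name, namespace, and label selector are assumptions for this sketch, not the exact objects the operator creates):
      
      package guardpdb
      
      import (
      	policyv1 "k8s.io/api/policy/v1"
      	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      	"k8s.io/apimachinery/pkg/util/intstr"
      )
      
      // etcdGuardPDB sketches a PDB that keeps at least 2 of 3 guard pods
      // available on a 3-member control plane. With the guard pods deleted
      // during the upgrade, this budget has nothing to count, so nothing
      // holds back the drain on master-0 while master-1's etcd is restarting.
      func etcdGuardPDB() *policyv1.PodDisruptionBudget {
      	minAvailable := intstr.FromInt32(2)
      	return &policyv1.PodDisruptionBudget{
      		ObjectMeta: metav1.ObjectMeta{
      			// Hypothetical name/namespace for this sketch.
      			Name:      "etcd-guard-pdb",
      			Namespace: "openshift-etcd",
      		},
      		Spec: policyv1.PodDisruptionBudgetSpec{
      			MinAvailable: &minAvailable,
      			Selector: &metav1.LabelSelector{
      				// Hypothetical label; the real guard pods carry their own labels.
      				MatchLabels: map[string]string{"app": "guard"},
      			},
      		},
      	}
      }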
      
      

              Dean West (dwest@redhat.com), Thomas Jungblut (tjungblu@redhat.com), Ge Liu