Description of problem:
N.B. I have been testing this in various combinations; the observations below may not be entirely representative of what is happening.

The etcd operator's behaviour changed between 4.17 and 4.18 so that it now allows removal of machines pending deletion earlier in the process. I believe this causes the cluster to degrade during an OnDelete rollout when multiple control plane machines are replaced simultaneously. This is a regression we should aim to fix, as the results are pretty bad.

In 4.17, if multiple machines were being replaced (someone had run `oc delete` on all of the machines at once), the etcd operator would wait for all revisions to roll out and for the cluster to stabilise before removing any lifecycle hooks from the machines. We would then see a "big bang" where all three machines had their lifecycle hooks removed together, after which the Machine API removed the three machines. At no point during this process did we see any API errors.

In 4.18, the etcd operator appears to remove the pre-drain hook while the cluster is still rolling out revisions. The first machine has its pre-drain hook removed and starts to be drained. At this point its etcd member has already been removed from the cluster and is unhealthy, but I believe the revision rollouts still include it, which impacts etcd cluster health. From there the cluster deteriorates: the KAS installer pod gets stuck in Pending, not all etcd pods are created, failing KAS pods (tied to the etcd member that was removed from the cluster) block removal of the machines because of PDBs from KASO, and then the cluster's API starts timing out requests (quorum loss?).

If I go to the cloud provider and terminate the three old masters, the cluster does recover correctly, but this should not be necessary and the cluster should never have served those failing requests.
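For reference, a minimal diagnostic sketch for watching the race between hook removal and the revision rollout. It assumes the standard openshift-machine-api/openshift-etcd namespaces, the usual `master` role label on the control plane machines, and that `jq` is available; nothing here is specific to this bug.

```sh
# List control plane machines with their deletion timestamps and remaining
# pre-drain lifecycle hooks (the etcd operator's hook should stay in place
# until the member has been safely removed).
oc get machines.machine.openshift.io -n openshift-machine-api \
  -l machine.openshift.io/cluster-api-machine-role=master -o json \
  | jq -r '.items[] | [.metadata.name, (.metadata.deletionTimestamp // "-"),
      ((.spec.lifecycleHooks.preDrain // []) | map(.name) | join(","))] | @tsv'

# Watch the static pod revision rollout driven by the etcd operator.
oc get etcd cluster -o json \
  | jq '.status.nodeStatuses[] | {nodeName, currentRevision, targetRevision}'

# Check etcd member health from inside one of the surviving etcd pods.
oc rsh -n openshift-etcd -c etcdctl \
  "$(oc get pods -n openshift-etcd -l app=etcd -o name | head -n 1)" \
  etcdctl endpoint health --cluster
```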
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Start a 4.18 cluster
2. Use `oc edit` to change `.spec.strategy.type` on the controlplanemachineset `cluster` to `OnDelete`
3. Use `oc delete` to delete all three control plane machines (see the example commands after this list)
4. Observe the cluster
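For completeness, a sketch of the commands for steps 2 and 3. The `oc patch` is equivalent to the `oc edit` in step 2, and the label selector in step 3 is an assumption; listing the three machine names explicitly works just as well.

```sh
# Step 2: switch the control plane machine set strategy to OnDelete.
oc patch controlplanemachineset cluster -n openshift-machine-api \
  --type merge -p '{"spec":{"strategy":{"type":"OnDelete"}}}'

# Step 3: delete all three control plane machines at once. --wait=false
# returns immediately; the machines stay around until their lifecycle
# hooks are removed.
oc delete machines.machine.openshift.io -n openshift-machine-api \
  -l machine.openshift.io/cluster-api-machine-role=master --wait=false
```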
Actual results:
The cluster ends up in a degraded state
Expected results:
The cluster should never become degraded and the rollout should complete without manual intervention
Additional info: