Type: Bug
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.19
Component/s: Cloud Compute / ControlPlaneMachineSet
Labels:
- ServiceDeliveryImpact
- pmr-ai

Activity Type:
None
Blocked:
False
Blocked Reason:
None
Story Points:
None
Severity:
None
Regression:
None
Architecture:

x86_64

Target Backport Versions:
None
Target Version:
None
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem
A compliance controller requested deletion of all three control plane machines of an OCP cluster within roughly one hour. The deletions themselves were expected, because the machines were too old per compliance rules. Etcd quorum was lost during this window because the "old" machines were drained and deleted before their replacement machines had fully joined the etcd cluster.

Version-Release number of selected component (if applicable):
OCP v4.19.17

How reproducible
Unclear

Steps to Reproduce

1. Run oc delete machine -n openshift-machine-api master-0
2. Wait 15 minutes
3. Run oc delete machine -n openshift-machine-api master-1
4. Wait 15 minutes
5. Run oc delete machine -n openshift-machine-api master-2
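
The steps above can be sketched as a small script. This is only an illustration of the sequence, not something the reporter ran: the function names and the dry-run switch are hypothetical, and master-0 through master-2 are the names used in the steps, so real Machine names will most likely differ. By default it only prints the commands.

```python
import subprocess
import time

NAMESPACE = "openshift-machine-api"
# Names as written in the steps above; real Machine names will likely differ.
MACHINES = ["master-0", "master-1", "master-2"]
WAIT_SECONDS = 15 * 60  # "Wait 15 minutes" between deletions


def delete_command(machine: str) -> list[str]:
    """Build the oc invocation for deleting one Machine object."""
    return ["oc", "delete", "machine", "-n", NAMESPACE, machine]


def reproduce(dry_run: bool = True, sleep=time.sleep) -> list[list[str]]:
    """Issue the three deletions in order, pausing between them.

    With dry_run=True (the default) nothing is executed; the commands
    and pauses are only printed. Returns the commands in issue order.
    """
    issued = []
    for index, machine in enumerate(MACHINES):
        cmd = delete_command(machine)
        issued.append(cmd)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
        # No pause is needed after the last deletion.
        if index < len(MACHINES) - 1:
            if dry_run:
                print(f"(would wait {WAIT_SECONDS} seconds)")
            else:
                sleep(WAIT_SECONDS)
    return issued


if __name__ == "__main__":
    reproduce(dry_run=True)
```

Calling reproduce(dry_run=False) would run the deletions for real and should only be done against a disposable cluster.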

Actual results
Control plane machines are drained and shut down too soon, leaving the cluster with only one or two healthy etcd members until the replacement node has fully joined the etcd cluster. The etcdNoLeader and etcdInsufficientMembers alerts fire intermittently.

Expected results
A deleted control plane machine is not drained until its replacement machine has fully provisioned, joined the cluster as a node, and joined the etcd cluster as a member. Ideally, this means that there would briefly be 4 etcd members (assuming the usual 3-node control plane). Even in the worst case, CPMS should ensure there are never fewer than 2 healthy etcd members, which is the minimum needed for quorum in a 3-member etcd cluster.
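
The arithmetic behind this expectation can be sketched as follows. The functions are hypothetical illustrations of the rule described above, not the logic CPMS or the etcd operator actually implements. The only fact assumed is that an etcd cluster with n members needs a majority, floor(n/2) + 1, to keep quorum.

```python
def quorum(member_count: int) -> int:
    """Majority needed for an etcd cluster with member_count members."""
    return member_count // 2 + 1


def meets_worst_case_floor(healthy_members: int, desired_members: int = 3) -> bool:
    """Worst-case rule: after draining one healthy member, the remaining
    healthy members must still reach quorum for the desired cluster size."""
    return healthy_members - 1 >= quorum(desired_members)


def meets_ideal_rule(healthy_members: int, desired_members: int = 3) -> bool:
    """Ideal rule: drain only once the replacement has joined, so the
    cluster briefly has more healthy members than desired (4 for a
    3-node control plane)."""
    return healthy_members > desired_members


if __name__ == "__main__":
    for healthy in (4, 3, 2):
        print(
            f"healthy={healthy} "
            f"worst_case_ok={meets_worst_case_floor(healthy)} "
            f"ideal_ok={meets_ideal_rule(healthy)}"
        )
```

Under these assumptions, draining with 3 healthy members passes only the worst-case floor (2 remain), and draining with 2 healthy members fails both checks, which corresponds to the state this bug describes.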

Additional info
This graph screenshot shows the observed behavior. The three "rapid" machine deletions took place between ~8pm and ~9pm. The unacceptable period (when there was only one healthy etcd member) occurred around 8:12pm. The brief spike to 4 members around 10pm was the result of a test deletion we performed after the bug was observed, and it demonstrates the expected behavior.

Assignee: Damiano Donati

Reporter: Anthony Byrne

Need Info From: None

Contributors: None

QA Contact: Zhaohua Sun

Doc Contact: None

Votes: 0

Watchers: 2

Created: 2025/11/26 8:48 PM

Updated: 2025/11/26 9:21 PM
