Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: None
Affects Version/s: 4.13, 4.12, 4.11
Component/s: Etcd
Labels:
None

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:
None
Story Points:
3
Severity:
Moderate
Regression:
No

Target Backport Versions:
None
Target Version:

4.14.0
Release Blocker:
Rejected
Sprint:
ETCD Sprint 233, ETCD Sprint 234
sprint_count:
2

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Forked off from ~~OCPBUGS-8038~~

From the must-gather in https://issues.redhat.com/browse/OCPBUGS-8038?focusedId=21912866&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-21912866 we could find the following logs:

2023-03-14T21:17:11.465797715Z I0314 21:17:11.465697       1 machinedeletionhooks.go:137] current members [ID:3518515843762260966 name:"etcd-bootstrap" peerURLs:"https://10.0.101.19:2380" clientURLs:"https://10.0.101.19:2379"  ID:8589017757839934213 name:"ip-10-0-1-84.ec2.internal" peerURLs:"https://10.0.1.84:2380" clientURLs:"https://10.0.1.84:2379"  ID:10724356314977705432 name:"ip-10-0-1-133.ec2.internal" peerURLs:"https://10.0.1.133:2380" clientURLs:"https://10.0.1.133:2379"  ID:13492066955251100765 name:"ip-10-0-1-207.ec2.internal" peerURLs:"https://10.0.1.207:2380" clientURLs:"https://10.0.1.207:2379" ] with IPSet: map[10.0.1.133:{} 10.0.1.207:{} 10.0.1.84:{} 10.0.101.19:{}]
2023-03-14T21:17:11.465797715Z I0314 21:17:11.465758       1 machinedeletionhooks.go:151] skip removing the deletion hook from machine mdtest-d7vwd-master-0 since its member is still present with any of: [{InternalIP 10.0.1.207} {InternalDNS ip-10-0-1-207.ec2.internal} {Hostname ip-10-0-1-207.ec2.internal}]
2023-03-14T21:17:13.859516308Z I0314 21:17:13.859419       1 machinedeletionhooks.go:137] current members [ID:3518515843762260966 name:"etcd-bootstrap" peerURLs:"https://10.0.101.19:2380" clientURLs:"https://10.0.101.19:2379"  ID:8589017757839934213 name:"ip-10-0-1-84.ec2.internal" peerURLs:"https://10.0.1.84:2380" clientURLs:"https://10.0.1.84:2379"  ID:10724356314977705432 name:"ip-10-0-1-133.ec2.internal" peerURLs:"https://10.0.1.133:2380" clientURLs:"https://10.0.1.133:2379" ] with IPSet: map[10.0.1.133:{} 10.0.1.84:{} 10.0.101.19:{}]
2023-03-14T21:17:23.870877844Z I0314 21:17:23.870837       1 machinedeletionhooks.go:160] successfully removed the deletion hook from machine mdtest-d7vwd-master-0
2023-03-14T21:17:23.875474696Z I0314 21:17:23.875400       1 machinedeletionhooks.go:137] current members [ID:3518515843762260966 name:"etcd-bootstrap" peerURLs:"https://10.0.101.19:2380" clientURLs:"https://10.0.101.19:2379"  ID:8589017757839934213 name:"ip-10-0-1-84.ec2.internal" peerURLs:"https://10.0.1.84:2380" clientURLs:"https://10.0.1.84:2379"  ID:10724356314977705432 name:"ip-10-0-1-133.ec2.internal" peerURLs:"https://10.0.1.133:2380" clientURLs:"https://10.0.1.133:2379" ] with IPSet: map[10.0.1.133:{} 10.0.1.84:{} 10.0.101.19:{}]
2023-03-14T21:17:27.426349565Z I0314 21:17:27.424701       1 machinedeletionhooks.go:222] successfully removed the guard pod from machine mdtest-d7vwd-master-0
2023-03-14T21:17:31.431703982Z I0314 21:17:31.431615       1 machinedeletionhooks.go:222] successfully removed the guard pod from machine mdtest-d7vwd-master-0

At roughly the same time, the bootstrap process was still finishing:

2023-03-14T21:17:11.510890775Z W0314 21:17:11.510850       1 bootstrap_teardown_controller.go:140] cluster-bootstrap is not yet finished - ConfigMap 'kube-system/bootstrap' not found
2023-03-14T21:17:12.736741689Z W0314 21:17:12.736697       1 bootstrap_teardown_controller.go:140] cluster-bootstrap is not yet finished - ConfigMap 'kube-system/bootstrap' not found

This ended up with two nodes (bootstrap + master-0) being torn down, leaving just barely a quorum of two.
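To make the quorum remark concrete, here is a small sketch of the usual majority arithmetic for a Raft-based cluster such as etcd. The helper names `quorum` and `faultTolerance` are illustrative and do not come from the operator code; the member counts are read off the logs above (four members before the teardown, two afterwards).

```go
package main

import "fmt"

// quorum returns the number of members that must agree for a cluster of the
// given size to make progress: a strict majority.
func quorum(members int) int {
	return members/2 + 1
}

// faultTolerance returns how many members can fail before quorum is lost.
func faultTolerance(members int) int {
	return members - quorum(members)
}

func main() {
	// 4 members before the teardown (etcd-bootstrap plus three masters),
	// 2 members after etcd-bootstrap and master-0 were removed.
	for _, n := range []int{4, 2} {
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

With two members left, both must stay healthy for the cluster to keep quorum, so there is no room for another failure.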

The CPMSO (Control Plane Machine Set Operator) was trying to re-create master-0 during installation due to a label change, which left the CEO (cluster-etcd-operator) fairly confused about what it was doing during the installation process. Helpful timeline from Trevor: https://docs.google.com/document/d/1o9hJT-M4HSbGbHMm5n-LjQlwVKeKfF3Ln5_ruAXWr5I/edit#heading=h.hfowu6nlc7em
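One way to picture the kind of guard this bug asks for (not acting on a machine deletion while bootstrap is still in flight) is sketched below. This is a minimal illustration under stated assumptions, not the operator's actual code: the `clusterState` type, its fields, and `canProcessMachineDeletion` are hypothetical names, and the extra check on the `etcd-bootstrap` member is an assumption added for illustration.

```go
package main

import "fmt"

// clusterState is a hypothetical snapshot of what an operator could observe.
// The field names are illustrative and not taken from cluster-etcd-operator.
type clusterState struct {
	// BootstrapComplete stands in for "cluster-bootstrap has finished",
	// which the log above ties to the ConfigMap 'kube-system/bootstrap'.
	BootstrapComplete bool
	// BootstrapMemberPresent is true while "etcd-bootstrap" is still listed
	// among the current etcd members (an assumed extra safety check).
	BootstrapMemberPresent bool
}

// canProcessMachineDeletion reports whether it looks safe to act on a pending
// master machine deletion. It refuses while bootstrap is still in flight, so
// that bootstrap teardown and a member removal cannot overlap.
func canProcessMachineDeletion(s clusterState) (bool, string) {
	if !s.BootstrapComplete {
		return false, "cluster-bootstrap is not yet finished"
	}
	if s.BootstrapMemberPresent {
		return false, "etcd-bootstrap member is still present"
	}
	return true, ""
}

func main() {
	// State resembling the logs at 21:17:11: bootstrap unfinished and
	// etcd-bootstrap still a member.
	ok, reason := canProcessMachineDeletion(clusterState{
		BootstrapComplete:      false,
		BootstrapMemberPresent: true,
	})
	fmt.Println(ok, reason)
}
```

Under a guard like this, the removal of master-0 seen in the logs would have been deferred until the bootstrap member was gone, instead of overlapping with the bootstrap teardown.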

blocks

OCPBUGS-10960 [4.13] Vertical Scaling: do not trigger inadvertent machine deletion during bootstrap

Closed

is caused by

OCPBUGS-8038 4.12 Cluster Installation Failed with Possible Race Condition

Closed

is cloned by

OCPBUGS-10960 [4.13] Vertical Scaling: do not trigger inadvertent machine deletion during bootstrap

Closed

relates to

CFE-69 User defined tags for AWS Resources GA

Closed

links to

openshift/cluster-etcd-operator#1027: OCPBUGS-10351: skip machine deletion during boostrap

RHEA-2023:5006 rpm


Assignee: Mustafa Elbehery

Reporter: Thomas Jungblut

QA Contact: Ge Liu

Need Info From: None

Votes: 0

Watchers: 10

Created: 2023/03/15 4:37 PM

Updated: 2025/07/27 11:48 AM

Resolved: 2023/10/31 12:58 PM
