Forked off from OCPBUGS-8038
From the must-gather in https://issues.redhat.com/browse/OCPBUGS-8038?focusedId=21912866&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-21912866 we could find the following logs:
2023-03-14T21:17:11.465797715Z I0314 21:17:11.465697 1 machinedeletionhooks.go:137] current members [ID:3518515843762260966 name:"etcd-bootstrap" peerURLs:"https://10.0.101.19:2380" clientURLs:"https://10.0.101.19:2379" ID:8589017757839934213 name:"ip-10-0-1-84.ec2.internal" peerURLs:"https://10.0.1.84:2380" clientURLs:"https://10.0.1.84:2379" ID:10724356314977705432 name:"ip-10-0-1-133.ec2.internal" peerURLs:"https://10.0.1.133:2380" clientURLs:"https://10.0.1.133:2379" ID:13492066955251100765 name:"ip-10-0-1-207.ec2.internal" peerURLs:"https://10.0.1.207:2380" clientURLs:"https://10.0.1.207:2379" ] with IPSet: map[10.0.1.133:{} 10.0.1.207:{} 10.0.1.84:{} 10.0.101.19:{}]
2023-03-14T21:17:11.465797715Z I0314 21:17:11.465758 1 machinedeletionhooks.go:151] skip removing the deletion hook from machine mdtest-d7vwd-master-0 since its member is still present with any of: [{InternalIP 10.0.1.207} {InternalDNS ip-10-0-1-207.ec2.internal} {Hostname ip-10-0-1-207.ec2.internal}]
2023-03-14T21:17:13.859516308Z I0314 21:17:13.859419 1 machinedeletionhooks.go:137] current members [ID:3518515843762260966 name:"etcd-bootstrap" peerURLs:"https://10.0.101.19:2380" clientURLs:"https://10.0.101.19:2379" ID:8589017757839934213 name:"ip-10-0-1-84.ec2.internal" peerURLs:"https://10.0.1.84:2380" clientURLs:"https://10.0.1.84:2379" ID:10724356314977705432 name:"ip-10-0-1-133.ec2.internal" peerURLs:"https://10.0.1.133:2380" clientURLs:"https://10.0.1.133:2379" ] with IPSet: map[10.0.1.133:{} 10.0.1.84:{} 10.0.101.19:{}]
2023-03-14T21:17:23.870877844Z I0314 21:17:23.870837 1 machinedeletionhooks.go:160] successfully removed the deletion hook from machine mdtest-d7vwd-master-0
2023-03-14T21:17:23.875474696Z I0314 21:17:23.875400 1 machinedeletionhooks.go:137] current members [ID:3518515843762260966 name:"etcd-bootstrap" peerURLs:"https://10.0.101.19:2380" clientURLs:"https://10.0.101.19:2379" ID:8589017757839934213 name:"ip-10-0-1-84.ec2.internal" peerURLs:"https://10.0.1.84:2380" clientURLs:"https://10.0.1.84:2379" ID:10724356314977705432 name:"ip-10-0-1-133.ec2.internal" peerURLs:"https://10.0.1.133:2380" clientURLs:"https://10.0.1.133:2379" ] with IPSet: map[10.0.1.133:{} 10.0.1.84:{} 10.0.101.19:{}]
2023-03-14T21:17:27.426349565Z I0314 21:17:27.424701 1 machinedeletionhooks.go:222] successfully removed the guard pod from machine mdtest-d7vwd-master-0
2023-03-14T21:17:31.431703982Z I0314 21:17:31.431615 1 machinedeletionhooks.go:222] successfully removed the guard pod from machine mdtest-d7vwd-master-0
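The decision visible in these log lines (skip removing the deletion hook while the machine's member is still listed, remove it once the member is gone) can be sketched as a lookup of the machine's addresses in the member IP set. This is only an illustrative sketch: `memberPresent` is a hypothetical helper, not the operator's actual code, and the addresses are copied from the logs above.

```go
package main

import "fmt"

// memberPresent reports whether any of the machine's addresses is still
// in the set of etcd member IPs. Hypothetical helper for illustration only.
func memberPresent(ipSet map[string]struct{}, machineAddrs []string) bool {
	for _, addr := range machineAddrs {
		if _, ok := ipSet[addr]; ok {
			return true
		}
	}
	return false
}

func main() {
	// Addresses reported for mdtest-d7vwd-master-0 in the logs.
	master0 := []string{"10.0.1.207", "ip-10-0-1-207.ec2.internal"}

	// IPSet at 21:17:11: the member for 10.0.1.207 is still present.
	before := map[string]struct{}{
		"10.0.1.133": {}, "10.0.1.207": {}, "10.0.1.84": {}, "10.0.101.19": {},
	}
	// IPSet at 21:17:13: the member for 10.0.1.207 has been removed.
	after := map[string]struct{}{
		"10.0.1.133": {}, "10.0.1.84": {}, "10.0.101.19": {},
	}

	fmt.Println("21:17:11 keep deletion hook:", memberPresent(before, master0))
	fmt.Println("21:17:13 keep deletion hook:", memberPresent(after, master0))
}
```

Note that a check of this shape looks only at membership; it does not consider whether bootstrap is still in progress.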
At the same time, the bootstrap process was just about finishing:
2023-03-14T21:17:11.510890775Z W0314 21:17:11.510850 1 bootstrap_teardown_controller.go:140] cluster-bootstrap is not yet finished - ConfigMap 'kube-system/bootstrap' not found
2023-03-14T21:17:12.736741689Z W0314 21:17:12.736697 1 bootstrap_teardown_controller.go:140] cluster-bootstrap is not yet finished - ConfigMap 'kube-system/bootstrap' not found
This ended with two nodes (bootstrap and master-0) being torn down, leaving only two etcd members, which is just barely a quorum.
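To make the quorum arithmetic concrete, here is a small sketch using the standard majority rule for etcd (quorum = floor(n/2) + 1). The function names are made up for illustration; the member counts are the ones seen in this report (four members with the bootstrap node, three after master-0's member was removed, two after bootstrap teardown).

```go
package main

import "fmt"

// quorum returns the number of members that must be available for a
// cluster of the given size to make progress (a strict majority).
func quorum(members int) int {
	return members/2 + 1
}

// faultTolerance returns how many members can fail before quorum is lost.
func faultTolerance(members int) int {
	return members - quorum(members)
}

func main() {
	// 4: bootstrap + three masters; 3: master-0 removed; 2: bootstrap removed.
	for _, n := range []int{4, 3, 2} {
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

With two members the quorum is also two, so the cluster cannot tolerate the loss of either remaining member.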
The CPMSO (Control Plane Machine Set Operator) was trying to re-create master-0 during installation because of a label change, which left the CEO (cluster-etcd-operator) fairly confused about what it was doing during the installation process. Helpful timeline from Trevor: https://docs.google.com/document/d/1o9hJT-M4HSbGbHMm5n-LjQlwVKeKfF3Ln5_ruAXWr5I/edit#heading=h.hfowu6nlc7em
- blocks: OCPBUGS-10960 [4.13] Vertical Scaling: do not trigger inadvertent machine deletion during bootstrap (Closed)
- is caused by: OCPBUGS-8038 4.12 Cluster Installation Failed with Possible Race Condition (Closed)
- is cloned by: OCPBUGS-10960 [4.13] Vertical Scaling: do not trigger inadvertent machine deletion during bootstrap (Closed)
- relates to: CFE-69 User defined tags for AWS Resources GA (Closed)
- links to: RHEA-2023:5006 rpm