  1. OpenShift Bugs
  2. OCPBUGS-10960

[4.13] Vertical Scaling: do not trigger inadvertent machine deletion during bootstrap


    • Bug
    • Resolution: Done
    • Major
    • None
    • 4.13, 4.12, 4.11
    • Etcd
    • None
    • Moderate
    • No
    • 1
    • ETCD Sprint 234
    • 1
    • Rejected
    • False
    • Previously, the `ControlPlaneMachineSet` Operator attempted to recreate a master machine before cluster bootstrapping completed. This could result in the removal of the bootstrap node from the etcd cluster membership, which then caused etcd quorum loss and took the cluster offline. With this update, the `ControlPlaneMachineSet` Operator recreates a master machine only after the etcd Cluster Operator removes the bootstrap node. (link:https://issues.redhat.com/browse/OCPBUGS-10960[*OCPBUGS-10960*])
    • Bug Fix
    • Done

      This is a clone of issue OCPBUGS-10351. The following is the description of the original issue:

      Forked off from OCPBUGS-8038

      From the must-gather in https://issues.redhat.com/browse/OCPBUGS-8038?focusedId=21912866&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-21912866 we could find the following logs:

      2023-03-14T21:17:11.465797715Z I0314 21:17:11.465697       1 machinedeletionhooks.go:137] current members [ID:3518515843762260966 name:"etcd-bootstrap" peerURLs:"https://10.0.101.19:2380" clientURLs:"https://10.0.101.19:2379"  ID:8589017757839934213 name:"ip-10-0-1-84.ec2.internal" peerURLs:"https://10.0.1.84:2380" clientURLs:"https://10.0.1.84:2379"  ID:10724356314977705432 name:"ip-10-0-1-133.ec2.internal" peerURLs:"https://10.0.1.133:2380" clientURLs:"https://10.0.1.133:2379"  ID:13492066955251100765 name:"ip-10-0-1-207.ec2.internal" peerURLs:"https://10.0.1.207:2380" clientURLs:"https://10.0.1.207:2379" ] with IPSet: map[10.0.1.133:{} 10.0.1.207:{} 10.0.1.84:{} 10.0.101.19:{}]
      2023-03-14T21:17:11.465797715Z I0314 21:17:11.465758       1 machinedeletionhooks.go:151] skip removing the deletion hook from machine mdtest-d7vwd-master-0 since its member is still present with any of: [{InternalIP 10.0.1.207} {InternalDNS ip-10-0-1-207.ec2.internal} {Hostname ip-10-0-1-207.ec2.internal}]
      2023-03-14T21:17:13.859516308Z I0314 21:17:13.859419       1 machinedeletionhooks.go:137] current members [ID:3518515843762260966 name:"etcd-bootstrap" peerURLs:"https://10.0.101.19:2380" clientURLs:"https://10.0.101.19:2379"  ID:8589017757839934213 name:"ip-10-0-1-84.ec2.internal" peerURLs:"https://10.0.1.84:2380" clientURLs:"https://10.0.1.84:2379"  ID:10724356314977705432 name:"ip-10-0-1-133.ec2.internal" peerURLs:"https://10.0.1.133:2380" clientURLs:"https://10.0.1.133:2379" ] with IPSet: map[10.0.1.133:{} 10.0.1.84:{} 10.0.101.19:{}]
      2023-03-14T21:17:23.870877844Z I0314 21:17:23.870837       1 machinedeletionhooks.go:160] successfully removed the deletion hook from machine mdtest-d7vwd-master-0
      2023-03-14T21:17:23.875474696Z I0314 21:17:23.875400       1 machinedeletionhooks.go:137] current members [ID:3518515843762260966 name:"etcd-bootstrap" peerURLs:"https://10.0.101.19:2380" clientURLs:"https://10.0.101.19:2379"  ID:8589017757839934213 name:"ip-10-0-1-84.ec2.internal" peerURLs:"https://10.0.1.84:2380" clientURLs:"https://10.0.1.84:2379"  ID:10724356314977705432 name:"ip-10-0-1-133.ec2.internal" peerURLs:"https://10.0.1.133:2380" clientURLs:"https://10.0.1.133:2379" ] with IPSet: map[10.0.1.133:{} 10.0.1.84:{} 10.0.101.19:{}]
      2023-03-14T21:17:27.426349565Z I0314 21:17:27.424701       1 machinedeletionhooks.go:222] successfully removed the guard pod from machine mdtest-d7vwd-master-0
      2023-03-14T21:17:31.431703982Z I0314 21:17:31.431615       1 machinedeletionhooks.go:222] successfully removed the guard pod from machine mdtest-d7vwd-master-0
      
      

      At roughly the same time, the bootstrap process was still finishing:

      2023-03-14T21:17:11.510890775Z W0314 21:17:11.510850       1 bootstrap_teardown_controller.go:140] cluster-bootstrap is not yet finished - ConfigMap 'kube-system/bootstrap' not found
      2023-03-14T21:17:12.736741689Z W0314 21:17:12.736697       1 bootstrap_teardown_controller.go:140] cluster-bootstrap is not yet finished - ConfigMap 'kube-system/bootstrap' not found
      

      This ended with two members (bootstrap and master-0) being torn down, leaving just barely a quorum of two.
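
      To make the quorum arithmetic concrete, here is a minimal sketch in Go. It assumes the standard majority rule for etcd (quorum is floor(n/2) + 1); the function names `quorum` and `faultTolerance` are made up for illustration and do not come from the operator code.

```go
package main

import "fmt"

// quorum returns how many members must be available for a cluster of the
// given size to keep accepting writes: floor(n/2) + 1.
func quorum(members int) int {
	return members/2 + 1
}

// faultTolerance returns how many members can be lost before quorum is lost.
func faultTolerance(members int) int {
	return members - quorum(members)
}

func main() {
	// Member counts from the logs: 4 (bootstrap + 3 masters), 3 once
	// master-0 is removed, and 2 once the bootstrap member is also gone.
	for _, n := range []int{4, 3, 2} {
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

      With two members left, the quorum is also two, so the loss of either remaining member would take etcd offline.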

      The CPMSO (ControlPlaneMachineSet Operator) was trying to re-create master-0 during installation due to a label change, which left the CEO (cluster-etcd-operator) fairly confused about what it was doing during the installation process. Helpful timeline from Trevor: https://docs.google.com/document/d/1o9hJT-M4HSbGbHMm5n-LjQlwVKeKfF3Ln5_ruAXWr5I/edit#heading=h.hfowu6nlc7em
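
      The kind of guard this issue asks for could be sketched roughly as follows. This is not the actual fix in the operator. The `Member` type, `hasBootstrapMember`, and `safeToReplaceMaster` are hypothetical names, and `bootstrapFinished` merely stands in for whatever signal the operator uses (the logs above mention the `kube-system/bootstrap` ConfigMap). Only the member name `etcd-bootstrap` is taken from the logs.

```go
package main

import "fmt"

// Member is a hypothetical, simplified stand-in for an etcd member entry.
type Member struct {
	Name string
}

// bootstrapMemberName is the bootstrap member's name as shown in the logs.
const bootstrapMemberName = "etcd-bootstrap"

// hasBootstrapMember reports whether the bootstrap member is still listed.
func hasBootstrapMember(members []Member) bool {
	for _, m := range members {
		if m.Name == bootstrapMemberName {
			return true
		}
	}
	return false
}

// safeToReplaceMaster is a hypothetical guard: it allows a master machine
// to be replaced only when bootstrap has finished and the bootstrap member
// has already been removed from the etcd membership.
func safeToReplaceMaster(members []Member, bootstrapFinished bool) bool {
	return bootstrapFinished && !hasBootstrapMember(members)
}

func main() {
	// Membership from the logs after master-0's member was removed.
	members := []Member{
		{Name: "etcd-bootstrap"},
		{Name: "ip-10-0-1-84.ec2.internal"},
		{Name: "ip-10-0-1-133.ec2.internal"},
	}
	fmt.Println(safeToReplaceMaster(members, false))
}
```

      In the situation captured by the logs, such a guard would return false, because the bootstrap member is still present and bootstrap has not yet finished.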

              melbeher@redhat.com Mustafa Elbehery
              openshift-crt-jira-prow OpenShift Prow Bot
              Ge Liu Ge Liu