OpenShift Bugs / OCPBUGS-11060

[4.12] Vertical Scaling: do not trigger inadvertent machine deletion during bootstrap


    • Type: Bug
    • Resolution: Done
    • Priority: Normal
    • Affects Version/s: 4.13, 4.12, 4.11
    • Component/s: Etcd
    • Severity: Moderate
    • Sprint: ETCD Sprint 234
    • Release Note Type: Bug Fix

      This is a clone of issue OCPBUGS-10960. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-10351. The following is the description of the original issue:

      Forked off from OCPBUGS-8038

      From the must-gather in https://issues.redhat.com/browse/OCPBUGS-8038?focusedId=21912866&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-21912866 we could find the following logs:

      2023-03-14T21:17:11.465797715Z I0314 21:17:11.465697       1 machinedeletionhooks.go:137] current members [ID:3518515843762260966 name:"etcd-bootstrap" peerURLs:"https://10.0.101.19:2380" clientURLs:"https://10.0.101.19:2379"  ID:8589017757839934213 name:"ip-10-0-1-84.ec2.internal" peerURLs:"https://10.0.1.84:2380" clientURLs:"https://10.0.1.84:2379"  ID:10724356314977705432 name:"ip-10-0-1-133.ec2.internal" peerURLs:"https://10.0.1.133:2380" clientURLs:"https://10.0.1.133:2379"  ID:13492066955251100765 name:"ip-10-0-1-207.ec2.internal" peerURLs:"https://10.0.1.207:2380" clientURLs:"https://10.0.1.207:2379" ] with IPSet: map[10.0.1.133:{} 10.0.1.207:{} 10.0.1.84:{} 10.0.101.19:{}]
      2023-03-14T21:17:11.465797715Z I0314 21:17:11.465758       1 machinedeletionhooks.go:151] skip removing the deletion hook from machine mdtest-d7vwd-master-0 since its member is still present with any of: [{InternalIP 10.0.1.207} {InternalDNS ip-10-0-1-207.ec2.internal} {Hostname ip-10-0-1-207.ec2.internal}]
      2023-03-14T21:17:13.859516308Z I0314 21:17:13.859419       1 machinedeletionhooks.go:137] current members [ID:3518515843762260966 name:"etcd-bootstrap" peerURLs:"https://10.0.101.19:2380" clientURLs:"https://10.0.101.19:2379"  ID:8589017757839934213 name:"ip-10-0-1-84.ec2.internal" peerURLs:"https://10.0.1.84:2380" clientURLs:"https://10.0.1.84:2379"  ID:10724356314977705432 name:"ip-10-0-1-133.ec2.internal" peerURLs:"https://10.0.1.133:2380" clientURLs:"https://10.0.1.133:2379" ] with IPSet: map[10.0.1.133:{} 10.0.1.84:{} 10.0.101.19:{}]
      2023-03-14T21:17:23.870877844Z I0314 21:17:23.870837       1 machinedeletionhooks.go:160] successfully removed the deletion hook from machine mdtest-d7vwd-master-0
      2023-03-14T21:17:23.875474696Z I0314 21:17:23.875400       1 machinedeletionhooks.go:137] current members [ID:3518515843762260966 name:"etcd-bootstrap" peerURLs:"https://10.0.101.19:2380" clientURLs:"https://10.0.101.19:2379"  ID:8589017757839934213 name:"ip-10-0-1-84.ec2.internal" peerURLs:"https://10.0.1.84:2380" clientURLs:"https://10.0.1.84:2379"  ID:10724356314977705432 name:"ip-10-0-1-133.ec2.internal" peerURLs:"https://10.0.1.133:2380" clientURLs:"https://10.0.1.133:2379" ] with IPSet: map[10.0.1.133:{} 10.0.1.84:{} 10.0.101.19:{}]
      2023-03-14T21:17:27.426349565Z I0314 21:17:27.424701       1 machinedeletionhooks.go:222] successfully removed the guard pod from machine mdtest-d7vwd-master-0
      2023-03-14T21:17:31.431703982Z I0314 21:17:31.431615       1 machinedeletionhooks.go:222] successfully removed the guard pod from machine mdtest-d7vwd-master-0
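      For orientation, the check implied by the machinedeletionhooks lines above boils down to: build the set of current etcd member IPs, and keep the deletion hook on a machine for as long as any of its addresses still matches a member. Below is a minimal Go sketch of that comparison; the names and types are illustrative only, not the cluster-etcd-operator's actual code.

      package main

      import "fmt"

      // machineAddress mirrors the {InternalIP 10.0.1.207} style entries in the log above.
      type machineAddress struct {
          Type    string
          Address string
      }

      // memberIPSet builds the "IPSet" shown in the log from the current member IPs.
      func memberIPSet(ips []string) map[string]struct{} {
          set := make(map[string]struct{}, len(ips))
          for _, ip := range ips {
              set[ip] = struct{}{}
          }
          return set
      }

      // hasMember reports whether any of the machine's addresses still belongs to a
      // current etcd member; while this is true the deletion hook must stay in place.
      func hasMember(ipSet map[string]struct{}, addrs []machineAddress) bool {
          for _, a := range addrs {
              if _, ok := ipSet[a.Address]; ok {
                  return true
              }
          }
          return false
      }

      func main() {
          // Membership as in the 21:17:13 log line: master-0's member (10.0.1.207) is already gone.
          ipSet := memberIPSet([]string{"10.0.1.133", "10.0.1.84", "10.0.101.19"})
          master0 := []machineAddress{
              {"InternalIP", "10.0.1.207"},
              {"InternalDNS", "ip-10-0-1-207.ec2.internal"},
              {"Hostname", "ip-10-0-1-207.ec2.internal"},
          }
          if !hasMember(ipSet, master0) {
              fmt.Println("would remove the deletion hook from machine mdtest-d7vwd-master-0")
          }
      }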
      
      

      At the same time, the bootstrap process was just about to finish:

      2023-03-14T21:17:11.510890775Z W0314 21:17:11.510850       1 bootstrap_teardown_controller.go:140] cluster-bootstrap is not yet finished - ConfigMap 'kube-system/bootstrap' not found
      2023-03-14T21:17:12.736741689Z W0314 21:17:12.736697       1 bootstrap_teardown_controller.go:140] cluster-bootstrap is not yet finished - ConfigMap 'kube-system/bootstrap' not found
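      The bootstrap_teardown_controller treats the kube-system/bootstrap ConfigMap as its completion signal, which is why it logs the "not found" warning above while bootstrapping is still in progress. A rough Go sketch of that kind of check, assuming a client-go clientset (the function name and the "status: complete" field are illustrative assumptions, not the operator's actual code):

      package main

      import (
          "context"
          "fmt"

          apierrors "k8s.io/apimachinery/pkg/api/errors"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/kubernetes/fake"
      )

      // isBootstrapFinished treats the existence (and completion status) of the
      // kube-system/bootstrap ConfigMap as the "cluster-bootstrap is finished" marker.
      func isBootstrapFinished(ctx context.Context, client kubernetes.Interface) (bool, error) {
          cm, err := client.CoreV1().ConfigMaps("kube-system").Get(ctx, "bootstrap", metav1.GetOptions{})
          if apierrors.IsNotFound(err) {
              return false, nil // ConfigMap 'kube-system/bootstrap' not found
          }
          if err != nil {
              return false, err
          }
          // Assumption for illustration: the ConfigMap records "status: complete"
          // once bootstrapping is done.
          return cm.Data["status"] == "complete", nil
      }

      func main() {
          // Demo against a fake clientset: no ConfigMap exists yet, so bootstrap is not finished.
          client := fake.NewSimpleClientset()
          finished, err := isBootstrapFinished(context.Background(), client)
          fmt.Println(finished, err) // false <nil>
      }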
      

      This ended up with two nodes (the bootstrap node and master-0) being torn down, leaving just barely a quorum of two members.

      The CPMSO (Control Plane Machine Set Operator) was trying to re-create master-0 during installation because of a label change, which left the CEO (cluster-etcd-operator) fairly confused about what it was doing during the installation process. Helpful timeline from Trevor: https://docs.google.com/document/d/1o9hJT-M4HSbGbHMm5n-LjQlwVKeKfF3Ln5_ruAXWr5I/edit#heading=h.hfowu6nlc7em
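      The direction suggested by this ticket's title is to avoid removing machine deletion hooks while bootstrapping is still in progress, so that a transient membership change during bootstrap teardown cannot strip a hook and allow a machine to be deleted. A hedged Go sketch of such a guard follows; it is illustrative only, not the actual fix in the cluster-etcd-operator.

      package main

      import "fmt"

      // reconcileDeletionHooks sketches the guarded flow: while bootstrapping is still
      // in progress nothing is removed, and afterwards a hook is only removed once none
      // of the machine's IPs matches a current etcd member.
      func reconcileDeletionHooks(bootstrapFinished bool, memberIPs map[string]struct{}, machineIPs map[string][]string) {
          if !bootstrapFinished {
              // The member list is still in flux during bootstrap (the bootstrap member is
              // about to be torn down), so leave every deletion hook in place for now.
              return
          }
          for machine, ips := range machineIPs {
              stillMember := false
              for _, ip := range ips {
                  if _, ok := memberIPs[ip]; ok {
                      stillMember = true
                      break
                  }
              }
              if !stillMember {
                  fmt.Printf("removing deletion hook from machine %s\n", machine)
              }
          }
      }

      func main() {
          members := map[string]struct{}{"10.0.1.133": {}, "10.0.1.84": {}, "10.0.101.19": {}}
          machines := map[string][]string{"mdtest-d7vwd-master-0": {"10.0.1.207"}}
          // With bootstrapFinished=false (the situation in this bug) nothing is removed.
          reconcileDeletionHooks(false, members, machines)
      }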

              Assignee: Mustafa Elbehery (melbeher@redhat.com)
              Reporter: OpenShift Prow Bot (openshift-crt-jira-prow)
              QA Contact: Ge Liu