OCPBUGS-42699

Extra control plane VMs created during GCP install in 4.17+

      * Previously, if availability zones were not in a specific order in the `install-config.yaml` configuration file, the installation program would wrongly sort the zones before saving the control plane machine set manifests. When the program created the machines, additional control plane virtual machines were created to reconcile the machines into each zone. This caused a resource-constraint issue. With this release, the installation program no longer sorts availability zones so that this issue no longer occurs. (link:https://issues.redhat.com/browse/OCPBUGS-42699[*OCPBUGS-42699*])
    • Bug Fix
    • Done

      This is a clone of issue OCPBUGS-42563. The following is the description of the original issue:

      Description of problem

      During install of multi-AZ OSD GCP clusters into customer-provided GCP projects, extra control plane nodes are created by the installer. This may be limited to a few regions, and has shown up in our testing in us-west2 and asia-east2.

      When the cluster is installed, the installer provisions three control plane nodes via the cluster-api:

      • master-0 in AZ *a
      • master-1 in AZ *b
      • master-2 in AZ *c

      However, the Machine manifests for master-0 and master-2 are written with the wrong AZs (master-0 in AZ *c and master-2 in AZ *a).

      When the Machine controller in-cluster starts up and parses the manifests, it cannot find a VM for master-0 in AZ *c, or master-2 in *a, so it proceeds to try to create new VMs for those cases. master-1 is identified correctly, and unaffected.

      As a result, the cluster comes up with three control plane nodes, of which master-0 and master-2 have no backing Machines. There are three control plane Machines, but only master-1 has a Node link; the other two are listed in the Provisioned state with no Nodes. In total, 5 GCP VMs exist for these control plane nodes:

      • master-0 in AZ *a
      • master-0 in AZ *c
      • master-1 in AZ *b
      • master-2 in AZ *a
      • master-2 in AZ *c
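
      The reconciliation described above can be modeled in a few lines. This is an illustrative sketch only, not the Machine controller's actual code: the `reconcile` function and the data structures are hypothetical, and the zone letters stand in for the `*a`, `*b`, `*c` suffixes used in this report.

```python
# Illustrative model only; not the Machine controller's actual code.
# VMs that the installer provisioned, keyed by (machine name, zone suffix).
provisioned_vms = {("master-0", "a"), ("master-1", "b"), ("master-2", "c")}

# Zones recorded in the Machine manifests (master-0 and master-2 are swapped).
manifest_zones = {"master-0": "c", "master-1": "b", "master-2": "a"}


def reconcile(vms, manifests):
    """Return the VM set after creating a VM for every Machine without a match."""
    result = set(vms)
    for name, zone in manifests.items():
        # A Machine only matches a VM with the same name in the same zone.
        if (name, zone) not in result:
            result.add((name, zone))
    return result


final_vms = reconcile(provisioned_vms, manifest_zones)
for name, zone in sorted(final_vms):
    print(f"{name} in AZ *{zone}")
```

      In this model only master-1 matches its VM, so two extra VMs are created and the total comes to five, which matches the list above.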

      This happens consistently, across multiple GCP projects, so far in us-west2 and asia-east2 ONLY.

      4.16.z clusters work as expected, as do clusters upgraded from 4.16.z to 4.17.z.

      Version-Release number of selected component

      4.17.0-rc3 - 4.17.0-rc6 have all been identified as having this issue.

      How reproducible

      100%

      Steps to Reproduce

      I'm unsure how to replicate this in a vanilla cluster install, but via OSD:

      1. Create a multi-AZ cluster in one of the reported regions, with a supplied GCP project (not the core OSD shared project, i.e., CCS, or "Customer Cloud Subscription").

      Example:

      $ ocm create cluster --provider=gcp --multi-az --ccs --secure-boot-for-shielded-vms --region asia-east2 --service-account-file ~/.config/gcloud/chcollin1-dev-acct.json --channel-group candidate --version openshift-v4.17.0-rc.3-candidate chcollin-4170rc3-gcp
      

      Requesting a GCP install via an install-config with controlPlane.platform.gcp.zones out of order seems to reliably reproduce the issue.
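
      A quick way to tell whether a given install-config is likely to hit this is to check the zone order up front. The sketch below is a rough aid, not installer code: the helper names are hypothetical, the install-config is assumed to be already parsed into a dict, and "out of order" is assumed to mean "not in lexicographically sorted order".

```python
def control_plane_zones(install_config):
    """Read controlPlane.platform.gcp.zones from a parsed install-config dict."""
    return (
        install_config.get("controlPlane", {})
        .get("platform", {})
        .get("gcp", {})
        .get("zones", [])
    )


def zones_out_of_order(install_config):
    """Return True when the control plane zones are not in sorted order.

    Assumption: the affected installer sorts zones lexicographically, so any
    other order in the install-config can diverge from the saved manifests.
    """
    zones = list(control_plane_zones(install_config))
    return zones != sorted(zones)


# Example input using the region named in this report.
example = {
    "controlPlane": {
        "platform": {
            "gcp": {"zones": ["us-west2-c", "us-west2-b", "us-west2-a"]}
        }
    }
}
print(zones_out_of_order(example))
```

      An install-config without explicit control plane zones yields an empty list, which this check treats as in order.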

      Actual results

      Install will fail in OSD, but a cluster will be created with multiple extra control-plane nodes, and the API server will respond on the master-1 node.

      Expected results

      A standard 3 control-plane-node cluster is created.

      Additional info

      We're unsure what it is about the two reported regions, or the difference between the primary OSD GCP project and customer-supplied projects, that has an effect.

      The only thing we've noticed is the install-config has the order backwards for compute nodes, but not for control plane nodes:

      {
        "controlPlane": [
          "us-west2-a",
          "us-west2-b",
          "us-west2-c"
        ],
        "compute": [
          "us-west2-c",     <--- inverted order.  Shouldn't matter when building control-plane Machines, but maybe cross-contaminated somehow?
          "us-west2-b",
          "us-west2-a"
        ],
        "platform": {
          "defaultMachinePlatform": {  <--- nothing about zones in here, although again, the controlPlane block should override any zones configured here
            "osDisk": {
              "DiskSizeGB": 0,
              "diskType": ""
            },
            "secureBoot": "Enabled",
            "type": ""
          },
          "projectID": "anishpatel",
          "region": "us-west2"
        }
      }
      

      Since we see the divergence at the asset/manifest level, we should be able to reproduce with just an openshift-install create manifests, followed by grep -r zones: or something similar, without having to wait for an actual install attempt to come up and fail.
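
      The grep step could also be scripted so the zone lines are collected per manifest file. This is a minimal sketch under stated assumptions: `find_zone_lines` is a hypothetical helper, it assumes the generated manifests are YAML files containing `zone:` or `zones:` keys, and it does not parse the YAML or know the installer's directory layout.

```python
import os
import re

# Matches a "zone:" or "zones:" key anywhere in a line, like `grep -r zones\?:`.
ZONE_KEY = re.compile(r"\bzones?:")


def find_zone_lines(root):
    """Return sorted (relative path, line number, stripped line) zone hits."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            # Only look at YAML manifests; skip everything else.
            if not filename.endswith((".yaml", ".yml")):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8") as handle:
                for lineno, line in enumerate(handle, start=1):
                    if ZONE_KEY.search(line):
                        rel = os.path.relpath(path, root)
                        hits.append((rel, lineno, line.strip()))
    return sorted(hits)
```

      Calling `find_zone_lines` on the directory passed to openshift-install create manifests and comparing the hits for each control plane manifest against the install-config order should show the divergence without a full install.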

            rdossant Rafael Fonseca dos Santos
            openshift-crt-jira-prow OpenShift Prow Bot
            Jianli Wei Jianli Wei