OCPBUGS-42563

Extra control plane VMs created during GCP install in 4.17+

    • Release Note Not Required
    • In Progress

      Description of problem

      During install of multi-AZ OSD GCP clusters into customer-provided GCP projects, extra control plane nodes are created by the installer. This may be limited to a few regions, and has shown up in our testing in us-west2 and asia-east2.

      When the cluster is installed, the installer provisions three control plane nodes via the cluster-api:

      • master-0 in AZ *a
      • master-1 in AZ *b
      • master-2 in AZ *c

      However, the Machine manifests for master-0 and master-2 are written with the wrong AZs (master-0 in AZ *c and master-2 in AZ *a).

      When the Machine controller in-cluster starts up and parses the manifests, it cannot find a VM for master-0 in AZ *c, or master-2 in *a, so it proceeds to try to create new VMs for those cases. master-1 is identified correctly, and unaffected.

      As a result, the cluster comes up in an inconsistent state. There are three control plane Nodes, but master-0 and master-2 have no backing Machines. There are three control plane Machines, but only master-1 has a Node link; the other two are listed in Provisioned state with no Nodes. And there are 5 GCP VMs for these control plane nodes:

      • master-0 in AZ *a
      • master-0 in AZ *c
      • master-1 in AZ *b
      • master-2 in AZ *a
      • master-2 in AZ *c
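The mismatch described above can be sketched as a small model. This is only an illustration of the reported behavior, not the Machine controller's actual logic; the `reconcile` function and the (name, zone) lookup key are hypothetical simplifications.

```python
# Simplified model of the reported behavior (hypothetical, not real controller code).

def reconcile(existing_vms, machine_manifests):
    """Return the VM set after a VM is created for every unmatched Machine.

    existing_vms: set of (name, zone) tuples provisioned by the installer.
    machine_manifests: dict mapping Machine name to the zone in its manifest.
    """
    vms = set(existing_vms)
    for name, zone in machine_manifests.items():
        # No VM with this name in this zone: the model creates a new one.
        if (name, zone) not in vms:
            vms.add((name, zone))
    return vms


# VMs the installer actually provisioned.
installer_vms = {("master-0", "a"), ("master-1", "b"), ("master-2", "c")}
# Zones as written in the Machine manifests (master-0 and master-2 swapped).
manifests = {"master-0": "c", "master-1": "b", "master-2": "a"}

result = reconcile(installer_vms, manifests)
for name, zone in sorted(result):
    print(f"{name} in AZ *{zone}")
```

Under this model master-1 matches its existing VM, while master-0 and master-2 each gain a second VM in the zone named by the manifest.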

      This happens consistently, across multiple GCP projects, so far in us-west2 and asia-east2 ONLY.

      4.16.z clusters work as expected, as do clusters upgraded from 4.16.z to 4.17.z.

      Version-Release number of selected component

      4.17.0-rc.3 through 4.17.0-rc.6 have all been identified as having this issue.

      How reproducible

      100%

      Steps to Reproduce

      I'm unsure how to replicate this in a vanilla cluster install, but via OSD:

      1. Create a multi-AZ cluster in one of the reported regions, with a supplied GCP project (i.e. CCS, or "Customer Cloud Subscription", not the core OSD shared project).

      Example:

      $ ocm create cluster --provider=gcp --multi-az --ccs --secure-boot-for-shielded-vms --region asia-east2 --service-account-file ~/.config/gcloud/chcollin1-dev-acct.json --channel-group candidate --version openshift-v4.17.0-rc.3-candidate chcollin-4170rc3-gcp
      

      Requesting a GCP install via an install-config with controlPlane.platform.gcp.zones out of order seems to reliably reproduce the issue.
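A minimal sketch of preparing such an install-config fragment follows. The report does not say which ordering triggers the problem, so the reversed order used here is an assumption; the helper name `reorder_control_plane_zones` is hypothetical, and the fragment omits every field a real install-config would also need.

```python
import copy
import json


def reorder_control_plane_zones(install_config, zones):
    """Return a copy of the config with controlPlane.platform.gcp.zones replaced."""
    cfg = copy.deepcopy(install_config)
    control_plane = cfg.setdefault("controlPlane", {})
    gcp = control_plane.setdefault("platform", {}).setdefault("gcp", {})
    gcp["zones"] = list(zones)
    return cfg


# Partial install-config, shown as a dict; only the zone-related parts are included.
base = {
    "controlPlane": {
        "name": "master",
        "replicas": 3,
        "platform": {"gcp": {"zones": ["us-west2-a", "us-west2-b", "us-west2-c"]}},
    },
    "platform": {"gcp": {"region": "us-west2"}},
}

# Assumption: reversed order counts as "out of order" for reproduction purposes.
repro = reorder_control_plane_zones(
    base, ["us-west2-c", "us-west2-b", "us-west2-a"]
)
print(json.dumps(repro["controlPlane"], indent=2))
```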

      Actual results

      Install will fail in OSD, but a cluster will be created with multiple extra control-plane nodes, and the API server will respond on the master-1 node.

      Expected results

      A standard 3 control-plane-node cluster is created.

      Additional info

      We're unsure what it is about the two reported regions, or the difference between the primary OSD GCP project and customer-supplied projects, that has an effect.

      The only thing we've noticed is that the install-config lists the zones in reverse order for compute nodes, but not for control plane nodes:

      {
        "controlPlane": [
          "us-west2-a",
          "us-west2-b",
          "us-west2-c"
        ],
        "compute": [
          "us-west2-c",     <--- inverted order.  Shouldn't matter when building control-plane Machines, but maybe cross-contaminated somehow?
          "us-west2-b",
          "us-west2-a"
        ],
        "platform": {
          "defaultMachinePlatform": {  <--- nothing about zones in here, although again, the controlPlane block should override any zones configured here
            "osDisk": {
              "DiskSizeGB": 0,
              "diskType": ""
            },
            "secureBoot": "Enabled",
            "type": ""
          },
          "projectID": "anishpatel",
          "region": "us-west2"
        }
      }
      

      Since we see the divergence at the asset/manifest level, we should be able to reproduce with just openshift-install create manifests followed by grep -r zones: or similar, without having to wait for an actual install attempt to come up and fail.
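The manifest check suggested above could be scripted roughly as follows. This is a sketch under assumptions: the glob pattern for the master Machine manifest file names and the `zone:` key inside them are guesses about the generated layout, and `zones_by_manifest` and `mismatches` are hypothetical helper names. Adjust both to whatever openshift-install create manifests actually writes.

```python
import re
from pathlib import Path

# Assumption: each master Machine manifest carries a single "zone: <name>" line.
ZONE_RE = re.compile(r"^\s*zone:\s*(\S+)\s*$")


def zones_by_manifest(manifest_dir, pattern="*master-machines-*.yaml"):
    """Map each matching manifest file name to the first zone found in it."""
    found = {}
    for path in sorted(Path(manifest_dir).rglob(pattern)):
        for line in path.read_text().splitlines():
            match = ZONE_RE.match(line)
            if match:
                found[path.name] = match.group(1).strip("\"'")
                break
    return found


def mismatches(found, expected_zones):
    """Pair manifests (sorted by file name) with expected zones; return the differences."""
    return [
        (name, zone, want)
        for (name, zone), want in zip(sorted(found.items()), expected_zones)
        if zone != want
    ]
```

For example, `mismatches(zones_by_manifest("."), ["us-west2-a", "us-west2-b", "us-west2-c"])` run from the install directory would list any master manifest whose zone differs from the controlPlane order in the install-config. An empty list would mean the manifests agree with it.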

              rdossant Rafael Fonseca dos Santos
              chcollin Chris Collins
              Jianli Wei Jianli Wei