-
Bug
-
Resolution: Unresolved
-
Critical
-
None
-
4.17.z
Description of problem
During install of multi-AZ OSD GCP clusters into customer-provided GCP projects, extra control plane nodes are created by the installer. This may be limited to a few regions, and has shown up in our testing in us-west2 and asia-east2.
When the cluster is installed, the installer provisions three control plane nodes via the cluster-api:
- master-0 in AZ *a
- master-1 in AZ *b
- master-2 in AZ *c
However, the Machine manifests for master-0 and master-2 are written with the wrong AZs (master-0 in AZ *c and master-2 in AZ *a).
When the Machine controller in-cluster starts up and parses the manifests, it cannot find a VM for master-0 in AZ *c, or master-2 in *a, so it proceeds to try to create new VMs for those cases. master-1 is identified correctly, and unaffected.
This results in the cluster coming up with three control plane nodes, of which master-0 and master-2 have no backing Machines; three control plane Machines, of which only master-1 has a Node link, while the other two are listed in Provisioned state with no Nodes; and 5 GCP VMs for these control plane nodes:
- master-0 in AZ *a
- master-0 in AZ *c
- master-1 in AZ *b
- master-2 in AZ *a
- master-2 in AZ *c
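The mismatch described above can be modelled in a few lines. This is a minimal sketch, not the Machine controller's actual logic: it assumes a VM is matched to a Machine by its (name, zone) pair, uses us-west2 zone names as an example, and the `missing_vms` helper is hypothetical.

```python
# Illustrative model only; the real Machine controller logic is more involved.
# VMs the installer actually created, as (name, zone), per the report.
installer_vms = {
    ("master-0", "us-west2-a"),
    ("master-1", "us-west2-b"),
    ("master-2", "us-west2-c"),
}

# Zones as written in the Machine manifests (master-0 and master-2 swapped).
machine_manifests = {
    "master-0": "us-west2-c",
    "master-1": "us-west2-b",
    "master-2": "us-west2-a",
}


def missing_vms(manifests, existing):
    """Return (name, zone) pairs that have no matching VM (hypothetical helper)."""
    return sorted(
        (name, zone)
        for name, zone in manifests.items()
        if (name, zone) not in existing
    )


# Machines whose (name, zone) has no VM trigger a new VM in this model.
to_create = missing_vms(machine_manifests, installer_vms)
all_vms = sorted(installer_vms | set(to_create))

print("VMs created by the Machine controller:", to_create)
print("Total control plane VMs:", len(all_vms))
```

Under these assumptions only master-1 matches an existing VM, so two additional VMs are requested and the total comes to five, matching the list above.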
This happens consistently, across multiple GCP projects, so far in us-west2 and asia-east2 ONLY.
4.16.z clusters work as expected, as do clusters upgraded from 4.16.z to 4.17.z.
Version-Release number of selected component
4.17.0-rc3 - 4.17.0-rc6 have all been identified as having this issue.
How reproducible
100%
Steps to Reproduce
I'm unsure how to replicate this in a vanilla cluster install, but via OSD:
- Create a multi-AZ cluster in one of the reported regions, with a customer-supplied GCP project (i.e. CCS, or "Customer Cloud Subscription", not the core OSD shared project).
Example:
$ ocm create cluster --provider=gcp --multi-az --ccs --secure-boot-for-shielded-vms --region asia-east2 --service-account-file ~/.config/gcloud/chcollin1-dev-acct.json --channel-group candidate --version openshift-v4.17.0-rc.3-candidate chcollin-4170rc3-gcp
Requesting a GCP install via an install-config with controlPlane.platform.gcp.zones out of order seems to reliably reproduce the issue.
Actual results
Install will fail in OSD, but a cluster will be created with multiple extra control-plane nodes, and the API server will respond on the master-1 node.
Expected results
A standard 3 control-plane-node cluster is created.
Additional info
We're unsure what it is about the two reported regions, or the difference between the primary OSD GCP project and customer-supplied projects, that has an effect.
The only thing we've noticed is the install-config has the order backwards for compute nodes, but not for control plane nodes:
{
  "controlPlane": [
    "us-west2-a",
    "us-west2-b",
    "us-west2-c"
  ],
  "compute": [
    "us-west2-c",  <--- inverted order. Shouldn't matter when building control-plane Machines, but maybe cross-contaminated somehow?
    "us-west2-b",
    "us-west2-a"
  ],
  "platform": {
    "defaultMachinePlatform": {  <--- nothing about zones in here, although again, the controlPlane block should override any zones configured here
      "osDisk": {
        "DiskSizeGB": 0,
        "diskType": ""
      },
      "secureBoot": "Enabled",
      "type": ""
    },
    "projectID": "anishpatel",
    "region": "us-west2"
  }
}
Since we see the divergence at the asset/manifest level, we should be able to reproduce with just an openshift-install create manifests, followed by a grep -r zones: (or similar), without having to wait for an actual install attempt to come up and fail.
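As a rough alternative to grepping by hand, the check could be scripted. The following is a minimal sketch that assumes the generated manifests are YAML files and that each Machine's zone appears on a line of the form `zone: <name>`; both the field name and the `zones_by_file` helper are assumptions for illustration, not confirmed installer output.

```python
import re
from pathlib import Path

# Matches lines such as "zone: us-west2-a" (optionally a list item).
# The exact field name in the generated manifests is an assumption.
ZONE_LINE = re.compile(r"^\s*-?\s*zone:\s*(\S+)\s*$")


def zones_by_file(root):
    """Map each YAML file under root to the zone values found in it, in order."""
    found = {}
    for path in sorted(Path(root).rglob("*.yaml")):
        zones = [
            m.group(1).strip("\"'")
            for line in path.read_text(errors="replace").splitlines()
            if (m := ZONE_LINE.match(line))
        ]
        if zones:
            # Key by path relative to the scanned directory for readability.
            found[str(path.relative_to(root))] = zones
    return found
```

Calling `zones_by_file` on the directory passed to openshift-install create manifests and printing the result would show, per manifest file, which zone each control plane Machine was written with, so a swapped master-0/master-2 pair is visible without running a full install.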
- blocks
-
OCPBUGS-42699 Extra control plane VMs created during GCP install in 4.17+
- Closed
- is cloned by
-
OCPBUGS-42699 Extra control plane VMs created during GCP install in 4.17+
- Closed
- links to
-
RHEA-2024:6122 OpenShift Container Platform 4.18.z bug fix update