Description of problem
GCP has added two AI zones, so GCP installs to the us-south1 and us-central1 regions may fail unless you explicitly select zones in your install-config. All existing OCP installers are exposed, from 4.12 through 4.22. The only known mitigations are explicitly setting zones in the install-config, or using a region that does not include AI zones.
Version-Release number of selected component
All installer versions, at least as far back as 4.12, and probably all v4 GCP installers ever.
How reproducible
Every time? At least very common.
Steps to Reproduce
1. Create a GCP cluster in either us-south1 or us-central1, and do not specify zones in the install-config.
Actual results
The install fails with messages like "minimum worker replica count ... not yet met", and the installer logs contain ...-ai... zone references.
Expected results
Successful installs.
Workaround
Specify the desired list of zones in which you want to install.
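As a rough illustration of the workaround, the fragment below pins both machine pools to explicit zones. It is a sketch, assuming the usual install-config machine-pool layout (controlPlane.platform.gcp.zones and compute[].platform.gcp.zones). The zone names are the four non-AI us-central1 zones that appear in the gcloud output under Additional information, and every other install-config field is omitted. Adjust the list for us-south1 or for whichever zones you actually want.

```yaml
# Sketch only: explicit zones for both machine pools, so the installer
# does not fall back to whatever zone list the GCP API returns.
controlPlane:
  name: master
  platform:
    gcp:
      zones:
      - us-central1-a
      - us-central1-b
      - us-central1-c
      - us-central1-f
compute:
- name: worker
  platform:
    gcp:
      zones:
      - us-central1-a
      - us-central1-b
      - us-central1-c
      - us-central1-f
platform:
  gcp:
    region: us-central1
```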
Additional information
Poking around in the OCP CI project, GCP currently seems to be wildly inconsistent in how it handles zone listing in the relevant API calls. For example, the regions/get documentation does not commit to a zone ordering:
zones[] string
[Output Only] A list of zones available in this region, in the form of resource URLs.
While the zones/list documentation claims a default ordering for its orderBy parameter:
orderBy string
Sorts list results by a certain order. By default, results are returned in alphanumerical order based on the resource name.
But testing with gcloud, they seem to be pretty consistently not sorting alphabetically by name, and also not even consistently including the AI zone:
$ for X in $(seq 100); do gcloud --format=json compute regions describe us-central1 | jq -c '.zones'; done | sort | uniq -c
31 ["https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-a","https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-b","https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-c","https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-f"]
69 ["https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-a","https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-b","https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-c","https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-f","https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-ai1a"]
$ for X in $(seq 100); do gcloud --format=json compute zones list --filter name:us-central1 | jq -c '[.[].name]'; done | sort | uniq -c
84 ["us-central1-c","us-central1-a","us-central1-f","us-central1-b"]
16 ["us-central1-c","us-central1-a","us-central1-f","us-central1-b","us-central1-ai1a"]
And gcloud isn't secretly setting the orderBy parameter:
$ gcloud --verbosity=debug --format=json compute zones list --filter name:us-central1 2>&1 | grep 'GET\|sort\|order\|zones'
DEBUG: Running [gcloud.compute.zones.list] with arguments: [--filter: "name:us-central1", --format: "json", --verbosity: "debug"]
DEBUG: https://compute.googleapis.com:443 "GET /compute/v1/projects/openshift-gce-devel-ci-2/zones?alt=json&filter=name+eq+%22.%2A%5Cbus%5C-central1%5Cb.%2A%22&maxResults=500 HTTP/1.1" 200 None
INFO: cache collection=compute.zones api_version=v1 params=['project', 'zone']
"selfLink": "https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-c",
"selfLink": "https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-a",
"selfLink": "https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-f",
"selfLink": "https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-b",
"selfLink": "https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-ai1a",
So... pretty weird. Maybe they're having some trouble with their rollout, and they're currently split-brained about whether the AI zone exists. And they also lost track of their nominal zones/list orderBy defaulting claims?
Google seems to be aware of the AI-zone-inclusion instability. Check https://console.cloud.google.com/servicehealth/incidents in your GCP project if you're seeing this, and look for an incident titled "Google Compute Engine customers deploying VMs in us-central1 and us-south1 may experience Compute Engine selecting or displaying AI Zones". They mention 2026-01-23 as a possible date of initial impact, although the only Compute release notes in that space are Jan. 20 and Jan. 26, with no release notes for the 23rd, and neither of the two bracketing release notes sounds like it is talking about the AI zones.
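Since the zone listing is currently unstable, one way to build an explicit zone list for the workaround is to filter the AI zones out of whatever the API returns on a given call. The helper below is a minimal sketch: filter_ai_zones is a hypothetical name, and the "-ai<digits><letter>" suffix pattern is only inferred from the us-central1-ai1a name seen above, not from any documented GCP naming rule.

```shell
#!/bin/sh
# Hypothetical helper: read zone names on stdin (one per line) and print
# only the ones that do not look like AI zones, sorted by name.
# Assumption: AI zones end in "-ai<digits><letter>", e.g. us-central1-ai1a.
filter_ai_zones() {
  grep -Ev -e '-ai[0-9]+[a-z]$' | sort
}

# Example usage (not run here; needs gcloud and a configured project):
#   gcloud compute zones list --filter name:us-central1 \
#     --format='value(name)' | filter_ai_zones
```

The sort is there because, per the output above, the API does not appear to return zones in a stable alphabetical order.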
- blocks: OCPBUGS-74672 GCP installs should succeed if 'zones' is not specified and the region has an AI zone (New)
- is cloned by: OCPBUGS-74672 GCP installs should succeed if 'zones' is not specified and the region has an AI zone (New)
- relates to: TRT-2529 GCP Install Failures: Instance has not been created, Instance not found on provider (Closed)