OpenShift Bugs: OCPBUGS-74625

GCP installs should succeed if 'zones' is not specified and the region has an AI zone

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • 4.13, 4.12, 4.14, 4.15, 4.16, 4.17, 4.18, 4.19, 4.20, 4.21, 4.22
    • Critical
    • No
    • In Progress
    • Bug Fix
      *Cause*: Installing a cluster on GCP in us-south1 or us-central1 (the two regions that currently have AI zones) without specifying 'zones' in your install-config.
      *Consequence*: The installer selects the AI zone as one of its default zones. That zone is unlikely to offer the machine types the installer requests for control plane and compute nodes, so installation fails.
      *Fix*: AI zones are excluded from installer default-zone selection.
      *Result*: Installs that do not specify 'zones' no longer fail because of AI zones.

      Description of problem

      GCP has added two AI zones, so GCP installs to the us-south1 and us-central1 regions may fail unless you explicitly select zones in your install-config. All existing OCP installers are exposed, from 4.12 through 4.22. The only known mitigations are explicitly setting zones in the install-config, or using a region that does not include AI zones.

      Version-Release number of selected component

      All installer versions, at least as far back as 4.12, and probably all v4 GCP installers ever.

      How reproducible

      Every time? At least very common.

      Steps to Reproduce

      1. Create a GCP cluster in either us-south1 or us-central1, and do not specify zones in the install-config.

      Actual results

      The install fails with messages like "minimum worker replica count ... not yet met", and the installer logs contain ...-ai... zone references.
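
      One rough way to check whether a failed install hit this is to search the installer log for AI-zone-looking names. The sketch below is only an illustration: the helper name ai_zone_refs is hypothetical, and the pattern is an assumption modeled on the single AI zone name seen in this report (us-central1-ai1a), so it may need adjusting for other zones.

```shell
# Hypothetical helper: print each distinct AI-zone-looking name found in a log file.
# Assumption: AI zone names look like <region>-ai<digits><letters>, e.g. us-central1-ai1a.
ai_zone_refs() {
  grep -Eo -- '[a-z]+-[a-z]+[0-9]+-ai[0-9]+[a-z]+' "$1" | LC_ALL=C sort -u
}

# Example call (assumes the installer's log file in your install directory):
# ai_zone_refs "${INSTALL_DIR}/.openshift_install.log"
```

      Empty output suggests the log contains no zone names matching the assumed pattern.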

      Expected results

      Successful installs.

      Workaround

      Explicitly specify the list of zones in which you want to install in the install-config.
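
      As a sketch of how that zone list might be built, the helper below drops AI-zone-looking names from a list of zones and prints the rest as YAML list items. The function name non_ai_zones is hypothetical, the sample zone names are the ones observed in us-central1 in this report, and the -ai<digits><letters> suffix pattern is an assumption based on us-central1-ai1a.

```shell
# Hypothetical helper: read zone names on stdin, drop AI zones, sort the rest.
# Assumption: AI zone names end in -ai<digits><letters>, e.g. us-central1-ai1a.
non_ai_zones() {
  grep -Ev -- '-ai[0-9]+[a-z]+$' | LC_ALL=C sort
}

# Emit the remaining zones as YAML list items.
printf '%s\n' us-central1-c us-central1-a us-central1-f us-central1-b us-central1-ai1a |
  non_ai_zones |
  sed 's/^/      - /'
```

      The resulting list items can be placed under the zones field of the GCP machine pools in the install-config, for example controlPlane.platform.gcp.zones and compute[].platform.gcp.zones.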

      Additional information

      Poking around in the OCP CI project, GCP seems to currently be wildly inconsistent in how it handles zone listing in relevant API calls. For example, regions/get does not commit to a zone ordering:

      zones[] string
      [Output Only] A list of zones available in this region, in the form of resource URLs.

      zones/list does:

      orderBy string
      Sorts list results by a certain order. By default, results are returned in alphanumerical order based on the resource name.

      But testing with gcloud, they seem to be pretty consistently not sorting alphabetically by name, and also not even consistently including the AI zone:

      $ for X in $(seq 100); do gcloud --format=json compute regions describe us-central1 | jq -c '.zones'; done | sort | uniq -c
           31 ["https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-a","https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-b","https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-c","https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-f"]
           69 ["https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-a","https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-b","https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-c","https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-f","https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-ai1a"]
      $ for X in $(seq 100); do gcloud --format=json compute zones list --filter name:us-central1 | jq -c '[.[].name]'; done | sort | uniq -c
           84 ["us-central1-c","us-central1-a","us-central1-f","us-central1-b"]
           16 ["us-central1-c","us-central1-a","us-central1-f","us-central1-b","us-central1-ai1a"]
      

      And gcloud isn't secretly setting the orderBy parameter:

      $ gcloud --verbosity=debug --format=json compute zones list --filter name:us-central1 2>&1 | grep 'GET\|sort\|order\|zones'
      DEBUG: Running [gcloud.compute.zones.list] with arguments: [--filter: "name:us-central1", --format: "json", --verbosity: "debug"]
      DEBUG: https://compute.googleapis.com:443 "GET /compute/v1/projects/openshift-gce-devel-ci-2/zones?alt=json&filter=name+eq+%22.%2A%5Cbus%5C-central1%5Cb.%2A%22&maxResults=500 HTTP/1.1" 200 None
      INFO: cache collection=compute.zones api_version=v1 params=['project', 'zone']
          "selfLink": "https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-c",
          "selfLink": "https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-a",
          "selfLink": "https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-f",
          "selfLink": "https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-b",
          "selfLink": "https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci-2/zones/us-central1-ai1a",
      

      So... pretty weird. Maybe they're having some trouble with their rollout, and they're currently split-brained about whether the AI zone exists. And they also lost track of their nominal zones/list orderBy defaulting claims?
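
      Because regions describe returns resource URLs while zones list returns bare names, and neither comes back in a stable order, comparing the two is easier after normalizing. A minimal sketch, where normalize_zones is a hypothetical helper and not part of gcloud:

```shell
# Hypothetical helper: read zone resource URLs or bare zone names, one per line,
# strip everything up to the last "/", and sort the result.
normalize_zones() {
  sed 's|.*/||' | LC_ALL=C sort
}

# Example calls, reusing the gcloud and jq invocations from this report:
# gcloud --format=json compute regions describe us-central1 | jq -r '.zones[]' | normalize_zones
# gcloud --format=json compute zones list --filter name:us-central1 | jq -r '.[].name' | normalize_zones
```

      With both outputs in the same form, any remaining difference between runs comes down to whether the AI zone was included.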

      Google seems to be aware of the AI-zone-inclusion instability. If you're seeing this, check https://console.cloud.google.com/servicehealth/incidents in your GCP project and look for an incident titled "Google Compute Engine customers deploying VMs in us-central1 and us-south1 may experience Compute Engine selecting or displaying AI Zones". They mention 2026-01-23 as a possible date of initial impact, although the only Compute release notes around then are from Jan. 20 and Jan. 26, with none for the 23rd, and neither of the two bracketing release notes sounds like it is talking about the AI zones.

              padillon Patrick Dillon
              trking W. Trevor King
              Jianli Wei Jianli Wei