OpenShift Installer / CORS-4339

vSphere worker MachineSets generated with swapped RHCOS template names across failure domains

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • openshift-4.22
    • Installer Core
    • Important

      On a multi-datacenter vSphere IPI installation with two failure domains, the installer generates worker MachineSet objects whose RHCOS template names are swapped between datacenters. Every worker machine creation fails immediately with "template not found, specify valid value". The same install-config.yaml produces a working cluster on 4.21.0-0.nightly-2026-02-18-135253 but fails on 4.22.0-0.ci-2026-02-17-214607.


      Regression

      Image                                                                     Result
      registry.ci.openshift.org/ocp/release:4.21.0-0.nightly-2026-02-18-135253  Cluster installs successfully, workers provision correctly
      registry.ci.openshift.org/ocp/release:4.22.0-0.ci-2026-02-17-214607       Workers fail: MachineSet templates are swapped across failure domains

      Environment

      Field                 Value
      Failing build         4.22.0-0.ci-2026-02-17-214607
      Working build         4.21.0-0.nightly-2026-02-18-135253
      Platform              vSphere IPI
      vCenter               VMware vCenter Server 9.1.0
      Failure domain count  2 (separate datacenters, separate clusters)
      Cluster name          rbednarnested

      install-config.yaml (same file used for both builds)

      baseDomain: devqe.ibmc.devcluster.openshift.com
      compute:
      - name: worker
        replicas: 3
        platform:
          vsphere:
            zones:
            - us-east-1
            - us-west-1
      controlPlane:
        name: master
        replicas: 3
        platform:
          vsphere:
            zones:
            - us-east-1
            - us-west-1
      metadata:
        name: rbednarnested
      platform:
        vsphere:
          apiVIPs:
          - 10.184.15.134
          ingressVIPs:
          - 10.184.15.135
          failureDomains:
          - name: us-east-1
            region: us-east-1
            zone: us-east-1a
            server: 232-15-184-10.in-addr.arpa
            topology:
              computeCluster: "/nested-devqedatacenter-1/host/nested-devqecluster-1"
              datacenter: nested-devqedatacenter-1
              datastore: "/nested-devqedatacenter-1/datastore/dsnested"
              networks:
              - VM Network
          - name: us-west-1
            region: us-west-1
            zone: us-west-1a
            server: 232-15-184-10.in-addr.arpa
            topology:
              computeCluster: "/nested-devqedatacenter-2/host/nested-devqecluster-2"
              datacenter: nested-devqedatacenter-2
              datastore: "/nested-devqedatacenter-2/datastore/dsnested"
              networks:
              - VM Network
          vcenters:
          - server: 232-15-184-10.in-addr.arpa
            datacenters:
            - nested-devqedatacenter-1
            - nested-devqedatacenter-2
      

      What the Installer Did Correctly

      The installer uploaded the RHCOS OVA to each datacenter under the correct name:

      Importing OVA rbednarnested-6jz6s-rhcos-us-east-1 into failure domain us-east-1
        → uploaded to: /nested-devqedatacenter-1/vm/rbednarnested-6jz6s/rbednarnested-6jz6s-rhcos-us-east-1
      
      Importing OVA rbednarnested-6jz6s-rhcos-us-west-1 into failure domain us-west-1
        → uploaded to: /nested-devqedatacenter-2/vm/rbednarnested-6jz6s/rbednarnested-6jz6s-rhcos-us-west-1
      

      Both templates exist in vSphere and are properly marked as templates (config.template: true). vSphere zone and region tags are correctly applied:

      Object                    Tag
      nested-devqedatacenter-1  us-east-1 (region)
      nested-devqedatacenter-2  us-west-1 (region)
      nested-devqecluster-1     us-east-1a (zone)
      nested-devqecluster-2     us-west-1a (zone)

      What the Installer Got Wrong — Worker MachineSets

      The generated worker MachineSet objects have the RHCOS template names swapped relative to their workspace datacenter:

      $ oc get machineset rbednarnested-6jz6s-worker-0 -n openshift-machine-api -o jsonpath='Template: {.spec.template.spec.providerSpec.value.template}{"\n"}Datacenter: {.spec.template.spec.providerSpec.value.workspace.datacenter}{"\n"}'
      
      Template:   rbednarnested-6jz6s-rhcos-us-west-1   ← WRONG (us-west-1 template in us-east-1 datacenter)
      Datacenter: nested-devqedatacenter-1
      
      $ oc get machineset rbednarnested-6jz6s-worker-1 -n openshift-machine-api -o jsonpath='Template: {.spec.template.spec.providerSpec.value.template}{"\n"}Datacenter: {.spec.template.spec.providerSpec.value.workspace.datacenter}{"\n"}'
      
      Template:   rbednarnested-6jz6s-rhcos-us-east-1   ← WRONG (us-east-1 template in us-west-1 datacenter)
      Datacenter: nested-devqedatacenter-2
      

      Expected versus actual template assignment:

      MachineSet  Datacenter                            Expected template  Actual template
      worker-0    nested-devqedatacenter-1 (us-east-1)  rhcos-us-east-1    rhcos-us-west-1
      worker-1    nested-devqedatacenter-2 (us-west-1)  rhcos-us-west-1    rhcos-us-east-1
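
The comparison in the table can be automated. The following is a minimal sketch, not code from the installer or machine-api: find_mismatches, FAILURE_DOMAINS, and MACHINESETS are hypothetical names, and the template naming pattern "<infra ID>-rhcos-<failure domain name>" is an assumption inferred from the OVA import log in this report. The data is copied from the install-config.yaml and the oc output above.

```python
# Hypothetical helper for illustration; not part of the installer or machine-api.

# Failure domain name -> datacenter, as declared in install-config.yaml.
FAILURE_DOMAINS = {
    "us-east-1": "nested-devqedatacenter-1",
    "us-west-1": "nested-devqedatacenter-2",
}

# (MachineSet name, providerSpec template, workspace datacenter) as observed.
MACHINESETS = [
    ("rbednarnested-6jz6s-worker-0",
     "rbednarnested-6jz6s-rhcos-us-west-1",
     "nested-devqedatacenter-1"),
    ("rbednarnested-6jz6s-worker-1",
     "rbednarnested-6jz6s-rhcos-us-east-1",
     "nested-devqedatacenter-2"),
]


def find_mismatches(infra_id, failure_domains, machinesets):
    """Return (machineset, expected_template, actual_template) per mismatch.

    Assumption: templates are named "<infra_id>-rhcos-<failure domain name>",
    the pattern visible in the OVA import log.
    """
    # Build datacenter -> expected template name.
    expected_by_dc = {
        dc: f"{infra_id}-rhcos-{fd}" for fd, dc in failure_domains.items()
    }
    mismatches = []
    for name, template, datacenter in machinesets:
        expected = expected_by_dc.get(datacenter)
        if template != expected:
            mismatches.append((name, expected, template))
    return mismatches


if __name__ == "__main__":
    found = find_mismatches("rbednarnested-6jz6s", FAILURE_DOMAINS, MACHINESETS)
    for name, expected, actual in found:
        print(f"{name}: expected {expected}, got {actual}")
```

With the values from this report, both worker MachineSets are returned as mismatches. The sketch assumes one failure domain per datacenter, which holds for this install-config.yaml.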

      Observed Failure

      Because the machine-api searches for the template by name scoped to the workspace datacenter, every worker machine creation fails immediately:

      $ oc describe machine rbednarnested-6jz6s-worker-0-9h9pt -n openshift-machine-api
      
        Error Message:  template not found, specify valid value
        Error Reason:   InvalidConfiguration
        Phase:          Failed
      

      All 3 worker Machine objects reach Failed phase. No worker nodes join the cluster. Operators depending on workers (ingress, authentication, monitoring, console) remain degraded.
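
Why a swapped name leads to "template not found" can be illustrated with a simplified model of a datacenter-scoped lookup. This is only a sketch under stated assumptions: find_template and INVENTORY are hypothetical names, the real machine-api lookup is not reproduced here, and the two inventory paths are the upload locations reported in the OVA import log.

```python
# Simplified model for illustration; this is not the machine-api implementation.

# Inventory paths of the two templates, as reported in the OVA import log.
INVENTORY = {
    "/nested-devqedatacenter-1/vm/rbednarnested-6jz6s/"
    "rbednarnested-6jz6s-rhcos-us-east-1",
    "/nested-devqedatacenter-2/vm/rbednarnested-6jz6s/"
    "rbednarnested-6jz6s-rhcos-us-west-1",
}


def find_template(inventory, datacenter, template):
    """Return the inventory path of `template` inside `datacenter`, or None.

    Only paths under "/<datacenter>/" are considered, which models a
    name search scoped to the workspace datacenter.
    """
    prefix = f"/{datacenter}/"
    for path in sorted(inventory):
        if path.startswith(prefix) and path.rsplit("/", 1)[-1] == template:
            return path
    return None


if __name__ == "__main__":
    # worker-0 as generated by the failing build: us-west-1 template,
    # looked up in the us-east-1 datacenter.
    print(find_template(INVENTORY, "nested-devqedatacenter-1",
                        "rbednarnested-6jz6s-rhcos-us-west-1"))
    # The expected pairing for worker-0.
    print(find_template(INVENTORY, "nested-devqedatacenter-1",
                        "rbednarnested-6jz6s-rhcos-us-east-1"))
```

In this model the swapped pairing resolves to nothing even though both templates exist in vCenter, because each template lives only in its own datacenter.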


      Confirmed Working in 4.21

      The same install-config.yaml against 4.21.0-0.nightly-2026-02-18-135253 produces MachineSets with correctly matched template names and workers provision successfully.

      Assignee: Unassigned
      Reporter: Roman Bednar (rbednar@redhat.com)