OpenShift Bugs / OCPBUGS-77577

Component Readiness: [Machine Config Operator] [operator-conditions] MCO incorrectly maps MachineSets to failure domains, resulting in datacenter lookup failures and installation timeouts

    • Bug Fix
      *Cause*: Faulty vCenter matching logic caused boot image update failures in multi-vCenter vSphere clusters.
      *Consequence*: Installation fails in 4.22 (boot image updates are on by default), and upgrades fail in 4.21 (if boot image updates are on).
      *Fix*: The vCenter matching logic was fixed.
      *Result*: 4.22 installs now succeed and boot image updates work as expected.

      This is a clone of issue OCPBUGS-77498. The following is the description of the original issue:

      (Feel free to update this bug's summary to be more specific.)
      Component Readiness has found a potential regression in the following test:

      verify operator conditions machine-config

      Extreme regression detected.
      Fisher's Exact probability of a regression: 100.00%.
      Test pass rate dropped from 100.00% to 40.00%.

      Sample (being evaluated) Release: 4.22
      Start Time: 2026-02-21T00:00:00Z
      End Time: 2026-02-28T04:00:00Z
      Success Rate: 40.00%
      Successes: 2
      Failures: 3
      Flakes: 0
      Base (historical) Release: 4.21
      Start Time: 2026-01-04T00:00:00Z
      End Time: 2026-02-03T23:59:59Z
      Success Rate: 100.00%
      Successes: 35
      Failures: 0
      Flakes: 0

      View the test details report for additional context.

      Filed by: jialiu@redhat.com

      =============================================

      ai-helper analysis:
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-main-nightly-4.22-e2e-vsphere-ovn-multi-vcenter/2026401746883973120

      install-config.yaml:
      platform:
        vsphere:
          vcenters:
            - server:       vcenter-1.ci.ibmc.devcluster.openshift.com
              user:         ci-user-1@ci.ibmc.devcluster.openshift.com
              datacenters:
                - cidatacenter-2
            - server:       vcenter.ci.ibmc.devcluster.openshift.com
              user:         ci-user-0@ci.ibmc.devcluster.openshift.com
              datacenters:
                - cidatacenter
                - cidatacenter-1
          failureDomains:
            - server:    vcenter-1.ci.ibmc.devcluster.openshift.com
              name:      funny_solomon
              zone:      us-central-1a
              region:    us-central
              topology:
                  resourcePool:    /cidatacenter-2/host/cicluster-3/Resources/ipi-ci-clusters
                  computeCluster:  /cidatacenter-2/host/cicluster-3
                  datacenter:      cidatacenter-2
                  datastore:       /cidatacenter-2/datastore/vsanDatastore
                  networks:
                    - ci-vlan-1108-2
            - server:    vcenter.ci.ibmc.devcluster.openshift.com
              name:      pensive_roentgen
              zone:      us-east-1a
              region:    us-east
              topology:
                  resourcePool:    /cidatacenter/host/cicluster/Resources/ipi-ci-clusters
                  computeCluster:  /cidatacenter/host/cicluster
                  datacenter:      cidatacenter
                  datastore:       /cidatacenter/datastore/vsanDatastore
                  networks:
                    - ci-vlan-1108-2
            - server:    vcenter.ci.ibmc.devcluster.openshift.com
              name:      nervous_matsumoto
              zone:      us-west-1a
              region:    us-west
              topology:
                  resourcePool:    /cidatacenter-1/host/cicluster-1/Resources/ipi-ci-clusters
                  computeCluster:  /cidatacenter-1/host/cicluster-1
                  datacenter:      cidatacenter-1
                  datastore:       /cidatacenter-1/datastore/vsanDatastore-1
                  networks:
                    - ci-vlan-1108-2

      NAMESPACE               NAME                                        PHASE     TYPE   REGION       ZONE            AGE   NODE                                        PROVIDERID                                       STATE
      openshift-machine-api   ci-op-n5bhwdhc-25cd7-ptffx-master-0         Running          us-central   us-central-1a   59m   ci-op-n5bhwdhc-25cd7-ptffx-master-0         vsphere://42237f0f-63bc-5582-0c7f-267be74acfe2   poweredOn
      openshift-machine-api   ci-op-n5bhwdhc-25cd7-ptffx-master-1         Running          us-east      us-east-1a      59m   ci-op-n5bhwdhc-25cd7-ptffx-master-1         vsphere://42108e0c-4bb7-123d-5226-da11f3cf7aef   poweredOn
      openshift-machine-api   ci-op-n5bhwdhc-25cd7-ptffx-master-2         Running          us-west      us-west-1a      59m   ci-op-n5bhwdhc-25cd7-ptffx-master-2         vsphere://4210e49e-9555-05fb-0a29-9b59a04baa09   poweredOn
      openshift-machine-api   ci-op-n5bhwdhc-25cd7-ptffx-worker-0-h4x66   Running          us-central   us-central-1a   56m   ci-op-n5bhwdhc-25cd7-ptffx-worker-0-h4x66   vsphere://4223edd9-a577-c316-647d-8e4a691f1c7f   poweredOn
      openshift-machine-api   ci-op-n5bhwdhc-25cd7-ptffx-worker-1-6dk78   Running          us-east      us-east-1a      56m   ci-op-n5bhwdhc-25cd7-ptffx-worker-1-6dk78   vsphere://42108f5b-0ae8-74e0-3c07-355afcf353fa   poweredOn
      openshift-machine-api   ci-op-n5bhwdhc-25cd7-ptffx-worker-2-849rf   Running          us-west      us-west-1a      56m   ci-op-n5bhwdhc-25cd7-ptffx-worker-2-849rf   vsphere://4210589c-0c8c-8a3c-5ec6-2a82c1927152   poweredOn 

      https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-main-nightly-4.22-e2e-vsphere-ovn-multi-vcenter/2026401746883973120/artifacts/e2e-vsphere-ovn-multi-vcenter/gather-extra/artifacts/machinesets.json

       

        Summary of the Datacenter Mismatch Bug

        What's Actually Configured in MachineSets:
        - worker-0: cidatacenter-2 (vcenter-1, template: funny_solomon)
        - worker-1: cidatacenter (vcenter, template: pensive_roentgen)
        - worker-2: cidatacenter-1 (vcenter, template: nervous_matsumoto)

        What machine-config-controller is Looking For:
        - worker-0: cidatacenter ❌ (WRONG - this is worker-1's datacenter!)
        - worker-1: cidatacenter-2 ❌ (WRONG - this is worker-0's datacenter!)
        - worker-2: cidatacenter-2 ❌ (WRONG - should be cidatacenter-1!)

        Key Finding:

        This is a critical bug in the machine-config-controller (specifically in ms_helpers.go:85 and related vSphere helper code). The controller is:

        1. Not reading the datacenter directly from the MachineSet's providerSpec.workspace.datacenter field
        2. Mis-mapping datacenters: worker-0's and worker-1's datacenters appear to be swapped
        3. Resolving worker-2 completely wrong: it uses cidatacenter-2 (funny_solomon) instead of cidatacenter-1 (nervous_matsumoto)
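
        For reference, reading the datacenter straight from a MachineSet could look roughly like the sketch below. It assumes the MachineSet JSON (as in the gathered machinesets.json) carries the vSphere provider spec under spec.template.spec.providerSpec.value. The machineSet struct, the workspaceDatacenter helper, and the sample document are hypothetical and trimmed for illustration; they are not the MCO's own types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// machineSet models only the fields needed to reach the vSphere workspace.
// Hypothetical, trimmed view; not the real Machine API type.
type machineSet struct {
	Metadata struct {
		Name string `json:"name"`
	} `json:"metadata"`
	Spec struct {
		Template struct {
			Spec struct {
				ProviderSpec struct {
					Value struct {
						Workspace struct {
							Server     string `json:"server"`
							Datacenter string `json:"datacenter"`
						} `json:"workspace"`
					} `json:"value"`
				} `json:"providerSpec"`
			} `json:"spec"`
		} `json:"template"`
	} `json:"spec"`
}

// sampleMachineSet is an illustrative document, not copied from the CI artifacts.
const sampleMachineSet = `{
  "metadata": {"name": "worker-2"},
  "spec": {"template": {"spec": {"providerSpec": {"value": {
    "workspace": {
      "server": "vcenter.ci.ibmc.devcluster.openshift.com",
      "datacenter": "cidatacenter-1"
    }
  }}}}}
}`

// workspaceDatacenter returns the MachineSet name and the datacenter
// recorded in its own provider spec, without consulting failure domains.
func workspaceDatacenter(raw []byte) (name, datacenter string, err error) {
	var ms machineSet
	if err := json.Unmarshal(raw, &ms); err != nil {
		return "", "", fmt.Errorf("decoding MachineSet: %w", err)
	}
	dc := ms.Spec.Template.Spec.ProviderSpec.Value.Workspace.Datacenter
	if dc == "" {
		return ms.Metadata.Name, "", fmt.Errorf("MachineSet %q has no workspace.datacenter", ms.Metadata.Name)
	}
	return ms.Metadata.Name, dc, nil
}

func main() {
	name, dc, err := workspaceDatacenter([]byte(sampleMachineSet))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s -> %s\n", name, dc)
}
```

        Comparing this value with the datacenter the controller logs for the same MachineSet is one way to confirm a mismatch like the one listed above.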

        Root Cause Hypothesis:

        The machine-config-controller may be:
        - Reading failure domain information from the Infrastructure object
        - Incorrectly mapping failure domains to MachineSets (perhaps by index or name sorting)
        - Using cached/stale datacenter information
        - Hitting a bug in the multi-vCenter failure domain resolution logic

        This explains why the installation failed: the controller was trying to verify datacenters in vSphere that don't match what's actually configured in the MachineSets, causing all three worker MachineSets to be marked as degraded.

       

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-main-nightly-4.22-e2e-vsphere-ovn-multi-vcenter/2027488878226575360

      ● This job has the exact same root cause.

        Here's the proof:

        Failure Domain Configuration:
        - funny_solomon: cidatacenter-2
        - nervous_matsumoto: cidatacenter-1
        - pensive_roentgen: cidatacenter

        Actual MachineSet Configurations (CORRECT):
        - worker-0: cidatacenter-2 + template funny_solomon ✓
        - worker-1: cidatacenter-1 + template nervous_matsumoto ✓
        - worker-2: cidatacenter + template pensive_roentgen ✓

        What machine-config-controller Looked For (WRONG):
        - worker-0: cidatacenter-1 ❌ (should be cidatacenter-2)
        - worker-1: cidatacenter-2 ❌ (should be cidatacenter-1)
        - worker-2: cidatacenter-2 ❌ (should be cidatacenter)

        —
        Root Cause Summary

        Both jobs failed due to the SAME BUG in /root/machine-config-operator/pkg/controller/bootimage/vsphere_helpers.go lines 460-464:

        The Buggy Code:
        if providerSpec.Workspace.Datastore != failureDomain.Topology.Datastore &&
            vcenter.Server != failureDomain.Server &&
            providerSpec.Workspace.VMGroup != vmGroup &&
            path.Clean(providerSpec.Workspace.ResourcePool) == path.Clean(failureDomain.Topology.ResourcePool) {
            continue
        }

        Critical Issues:
        1. Inverted logic: Uses != instead of ==
        2. Missing datacenter check: Never checks if datacenters match!
        3. Wrong behavior: Skips matching failure domains instead of using them
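
        To see why this condition cannot filter anything here, the sketch below evaluates it against the three failure domains from the install-config above. The workspace and failureDomain structs are simplified, hypothetical stand-ins for the real API types. The worker-2 values are assumed to mirror the nervous_matsumoto failure domain, and the empty VM group is also an assumption. What the real loop does after the continue is not shown in this issue, so only the skip condition is modeled.

```go
package main

import (
	"fmt"
	"path"
)

// Simplified, hypothetical stand-ins for the provider spec workspace and
// the Infrastructure failure domain (topology fields flattened).
type workspace struct {
	Server, Datacenter, Datastore, ResourcePool, VMGroup string
}

type failureDomain struct {
	Name, Server, Datacenter, Datastore, ResourcePool string
}

const (
	vcenter0 = "vcenter.ci.ibmc.devcluster.openshift.com"
	vcenter1 = "vcenter-1.ci.ibmc.devcluster.openshift.com"
)

// Failure domains as listed in the install-config in this issue.
var failureDomains = []failureDomain{
	{"funny_solomon", vcenter1, "cidatacenter-2", "/cidatacenter-2/datastore/vsanDatastore", "/cidatacenter-2/host/cicluster-3/Resources/ipi-ci-clusters"},
	{"pensive_roentgen", vcenter0, "cidatacenter", "/cidatacenter/datastore/vsanDatastore", "/cidatacenter/host/cicluster/Resources/ipi-ci-clusters"},
	{"nervous_matsumoto", vcenter0, "cidatacenter-1", "/cidatacenter-1/datastore/vsanDatastore-1", "/cidatacenter-1/host/cicluster-1/Resources/ipi-ci-clusters"},
}

// buggySkip mirrors the quoted condition: a failure domain is skipped only
// when datastore, server, and VM group all differ AND the resource pool matches.
func buggySkip(ws workspace, fd failureDomain, vcenterServer, vmGroup string) bool {
	return ws.Datastore != fd.Datastore &&
		vcenterServer != fd.Server &&
		ws.VMGroup != vmGroup &&
		path.Clean(ws.ResourcePool) == path.Clean(fd.ResourcePool)
}

// skippedDomains lists the failure domains the quoted condition would skip.
func skippedDomains(ws workspace, vcenterServer, vmGroup string) []string {
	var skipped []string
	for _, fd := range failureDomains {
		if buggySkip(ws, fd, vcenterServer, vmGroup) {
			skipped = append(skipped, fd.Name)
		}
	}
	return skipped
}

func main() {
	// Assumed worker-2 workspace, mirroring nervous_matsumoto.
	worker2 := workspace{
		Server:       vcenter0,
		Datacenter:   "cidatacenter-1",
		Datastore:    "/cidatacenter-1/datastore/vsanDatastore-1",
		ResourcePool: "/cidatacenter-1/host/cicluster-1/Resources/ipi-ci-clusters",
	}
	fmt.Println("skipped failure domains:", len(skippedDomains(worker2, worker2.Server, "")))
}
```

        Under these assumptions no failure domain is skipped, including the two that belong to other datacenters, so the condition does not narrow the candidates at all.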

        Introduced in:
        - Commit: edc881df57
        - Author: RishabhSaini
        - Date: July 2, 2025
        - Message: "platform_helpers: Add support for vSphere"

        This bug causes the machine-config-controller to incorrectly map MachineSets to failure domains, resulting in datacenter lookup failures and installation timeouts. Both Prow jobs are victims of this same bug.

       ================================================================================
       THE BUG
       ================================================================================

       File: pkg/controller/bootimage/vsphere_helpers.go
       Lines: 460-464
       Component: machine-config-operator boot image controller

       BUGGY CODE:
       -----------
       if providerSpec.Workspace.Datastore != failureDomain.Topology.Datastore &&
           vcenter.Server != failureDomain.Server &&
           providerSpec.Workspace.VMGroup != vmGroup &&
           path.Clean(providerSpec.Workspace.ResourcePool) == path.Clean(failureDomain.Topology.ResourcePool) {
           continue
       }

       CORRECT CODE SHOULD BE:
       -----------------------
       if providerSpec.Workspace.Datacenter == failureDomain.Topology.Datacenter &&
           providerSpec.Workspace.Datastore == failureDomain.Topology.Datastore &&
           vcenter.Server == failureDomain.Server &&
           providerSpec.Workspace.VMGroup == vmGroup &&
           path.Clean(providerSpec.Workspace.ResourcePool) == path.Clean(failureDomain.Topology.ResourcePool) {
           // Found match - proceed with this failure domain
       }

       THREE CRITICAL ERRORS:
       1. Missing datacenter check (MOST CRITICAL)
       2. Inverted operators (!= instead of ==)
       3. Wrong control flow (continue instead of executing on match)
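
       A positive-match lookup along the lines of the proposed condition might be sketched as follows. This is not the actual fix that was merged. The workspace and failureDomain structs and the findFailureDomain helper are hypothetical and simplified: the workspace's own server is compared instead of a separate vcenter value, the VM group check is left out, and the worker workspaces are assumed to mirror the failure domains from the install-config.

```go
package main

import (
	"fmt"
	"path"
)

// Simplified, hypothetical stand-ins for the real API types.
type workspace struct {
	Server, Datacenter, Datastore, ResourcePool string
}

type failureDomain struct {
	Name, Server, Datacenter, Datastore, ResourcePool string
}

const (
	vcenter0 = "vcenter.ci.ibmc.devcluster.openshift.com"
	vcenter1 = "vcenter-1.ci.ibmc.devcluster.openshift.com"
)

// Failure domains as listed in the install-config in this issue.
var failureDomains = []failureDomain{
	{"funny_solomon", vcenter1, "cidatacenter-2", "/cidatacenter-2/datastore/vsanDatastore", "/cidatacenter-2/host/cicluster-3/Resources/ipi-ci-clusters"},
	{"pensive_roentgen", vcenter0, "cidatacenter", "/cidatacenter/datastore/vsanDatastore", "/cidatacenter/host/cicluster/Resources/ipi-ci-clusters"},
	{"nervous_matsumoto", vcenter0, "cidatacenter-1", "/cidatacenter-1/datastore/vsanDatastore-1", "/cidatacenter-1/host/cicluster-1/Resources/ipi-ci-clusters"},
}

// findFailureDomain returns the first failure domain whose server,
// datacenter, datastore, and resource pool all match the workspace.
// It returns an error instead of falling back to a non-matching domain.
func findFailureDomain(ws workspace, fds []failureDomain) (failureDomain, error) {
	for _, fd := range fds {
		if ws.Server == fd.Server &&
			ws.Datacenter == fd.Datacenter &&
			ws.Datastore == fd.Datastore &&
			path.Clean(ws.ResourcePool) == path.Clean(fd.ResourcePool) {
			return fd, nil
		}
	}
	return failureDomain{}, fmt.Errorf("no failure domain matches datacenter %q on %q", ws.Datacenter, ws.Server)
}

func main() {
	// Assumed workspaces for the three worker MachineSets of the first job.
	workers := []struct {
		name string
		ws   workspace
	}{
		{"worker-0", workspace{vcenter1, "cidatacenter-2", "/cidatacenter-2/datastore/vsanDatastore", "/cidatacenter-2/host/cicluster-3/Resources/ipi-ci-clusters"}},
		{"worker-1", workspace{vcenter0, "cidatacenter", "/cidatacenter/datastore/vsanDatastore", "/cidatacenter/host/cicluster/Resources/ipi-ci-clusters"}},
		{"worker-2", workspace{vcenter0, "cidatacenter-1", "/cidatacenter-1/datastore/vsanDatastore-1", "/cidatacenter-1/host/cicluster-1/Resources/ipi-ci-clusters"}},
	}
	for _, w := range workers {
		fd, err := findFailureDomain(w.ws, failureDomains)
		if err != nil {
			fmt.Printf("%s: %v\n", w.name, err)
			continue
		}
		fmt.Printf("%s -> %s (%s)\n", w.name, fd.Name, fd.Datacenter)
	}
}
```

       Returning an error when nothing matches keeps a lookup failure visible instead of letting an unrelated failure domain be used silently.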

              djoshy David Joshy
              openshift-trt OpenShift Technical Release Team
              Sergio Regidor de la Rosa Sergio Regidor de la Rosa