Bug
Resolution: Unresolved
Major
None
4.22
This is a clone of issue OCPBUGS-77498. The following is the description of the original issue:
—
Component Readiness has found a potential regression in the following test:
verify operator conditions machine-config
Extreme regression detected.
Fisher's Exact probability of a regression: 100.00%.
Test pass rate dropped from 100.00% to 40.00%.
Sample (being evaluated) Release: 4.22
Start Time: 2026-02-21T00:00:00Z
End Time: 2026-02-28T04:00:00Z
Success Rate: 40.00%
Successes: 2
Failures: 3
Flakes: 0
Base (historical) Release: 4.21
Start Time: 2026-01-04T00:00:00Z
End Time: 2026-02-03T23:59:59Z
Success Rate: 100.00%
Successes: 35
Failures: 0
Flakes: 0
View the test details report for additional context.
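The regression probability above is based on Fisher's exact test over the pass/fail counts. How Component Readiness derives its figure is not shown in this report. As a rough sketch of the general technique only, the following Go program computes a one-sided Fisher's exact p-value from the sample counts (2 successes, 3 failures) and the base counts (35 successes, 0 failures). The function names are illustrative assumptions, and the result is not guaranteed to reproduce the reported 100.00%.

```go
package main

import (
	"fmt"
	"math"
)

// logChoose returns ln(C(n, k)) via the log-gamma function.
func logChoose(n, k int) float64 {
	a, _ := math.Lgamma(float64(n + 1))
	b, _ := math.Lgamma(float64(k + 1))
	c, _ := math.Lgamma(float64(n - k + 1))
	return a - b - c
}

// hypergeomPMF is the probability that exactly x of the sampleRuns runs fail
// when totalFailures failures are spread at random over totalRuns runs.
func hypergeomPMF(x, sampleRuns, totalFailures, totalRuns int) float64 {
	return math.Exp(logChoose(totalFailures, x) +
		logChoose(totalRuns-totalFailures, sampleRuns-x) -
		logChoose(totalRuns, sampleRuns))
}

// fisherUpperTail (hypothetical name) returns the one-sided p-value: the
// probability of seeing at least sampleFail failures in the sample if the
// sample and base releases shared the same underlying failure rate.
func fisherUpperTail(sampleFail, sampleSucc, baseFail, baseSucc int) float64 {
	sampleRuns := sampleFail + sampleSucc
	totalFailures := sampleFail + baseFail
	totalRuns := sampleRuns + baseFail + baseSucc

	maxFail := sampleRuns
	if totalFailures < maxFail {
		maxFail = totalFailures
	}

	p := 0.0
	for x := sampleFail; x <= maxFail; x++ {
		p += hypergeomPMF(x, sampleRuns, totalFailures, totalRuns)
	}
	return p
}

func main() {
	// Counts taken from the report: sample 4.22 and base 4.21.
	p := fisherUpperTail(3, 2, 0, 35)
	fmt.Printf("one-sided p-value: %.6f\n", p)
	fmt.Printf("1 - p: %.2f%%\n", (1-p)*100)
}
```

A small p-value here means the sample's failures are unlikely to be noise relative to the base release, which is the sense in which the report flags a regression.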
Filed by: jialiu@redhat.com
=============================================
ai-helper analysis:
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-main-nightly-4.22-e2e-vsphere-ovn-multi-vcenter/2026401746883973120
install-config.yaml:
platform:
  vsphere:
    vcenters:
    - server: vcenter-1.ci.ibmc.devcluster.openshift.com
      user: ci-user-1@ci.ibmc.devcluster.openshift.com
      datacenters:
      - cidatacenter-2
    - server: vcenter.ci.ibmc.devcluster.openshift.com
      user: ci-user-0@ci.ibmc.devcluster.openshift.com
      datacenters:
      - cidatacenter
      - cidatacenter-1
    failureDomains:
    - server: vcenter-1.ci.ibmc.devcluster.openshift.com
      name: funny_solomon
      zone: us-central-1a
      region: us-central
      topology:
        resourcePool: /cidatacenter-2/host/cicluster-3/Resources/ipi-ci-clusters
        computeCluster: /cidatacenter-2/host/cicluster-3
        datacenter: cidatacenter-2
        datastore: /cidatacenter-2/datastore/vsanDatastore
        networks:
        - ci-vlan-1108-2
    - server: vcenter.ci.ibmc.devcluster.openshift.com
      name: pensive_roentgen
      zone: us-east-1a
      region: us-east
      topology:
        resourcePool: /cidatacenter/host/cicluster/Resources/ipi-ci-clusters
        computeCluster: /cidatacenter/host/cicluster
        datacenter: cidatacenter
        datastore: /cidatacenter/datastore/vsanDatastore
        networks:
        - ci-vlan-1108-2
    - server: vcenter.ci.ibmc.devcluster.openshift.com
      name: nervous_matsumoto
      zone: us-west-1a
      region: us-west
      topology:
        resourcePool: /cidatacenter-1/host/cicluster-1/Resources/ipi-ci-clusters
        computeCluster: /cidatacenter-1/host/cicluster-1
        datacenter: cidatacenter-1
        datastore: /cidatacenter-1/datastore/vsanDatastore-1
        networks:
        - ci-vlan-1108-2
NAMESPACE              NAME                                       PHASE    TYPE  REGION      ZONE           AGE  NODE                                       PROVIDERID                                      STATE
openshift-machine-api  ci-op-n5bhwdhc-25cd7-ptffx-master-0        Running        us-central  us-central-1a  59m  ci-op-n5bhwdhc-25cd7-ptffx-master-0        vsphere://42237f0f-63bc-5582-0c7f-267be74acfe2  poweredOn
openshift-machine-api  ci-op-n5bhwdhc-25cd7-ptffx-master-1        Running        us-east     us-east-1a     59m  ci-op-n5bhwdhc-25cd7-ptffx-master-1        vsphere://42108e0c-4bb7-123d-5226-da11f3cf7aef  poweredOn
openshift-machine-api  ci-op-n5bhwdhc-25cd7-ptffx-master-2        Running        us-west     us-west-1a     59m  ci-op-n5bhwdhc-25cd7-ptffx-master-2        vsphere://4210e49e-9555-05fb-0a29-9b59a04baa09  poweredOn
openshift-machine-api  ci-op-n5bhwdhc-25cd7-ptffx-worker-0-h4x66  Running        us-central  us-central-1a  56m  ci-op-n5bhwdhc-25cd7-ptffx-worker-0-h4x66  vsphere://4223edd9-a577-c316-647d-8e4a691f1c7f  poweredOn
openshift-machine-api  ci-op-n5bhwdhc-25cd7-ptffx-worker-1-6dk78  Running        us-east     us-east-1a     56m  ci-op-n5bhwdhc-25cd7-ptffx-worker-1-6dk78  vsphere://42108f5b-0ae8-74e0-3c07-355afcf353fa  poweredOn
openshift-machine-api  ci-op-n5bhwdhc-25cd7-ptffx-worker-2-849rf  Running        us-west     us-west-1a     56m  ci-op-n5bhwdhc-25cd7-ptffx-worker-2-849rf  vsphere://4210589c-0c8c-8a3c-5ec6-2a82c1927152  poweredOn
Summary of the Datacenter Mismatch Bug
What's Actually Configured in MachineSets:
- worker-0: cidatacenter-2 (vcenter-1, template: funny_solomon)
- worker-1: cidatacenter (vcenter, template: pensive_roentgen)
- worker-2: cidatacenter-1 (vcenter, template: nervous_matsumoto) ✓
What machine-config-controller is Looking For:
- worker-0: cidatacenter ❌ (WRONG - this is worker-1's datacenter!)
- worker-1: cidatacenter-2 ❌ (WRONG - this is worker-0's datacenter!)
- worker-2: cidatacenter-2 ❌ (WRONG - should be cidatacenter-1!)
Key Finding:
This is a critical bug in the machine-config-controller (specifically in ms_helpers.go:85 and related vSphere helper code). The controller:
1. Does not read the datacenter directly from the MachineSet's providerSpec.workspace.datacenter field
2. Mis-maps datacenters: the datacenters for worker-0 and worker-1 appear to be swapped
3. Resolves the wrong datacenter for worker-2: it uses cidatacenter-2 (funny_solomon) instead of cidatacenter-1 (nervous_matsumoto)
Root Cause Hypothesis:
The machine-config-controller may be:
- Reading failure domain information from the Infrastructure object
- Incorrectly mapping failure domains to MachineSets (perhaps by index or name sorting)
- Using cached/stale datacenter information
- Hitting a bug in its multi-vCenter failure domain resolution logic
This explains why the installation failed: the controller tried to verify datacenters in vSphere that do not match those configured in the MachineSets, so all three worker MachineSets were marked as degraded.
A second failing job shows the exact same root cause.
The evidence:
Failure Domain Configuration:
- funny_solomon: cidatacenter-2
- nervous_matsumoto: cidatacenter-1
- pensive_roentgen: cidatacenter
Actual MachineSet Configurations (CORRECT):
- worker-0: cidatacenter-2 + template funny_solomon ✓
- worker-1: cidatacenter-1 + template nervous_matsumoto ✓
- worker-2: cidatacenter + template pensive_roentgen ✓
What machine-config-controller Looked For (WRONG):
- worker-0: cidatacenter-1 ❌ (should be cidatacenter-2)
- worker-1: cidatacenter-2 ❌ (should be cidatacenter-1)
- worker-2: cidatacenter-2 ❌ (should be cidatacenter)
—
Root Cause Summary
Both jobs failed due to the SAME BUG in /root/machine-config-operator/pkg/controller/bootimage/vsphere_helpers.go lines 460-464:
The Buggy Code:
if providerSpec.Workspace.Datastore != failureDomain.Topology.Datastore &&
vcenter.Server != failureDomain.Server &&
providerSpec.Workspace.VMGroup != vmGroup &&
path.Clean(providerSpec.Workspace.ResourcePool) == path.Clean(failureDomain.Topology.ResourcePool)
Critical Issues:
1. Inverted logic: Uses != instead of ==
2. Missing datacenter check: Never checks if datacenters match!
3. Wrong behavior: Skips matching failure domains instead of using them
Introduced in:
- Commit: edc881df57
- Author: RishabhSaini
- Date: July 2, 2025
- Message: "platform_helpers: Add support for vSphere"
This bug causes the machine-config-controller to incorrectly map MachineSets to failure domains, resulting in datacenter lookup failures and installation timeouts. Both Prow jobs are victims of this same bug.
================================================================================
THE BUG
================================================================================
File: pkg/controller/bootimage/vsphere_helpers.go
Lines: 460-464
Component: machine-config-operator boot image controller
BUGGY CODE:
-----------
if providerSpec.Workspace.Datastore != failureDomain.Topology.Datastore &&
vcenter.Server != failureDomain.Server &&
providerSpec.Workspace.VMGroup != vmGroup &&
path.Clean(providerSpec.Workspace.ResourcePool) == path.Clean(failureDomain.Topology.ResourcePool)
CORRECT CODE SHOULD BE:
-----------------------
if providerSpec.Workspace.Datacenter == failureDomain.Topology.Datacenter &&
    providerSpec.Workspace.Datastore == failureDomain.Topology.Datastore &&
    vcenter.Server == failureDomain.Server &&
    providerSpec.Workspace.VMGroup == vmGroup &&
    path.Clean(providerSpec.Workspace.ResourcePool) == path.Clean(failureDomain.Topology.ResourcePool) {
    // Found match - proceed with this failure domain
}
THREE CRITICAL ERRORS:
1. Missing datacenter check (MOST CRITICAL)
2. Inverted operators (!= instead of ==)
3. Wrong control flow (continue instead of executing on match)
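To make the difference between the two conditions concrete, here is a minimal, self-contained Go sketch. It is not the operator's code: Workspace, FailureDomain, and all function names are simplified, hypothetical stand-ins, vcenter.Server is folded into the workspace's Server field, vmGroup is folded into the failure domain, and each MachineSet workspace is assumed to mirror its own failure domain's topology from the install-config above. The control flow around the real expression is not shown in this report, so both expressions are treated here as plain "is this a match?" predicates.

```go
package main

import (
	"fmt"
	"path"
)

// Workspace and FailureDomain are hypothetical, trimmed-down stand-ins that
// keep only the fields compared in the report.
type Workspace struct {
	Server, Datacenter, Datastore, ResourcePool, VMGroup string
}

type FailureDomain struct {
	Name, Server, Datacenter, Datastore, ResourcePool, VMGroup string
}

// quotedCondition mirrors the expression reported as buggy, term for term.
func quotedCondition(ws Workspace, fd FailureDomain) bool {
	return ws.Datastore != fd.Datastore &&
		ws.Server != fd.Server &&
		ws.VMGroup != fd.VMGroup &&
		path.Clean(ws.ResourcePool) == path.Clean(fd.ResourcePool)
}

// proposedCondition mirrors the proposed fix: every field must be equal,
// including the datacenter.
func proposedCondition(ws Workspace, fd FailureDomain) bool {
	return ws.Datacenter == fd.Datacenter &&
		ws.Datastore == fd.Datastore &&
		ws.Server == fd.Server &&
		ws.VMGroup == fd.VMGroup &&
		path.Clean(ws.ResourcePool) == path.Clean(fd.ResourcePool)
}

// findFailureDomain returns the first failure domain accepted by match.
func findFailureDomain(ws Workspace, fds []FailureDomain, match func(Workspace, FailureDomain) bool) (FailureDomain, bool) {
	for _, fd := range fds {
		if match(ws, fd) {
			return fd, true
		}
	}
	return FailureDomain{}, false
}

// failureDomains reproduces the three failure domains from the install-config.
func failureDomains() []FailureDomain {
	const vc0 = "vcenter.ci.ibmc.devcluster.openshift.com"
	const vc1 = "vcenter-1.ci.ibmc.devcluster.openshift.com"
	return []FailureDomain{
		{Name: "funny_solomon", Server: vc1, Datacenter: "cidatacenter-2",
			Datastore:    "/cidatacenter-2/datastore/vsanDatastore",
			ResourcePool: "/cidatacenter-2/host/cicluster-3/Resources/ipi-ci-clusters"},
		{Name: "pensive_roentgen", Server: vc0, Datacenter: "cidatacenter",
			Datastore:    "/cidatacenter/datastore/vsanDatastore",
			ResourcePool: "/cidatacenter/host/cicluster/Resources/ipi-ci-clusters"},
		{Name: "nervous_matsumoto", Server: vc0, Datacenter: "cidatacenter-1",
			Datastore:    "/cidatacenter-1/datastore/vsanDatastore-1",
			ResourcePool: "/cidatacenter-1/host/cicluster-1/Resources/ipi-ci-clusters"},
	}
}

// workspaceFor builds a workspace that mirrors fd (an assumption of this sketch).
func workspaceFor(fd FailureDomain) Workspace {
	return Workspace{Server: fd.Server, Datacenter: fd.Datacenter,
		Datastore: fd.Datastore, ResourcePool: fd.ResourcePool, VMGroup: fd.VMGroup}
}

func main() {
	fds := failureDomains()
	for i, own := range fds {
		ws := workspaceFor(own)
		_, quotedOK := findFailureDomain(ws, fds, quotedCondition)
		fixed, fixedOK := findFailureDomain(ws, fds, proposedCondition)
		fmt.Printf("worker-%d: quoted condition matched=%v, proposed condition matched=%v (%s)\n",
			i, quotedOK, fixedOK, fixed.Name)
	}
}
```

Under these assumptions the quoted expression cannot accept a genuinely matching failure domain, because it requires the datastore and server to differ, while the all-equal form selects the failure domain whose datacenter matches the workspace.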
- clones: OCPBUGS-77498 Component Readiness: [Machine Config Operator] [operator-conditions] MCO incorrectly map MachineSets to failure domains, resulting in datacenter lookup failures and installation timeouts (ON_QA)
- is blocked by: OCPBUGS-77498 Component Readiness: [Machine Config Operator] [operator-conditions] MCO incorrectly map MachineSets to failure domains, resulting in datacenter lookup failures and installation timeouts (ON_QA)
- links to