- Bug
- Resolution: Unresolved
- Normal
- None
- 4.22
- None
Summary
When a MachineSet's authoritativeAPI is switched from MachineAPI to ClusterAPI, there is a race window in which the MachineSetSyncController can overwrite changes made to the CAPI MachineSet (e.g., scaling replicas). The controller chooses its sync direction from status.authoritativeAPI, which the MachineSetMigrationController has not yet updated at that point.
Impact
Any modification made to the CAPI MachineSet during the migration window (after spec.authoritativeAPI is updated but before status.authoritativeAPI is updated) is reverted by the MachineSetSyncController. This causes:
- Scale operations to be silently reverted
- Potential for other spec changes to be lost
- Confusing behavior for users/operators managing MachineSets during migration
Root Cause
The MachineSetSyncController determines sync direction based on mapiMachineSet.Status.AuthoritativeAPI (line 264 in machineset_sync_controller.go):
    authoritativeAPI := mapiMachineSet.Status.AuthoritativeAPI
    switch {
    case authoritativeAPI == mapiv1beta1.MachineAuthorityMachineAPI:
        return r.reconcileMAPIMachineSetToCAPIMachineSet(ctx, mapiMachineSet, capiMachineSet) // MAPI → CAPI
    case authoritativeAPI == mapiv1beta1.MachineAuthorityClusterAPI && capiMachineSet != nil:
        return r.reconcileCAPIMachineSetToMAPIMachineSet(ctx, capiMachineSet, mapiMachineSet) // CAPI → MAPI
However, status.authoritativeAPI is updated only as the final step of the MachineSetMigrationController's sequence, in which it:
- Waits for the old authoritative resource to be paused
- Unpauses the new authoritative resource
- Updates status.authoritativeAPI
During this window, the MachineSetSyncController continues syncing MAPI→CAPI, overwriting any CAPI changes with MAPI values.
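One possible mitigation, sketched here as an assumption rather than as the agreed fix, is for the sync controller to treat a mismatch between spec.authoritativeAPI and status.authoritativeAPI as "migration in progress" and skip syncing. The types and the `syncDirection` function below are hypothetical stand-ins, not the real controller code or the real `mapiv1beta1` API:

```go
package main

import "fmt"

// Authority is a hypothetical stand-in for the authoritativeAPI values
// named in this report.
type Authority string

const (
	MachineAPI Authority = "MachineAPI"
	ClusterAPI Authority = "ClusterAPI"
	Migrating  Authority = "Migrating"
)

// Direction describes which way a sync would copy the spec.
type Direction string

const (
	SyncMAPIToCAPI Direction = "MAPI->CAPI"
	SyncCAPIToMAPI Direction = "CAPI->MAPI"
	SyncNone       Direction = "none"
)

// syncDirection is a hypothetical guard: it refuses to sync while the
// requested authority (spec) differs from the acknowledged one (status),
// which is the window in which the overwrite described above happens.
func syncDirection(spec, status Authority, capiExists bool) Direction {
	if spec != status {
		// Migration requested but not completed: do not overwrite either side.
		return SyncNone
	}
	switch {
	case status == MachineAPI:
		return SyncMAPIToCAPI
	case status == ClusterAPI && capiExists:
		return SyncCAPIToMAPI
	default:
		return SyncNone
	}
}

func main() {
	fmt.Println(syncDirection(MachineAPI, MachineAPI, true)) // MAPI->CAPI
	fmt.Println(syncDirection(ClusterAPI, MachineAPI, true)) // none
	fmt.Println(syncDirection(ClusterAPI, Migrating, true))  // none
	fmt.Println(syncDirection(ClusterAPI, ClusterAPI, true)) // CAPI->MAPI
}
```

An open question with this approach is that the MachineSetMigrationController logs "Authoritative machine set and its copy are not synchronized yet, will retry later", so it appears to wait for the two resources to be in sync before switching. A guard that simply stops syncing could stall the migration if the resources diverge during the window, so the two controllers would need to agree on how that case is resolved.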
Timeline from Failing Test
Test: [sig-cluster-lifecycle][OCPFeatureGate:MachineAPIMigration] MachineSet Migration CAPI Authoritative Tests Delete MachineSets when removing non-authoritative MAPI MachineSet shouldn't delete its authoritative CAPI MachineSet
| Timestamp | Event | Actor |
|---|---|---|
| 20:59:52.786 | Test updates MAPI MachineSet spec.authoritativeAPI: ClusterAPI | e2e test |
| 20:59:52.825 | Test scales CAPI MachineSet spec.replicas: 1 → 2 | e2e test |
| 20:59:52.858 | "Authoritative machine set and its copy are not synchronized yet, will retry later" | MachineSetMigrationController |
| 20:59:52.873 | "Changes detected for CAPI machine set. Updating it" with diff: .[spec].[replicas]: 2 != 1 | MachineSetSyncController |
| 20:59:52.901 | Failed to update CAPI MachineSet (conflict error) | MachineSetSyncController |
| 20:59:52.923 | Retry: "Changes detected for CAPI machine set. Updating it" with diff: .[spec].[replicas]: 2 != 1 | MachineSetSyncController |
| 20:59:53.022 | "Successfully updated CAPI machine set" — Overwrote replicas from 2 → 1 | MachineSetSyncController |
| 20:59:53.065 | "Detected migration request for machine set" | MachineSetMigrationController |
| 20:59:53.066 | Setting AuthoritativeAPI status to Migrating | MachineSetMigrationController |
| 20:59:53.191 | Setting AuthoritativeAPI status to ClusterAPI | MachineSetMigrationController |
| 20:59:53.214 | "Machine set authority switch has now been completed and the resource unpaused" | MachineSetMigrationController |
Key observation: The SyncController successfully overwrote CAPI's spec.replicas from 2 back to 1 at 20:59:53.022, which is ~200ms before the MigrationController completed the switch at 20:59:53.214.
Steps to Reproduce
- Create a MAPI MachineSet with spec.authoritativeAPI: MachineAPI and replicas: 1
- Wait for CAPI MachineSet mirror to be created
- Update MAPI MachineSet spec.authoritativeAPI: ClusterAPI
- Immediately scale CAPI MachineSet to replicas: 2
- Observe CAPI MachineSet spec.replicas being reverted to 1
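The steps above can be replayed against a small in-memory model, which may also serve as a starting point for a unit test. This is a sketch under simplifying assumptions: the struct types and `syncOnce` are hypothetical and only mimic the status-based direction choice described under Root Cause, not the real controllers or API types.

```go
package main

import "fmt"

// mapiMachineSet and capiMachineSet are hypothetical, heavily reduced
// stand-ins for the real resources.
type mapiMachineSet struct {
	SpecAuthority   string
	StatusAuthority string
	Replicas        int32
}

type capiMachineSet struct {
	Replicas int32
}

// syncOnce models one sync pass that picks its direction from
// status.authoritativeAPI only, as described under Root Cause.
func syncOnce(m *mapiMachineSet, c *capiMachineSet) {
	switch m.StatusAuthority {
	case "MachineAPI":
		c.Replicas = m.Replicas // MAPI -> CAPI
	case "ClusterAPI":
		m.Replicas = c.Replicas // CAPI -> MAPI
	}
}

// reproduce replays the reproduction steps and returns the CAPI replica
// count that is left after the sync pass inside the migration window.
func reproduce() int32 {
	// Step 1: MAPI MachineSet, MachineAPI authoritative, replicas: 1.
	m := &mapiMachineSet{SpecAuthority: "MachineAPI", StatusAuthority: "MachineAPI", Replicas: 1}
	c := &capiMachineSet{}

	// Step 2: the CAPI mirror is created from the MAPI resource.
	syncOnce(m, c)

	// Step 3: request migration; status has not been updated yet.
	m.SpecAuthority = "ClusterAPI"

	// Step 4: immediately scale the CAPI MachineSet.
	c.Replicas = 2

	// Step 5: a sync pass runs before the migration controller updates status.
	syncOnce(m, c)
	return c.Replicas
}

func main() {
	fmt.Println(reproduce()) // prints 1: the scale to 2 was reverted
}
```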
Artifacts
Related Code
- pkg/controllers/machinesetsync/machineset_sync_controller.go:260-284 - Sync direction logic
- pkg/controllers/machinesetmigration/machineset_migration_controller.go:176-199 - Migration completion logic
- pkg/conversion/mapi2capi/machineset.go:50 - Replicas conversion
Acceptance Criteria
- [ ] CAPI MachineSet spec changes made during migration are not overwritten by MachineSetSyncController
- [ ] Add unit test covering the race condition scenario
- [ ] Existing e2e tests pass without requiring explicit waits for migration completion before scaling