OpenShift Bugs / OCPBUGS-74571

CCAPIO: MachineSetSyncController can overwrite CAPI MachineSet spec during authoritativeAPI migration due to race condition


      Summary

      When a MachineSet's authoritativeAPI is switched from MachineAPI to ClusterAPI, there is a race window in which the MachineSetSyncController can overwrite changes made to the CAPI MachineSet (e.g., scaling replicas). The controller uses status.authoritativeAPI to determine sync direction, and the MachineSetMigrationController has not yet updated that field when the window opens.

      Impact

      Any modification to the CAPI MachineSet made during the migration window (after spec.authoritativeAPI is updated but before status.authoritativeAPI is updated) is reverted by the MachineSetSyncController. This causes:

      • Scale operations to be silently reverted
      • Potential for other spec changes to be lost
      • Confusing behavior for users/operators managing MachineSets during migration

      Root Cause

      The MachineSetSyncController determines sync direction based on mapiMachineSet.Status.AuthoritativeAPI (line 264 in machineset_sync_controller.go):
       
       

      authoritativeAPI := mapiMachineSet.Status.AuthoritativeAPI

      switch {
      case authoritativeAPI == mapiv1beta1.MachineAuthorityMachineAPI:
          // MAPI is authoritative: sync MAPI → CAPI.
          return r.reconcileMAPIMachineSetToCAPIMachineSet(ctx, mapiMachineSet, capiMachineSet)
      case authoritativeAPI == mapiv1beta1.MachineAuthorityClusterAPI && capiMachineSet != nil:
          // CAPI is authoritative and the CAPI MachineSet exists: sync CAPI → MAPI.
          return r.reconcileCAPIMachineSetToMAPIMachineSet(ctx, capiMachineSet, mapiMachineSet)
      // ... remaining cases omitted from this excerpt
      }

      However, status.authoritativeAPI is only updated by the MachineSetMigrationController after it:

      1. Waits for the old authoritative resource to be paused
      2. Unpauses the new authoritative resource
      3. Updates status.authoritativeAPI

      During this window, the MachineSetSyncController continues syncing MAPI→CAPI, overwriting any CAPI changes with MAPI values.
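The overwrite can be illustrated with a small in-memory model. This is a minimal sketch, not the operator's code: the `Authority` type, the `machineSetPair` struct, and the `syncOnce`, `completeMigration`, and `simulateRace` functions are hypothetical stand-ins that only mimic the status-driven direction choice described above.

```go
package main

import "fmt"

// Authority is a local stand-in for the real authoritativeAPI values;
// it is NOT the mapiv1beta1 type.
type Authority string

const (
	MachineAPI Authority = "MachineAPI"
	ClusterAPI Authority = "ClusterAPI"
)

// machineSetPair is a hypothetical, heavily simplified model of a MAPI
// MachineSet and its CAPI mirror.
type machineSetPair struct {
	SpecAuthority   Authority // models spec.authoritativeAPI (desired)
	StatusAuthority Authority // models status.authoritativeAPI (observed)
	MAPIReplicas    int32
	CAPIReplicas    int32
}

// syncOnce models one sync pass that picks its direction from status only.
func syncOnce(p *machineSetPair) {
	switch p.StatusAuthority {
	case MachineAPI:
		p.CAPIReplicas = p.MAPIReplicas // MAPI -> CAPI
	case ClusterAPI:
		p.MAPIReplicas = p.CAPIReplicas // CAPI -> MAPI
	}
}

// completeMigration models the migration controller's final step:
// status catches up with spec.
func completeMigration(p *machineSetPair) {
	p.StatusAuthority = p.SpecAuthority
}

// simulateRace replays the sequence from the report and returns the CAPI
// replica count after the scale, after the racing sync, and after migration.
func simulateRace() (afterScale, afterSync, afterMigration int32) {
	p := &machineSetPair{
		SpecAuthority:   MachineAPI,
		StatusAuthority: MachineAPI,
		MAPIReplicas:    1,
		CAPIReplicas:    1,
	}

	p.SpecAuthority = ClusterAPI // user requests the switch
	p.CAPIReplicas = 2           // user scales the CAPI MachineSet
	afterScale = p.CAPIReplicas

	syncOnce(p) // status still says MachineAPI, so MAPI -> CAPI runs
	afterSync = p.CAPIReplicas

	completeMigration(p)
	syncOnce(p) // now CAPI -> MAPI, but the scale is already lost
	afterMigration = p.CAPIReplicas
	return afterScale, afterSync, afterMigration
}

func main() {
	a, b, c := simulateRace()
	fmt.Printf("after scale=%d, after racing sync=%d, after migration=%d\n", a, b, c)
}
```

In this model the scale to 2 is visible only until the first sync pass; once the migration completes, both sides agree on the stale value of 1.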

      Timeline from Failing Test

      Test: [sig-cluster-lifecycle][OCPFeatureGate:MachineAPIMigration] MachineSet Migration CAPI Authoritative Tests Delete MachineSets when removing non-authoritative MAPI MachineSet shouldn't delete its authoritative CAPI MachineSet

      Timestamp | Event | Actor
      20:59:52.786 | Test updates MAPI MachineSet spec.authoritativeAPI: ClusterAPI | e2e test
      20:59:52.825 | Test scales CAPI MachineSet spec.replicas: 1 → 2 | e2e test
      20:59:52.858 | "Authoritative machine set and its copy are not synchronized yet, will retry later" | MachineSetMigrationController
      20:59:52.873 | "Changes detected for CAPI machine set. Updating it" with diff: .[spec].[replicas]: 2 != 1 | MachineSetSyncController
      20:59:52.901 | Failed to update CAPI MachineSet (conflict error) | MachineSetSyncController
      20:59:52.923 | Retry: "Changes detected for CAPI machine set. Updating it" with diff: .[spec].[replicas]: 2 != 1 | MachineSetSyncController
      20:59:53.022 | "Successfully updated CAPI machine set" (overwrote replicas from 2 → 1) | MachineSetSyncController
      20:59:53.065 | "Detected migration request for machine set" | MachineSetMigrationController
      20:59:53.066 | Setting AuthoritativeAPI status to Migrating | MachineSetMigrationController
      20:59:53.191 | Setting AuthoritativeAPI status to ClusterAPI | MachineSetMigrationController
      20:59:53.214 | "Machine set authority switch has now been completed and the resource unpaused" | MachineSetMigrationController

       
       
      Key observation: The SyncController successfully overwrote CAPI's spec.replicas from 2 back to 1 at 20:59:53.022, which is ~200ms before the MigrationController completed the switch at 20:59:53.214.


      Steps to Reproduce

      1. Create a MAPI MachineSet with spec.authoritativeAPI: MachineAPI and replicas: 1
      2. Wait for CAPI MachineSet mirror to be created
      3. Update MAPI MachineSet spec.authoritativeAPI: ClusterAPI
      4. Immediately scale CAPI MachineSet to replicas: 2
      5. Observe CAPI MachineSet spec.replicas being reverted to 1
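Steps 3 to 5 could be driven from the `oc` CLI. The sketch below only builds and prints the command lines; it does not run them. Several things here are assumptions not stated in this report: the namespaces `openshift-machine-api` and `openshift-cluster-api`, the fully qualified resource names, the idea that the CAPI mirror shares the MAPI MachineSet's name, and the placeholder name `example-machineset`. The `reproCommands` function is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// reproCommands returns oc invocations for steps 3-5 as argv slices.
// Namespaces, resource names, and the shared MachineSet name are
// assumptions; adjust them to the cluster under test.
func reproCommands(name string) [][]string {
	const (
		mapiResource = "machinesets.machine.openshift.io"
		mapiNS       = "openshift-machine-api" // assumed MAPI namespace
		capiResource = "machinesets.cluster.x-k8s.io"
		capiNS       = "openshift-cluster-api" // assumed CAPI namespace
	)
	return [][]string{
		// Step 3: request the authority switch on the MAPI MachineSet.
		{"oc", "patch", mapiResource, name, "-n", mapiNS, "--type=merge",
			"-p", `{"spec":{"authoritativeAPI":"ClusterAPI"}}`},
		// Step 4: immediately scale the CAPI MachineSet.
		{"oc", "scale", capiResource, name, "-n", capiNS, "--replicas=2"},
		// Step 5: read back spec.replicas to see whether it was reverted.
		{"oc", "get", capiResource, name, "-n", capiNS,
			"-o", "jsonpath={.spec.replicas}"},
	}
}

func main() {
	// Printed for inspection only; the JSON patch needs shell quoting
	// if these lines are pasted into a terminal.
	for _, argv := range reproCommands("example-machineset") {
		fmt.Println(strings.Join(argv, " "))
	}
}
```

Because the window in the failing test was well under a second, steps 3 and 4 would need to run back to back, for example from a script rather than by hand.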

      Artifacts

      Related Code

      • pkg/controllers/machinesetsync/machineset_sync_controller.go:260-284 - Sync direction logic
      • pkg/controllers/machinesetmigration/machineset_migration_controller.go:176-199 - Migration completion logic
      • pkg/conversion/mapi2capi/machineset.go:50 - Replicas conversion

      Acceptance Criteria

      • [ ] CAPI MachineSet spec changes made during migration are not overwritten by MachineSetSyncController
      • [ ] Add unit test covering the race condition scenario
      • [ ] Existing e2e tests pass without requiring explicit waits for migration completion before scaling
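One way the first two criteria might be approached is to make the direction decision depend on both spec and status, and to cover that decision with table-driven cases. The sketch below is an illustration under stated assumptions, not the fix adopted for this bug: `chooseDirection`, the `Direction` type, and the local `Authority` constants are hypothetical and do not exist in the operator.

```go
package main

import "fmt"

// Authority is a local stand-in for the authoritativeAPI values named in
// this report; it is NOT the mapiv1beta1 type.
type Authority string

const (
	MachineAPI Authority = "MachineAPI"
	ClusterAPI Authority = "ClusterAPI"
	Migrating  Authority = "Migrating"
)

// Direction describes what a sync pass would do. Hypothetical type.
type Direction string

const (
	MAPIToCAPI Direction = "MAPI->CAPI"
	CAPIToMAPI Direction = "CAPI->MAPI"
	NoSync     Direction = "skip"
)

// chooseDirection is a hypothetical guard: it syncs only when spec and
// status agree, and skips while a switch is requested or in flight.
func chooseDirection(spec, status Authority) Direction {
	if spec != status {
		return NoSync
	}
	switch status {
	case MachineAPI:
		return MAPIToCAPI
	case ClusterAPI:
		return CAPIToMAPI
	default:
		return NoSync
	}
}

func main() {
	cases := []struct {
		name         string
		spec, status Authority
		want         Direction
	}{
		{"steady state, MAPI authoritative", MachineAPI, MachineAPI, MAPIToCAPI},
		{"steady state, CAPI authoritative", ClusterAPI, ClusterAPI, CAPIToMAPI},
		{"switch requested, status not yet updated", ClusterAPI, MachineAPI, NoSync},
		{"switch in progress", ClusterAPI, Migrating, NoSync},
	}
	for _, tc := range cases {
		got := chooseDirection(tc.spec, tc.status)
		fmt.Printf("%-42s got=%-10s ok=%t\n", tc.name, got, got == tc.want)
	}
}
```

A caveat on this design: the 20:59:52.858 log line suggests the MachineSetMigrationController waits for the authoritative MachineSet and its copy to be synchronized before switching, so a guard that simply skips syncing could stall a migration whose copy is not yet up to date. A real fix would have to handle that interaction.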

              rh-ee-nbrubake Nolan Brubaker
              ddonati@redhat.com Damiano Donati
              Zhaohua Sun