OpenShift Bugs / OCPBUGS-63411

Race condition prevents retries for CAPI Infra Machine creation when authoritativeAPI is ClusterAPI


    • Quality / Stability / Reliability
    • CLOUD Sprint 279

      Description of problem:

      A race condition exists in the `cluster-capi-operator`'s machine synchronization logic when a MAPI Machine is manually created with `.spec.authoritativeAPI: ClusterAPI`.

      If the operator successfully creates the initial CAPI Machine (core) but encounters an error or is interrupted before the corresponding CAPI Infra Machine is fully created, the controller fails to re-enter the reconciliation path required to create the Infra Machine on subsequent runs.

      This occurs because the prerequisite checks leading to `reconcileCAPIMachinetoMAPIMachine` conclude that a CAPI Machine exists as soon as the core CAPI Machine is present. The reconciliation flow therefore exits prematurely and never retries creation of the missing Infra Machine, which leaves the MAPI Machine and the cluster in an inconsistent state.
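The failure mode described above can be sketched in a few lines of Go. This is a simplified model, not code from `cluster-capi-operator`: the `machineState` type and the `currentCheck` function are hypothetical names introduced for illustration, and the real controller works with API lookups rather than booleans.

```go
package main

import "fmt"

// machineState is a hypothetical, simplified view of what the controller
// observes for one MAPI Machine. It is not a type from cluster-capi-operator.
type machineState struct {
	coreMachineExists  bool // the CAPI Machine (core) was found
	infraMachineExists bool // the CAPI Infra Machine was found
}

// currentCheck models the behaviour described in this report: only the core
// CAPI Machine is consulted when deciding whether anything is missing.
func currentCheck(s machineState) (capiMachineNotFound bool) {
	return !s.coreMachineExists
}

func main() {
	// State after an interrupted first reconcile: core created, infra missing.
	interrupted := machineState{coreMachineExists: true, infraMachineExists: false}

	// The check reports "found", so the creation path is skipped even though
	// the Infra Machine is still missing.
	fmt.Println("create path entered:", currentCheck(interrupted))
}
```

In this model the interrupted state is indistinguishable from a fully created machine, which is why no retry is ever attempted.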

      Affected Area: `pkg/controllers/machinesync/machine_sync_controller.go`

      Version-Release number of selected component (if applicable):

          

      How reproducible:

       

      Steps to Reproduce:

      1. The user creates a MAPI Machine resource.
      2. The user sets `.spec.authoritativeAPI: ClusterAPI` on it.
      3. The CAPI Machine (core) is created successfully.
      4. Infra Machine creation fails or is interrupted due to a temporary error.
      5. The operator stops trying to create the Infra Machine.

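      Actual results:

      The operator never retries creation of the Infra Machine, leaving the MAPI Machine and the cluster in an inconsistent state.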

      Expected results:

      The operator retries creation of the missing Infra Machine on subsequent reconciles.

      Additional info:

      Proposed solution

      The condition check for `capiMachineNotFound` should be broadened so that the reconciliation path is triggered whenever any required CAPI component (core Machine or Infra Machine) is missing, though possibly only when the CAPI Machine is known to have been created from the MAPI Machine.

      Suggested change to the prerequisite logic (or similar): `capiMachineNotFound := capiCoreMachineNotFound || capiInfraMachineNotFound`

      This will ensure that the `reconcileCAPIMachinetoMAPIMachine` function runs reliably when the CAPI Machine structure is incomplete, allowing the Infra Machine creation to be retried.
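A minimal sketch of the broadened check follows, covering both the literal suggestion and the optional guard. The `observed` type, its `createdFromMAPI` field, and both function names are assumptions made for illustration; how the controller would actually determine that a CAPI Machine originated from a MAPI Machine is not specified in this report.

```go
package main

import "fmt"

// observed is a hypothetical summary of the lookups for one MAPI Machine.
type observed struct {
	coreNotFound  bool // CAPI Machine (core) lookup returned not-found
	infraNotFound bool // CAPI Infra Machine lookup returned not-found
	// createdFromMAPI is an assumption: some marker telling us the existing
	// CAPI Machine was generated from the MAPI Machine.
	createdFromMAPI bool
}

// suggestedCheck mirrors the one-line change proposed in this report.
func suggestedCheck(o observed) bool {
	return o.coreNotFound || o.infraNotFound
}

// guardedCheck is one possible reading of the "only when we know the CAPI
// Machine was created from the MAPI machine" caveat: a missing Infra Machine
// alone triggers a retry only if the core Machine came from the MAPI Machine.
func guardedCheck(o observed) bool {
	if o.coreNotFound {
		return true
	}
	return o.infraNotFound && o.createdFromMAPI
}

func main() {
	cases := []observed{
		{coreNotFound: true, infraNotFound: true},
		{infraNotFound: true, createdFromMAPI: true},
		{infraNotFound: true, createdFromMAPI: false},
		{createdFromMAPI: true},
	}
	for _, c := range cases {
		fmt.Printf("%+v -> suggested=%t guarded=%t\n", c, suggestedCheck(c), guardedCheck(c))
	}
}
```

The two variants differ only when the core Machine exists, the Infra Machine is missing, and the core Machine was not created from the MAPI Machine. Whether that case should trigger creation is the open question raised in the proposal.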

              rh-ee-cschlott Christian Schlotter
              Zhaohua Sun Zhaohua Sun