Uploaded image for project: 'Red Hat Advanced Cluster Management'
  1. Red Hat Advanced Cluster Management
  2. ACM-23008

As a cluster admin, I want migration workflow recovery from manager restarts to prevent stuck

XMLWordPrintable

    • GH Train-32
    • Important
    • None

      Value Statement

      As a cluster admin, I want the migration workflow to recover gracefully from manager or agent restarts so that active migrations can continue without getting stuck or causing inconsistent states.

      related document: https://docs.google.com/document/d/1ORgdwDZ_LDqoJTm8Gt0QnW20UiwN-QUuy-rlXE5kQPk/edit?tab=t.0#heading=h.mup3andmb0xk

      Description:

      The migration system currently maintains in-memory state (MigrationStatus) to track hub-phase progress and cluster mappings. If a manager or agent restarts during migration, this state is lost, potentially leading to:

      • In-memory State Loss: Migration phases appear stuck due to missing HubState or Clusters context.
      • Event Confirmation Missing: Initialization, deployment, and registration phases may not receive agent confirmations, causing duplicate or conflicting operations.
      • Agent Restart Impact: Restarted agents lose migration context, resulting in incomplete resource cleanup.
      • Coordination Issues: Manager and agents may become out of sync, leading to failed rollbacks.

      Definition of Done for Engineering Story Owner (Checklist)

      **

      • Migration progress is restored from a persistent state store after any manager or agent restart.
      • The system replays or reconciles pending events to ensure no duplicate or missing actions.
      • Rollback operations function correctly even if restarts occur mid-migration.
      • Implement idempotent operations for all migration actions to handle event replays safely.
      • Agent restarts do not leave orphaned resources or partially migrated clusters.
      • Logs clearly indicate when recovery and resynchronization have occurred.

      Development Complete

      • The code is complete.
      • Functionality is working.
      • Any required downstream Docker file changes are made.

      Tests Automated

      • [x] Unit/function tests have been automated and incorporated into the
        build.
      • [x] 100% automated unit/function test coverage for new or changed APIs.

      Secure Design

      • [ ] Security has been assessed and incorporated into your threat model.

      Multidisciplinary Teams Readiness

      • [ ] Create an informative documentation issue using the Customer

      Portal Doc template that you can access from [The Playbook](

      https://docs.google.com/document/d/1YTqpZRH54Bnn4WJ2nZmjaCoiRtqmrc2w6DdQxe_yLZ8/edit#heading=h.9fvyr2rdriby),

      and ensure doc acceptance criteria is met.

      • Call out this sentence as it's own action:
      • [ ] Link the development issue to the doc issue.

      Support Readiness

      • [ ] The must-gather script has been updated.

              rh-ee-myan Meng Yan
              clyang82 Chunlin Yang
              Hui Chen Hui Chen
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

                Created:
                Updated:
                Resolved: