-
Story
-
Resolution: Done
-
Major
-
Global Hub 1.6.0
-
Product / Portfolio Work
-
2
-
False
-
-
False
-
-
-
-
GH Train-32
-
Important
-
None
Value Statement
As a cluster admin, I want the migration workflow to recover gracefully from manager or agent restarts so that active migrations can continue without getting stuck or causing inconsistent states.
related document: https://docs.google.com/document/d/1ORgdwDZ_LDqoJTm8Gt0QnW20UiwN-QUuy-rlXE5kQPk/edit?tab=t.0#heading=h.mup3andmb0xk
Description:
The migration system currently maintains in-memory state (MigrationStatus) to track hub-phase progress and cluster mappings. If a manager or agent restarts during migration, this state is lost, potentially leading to:
- In-memory State Loss: Migration phases appear stuck due to missing HubState or Clusters context.
- Event Confirmation Missing: Initialization, deployment, and registration phases may not receive agent confirmations, causing duplicate or conflicting operations.
- Agent Restart Impact: Restarted agents lose migration context, resulting in incomplete resource cleanup.
- Coordination Issues: Manager and agents may become out of sync, leading to failed rollbacks.
Definition of Done for Engineering Story Owner (Checklist)
**
- Migration progress is restored from a persistent state store after any manager or agent restart.
- The system replays or reconciles pending events to ensure no duplicate or missing actions.
- Rollback operations function correctly even if restarts occur mid-migration.
- Implement idempotent operations for all migration actions to handle event replays safely.
- Agent restarts do not leave orphaned resources or partially migrated clusters.
- Logs clearly indicate when recovery and resynchronization have occurred.
Development Complete
- The code is complete.
- Functionality is working.
- Any required downstream Docker file changes are made.
Tests Automated
- [x] Unit/function tests have been automated and incorporated into the
build. - [x] 100% automated unit/function test coverage for new or changed APIs.
Secure Design
[ ] Security has been assessed and incorporated into your threat model.
Multidisciplinary Teams Readiness
[ ] Create an informative documentation issue using the Customer
Portal Doc template that you can access from [The Playbook](
and ensure doc acceptance criteria is met.
Call out this sentence as it's own action:
[ ] Link the development issue to the doc issue.
Support Readiness
[ ] The must-gather script has been updated.
- clones
-
ACM-21769 As a cluster admin, I can see that the migration can be still in progress if the manager or agent is restarted
-
- Closed
-