Red Hat Advanced Cluster Management / ACM-11961

[2.10] [RDR] [Hub recovery] Auto import of managed clusters remains stuck on switching hubs


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: ACM 2.10.4
    • Affects Version/s: ACM 2.10.0
    • Component/s: Business Continuity
    • Labels: None
    • Severity: Critical
    • No

      Description of problem:

      Related to https://issues.redhat.com/browse/ACM-11926

      Issue:
      The auto import operation doesn't work (it runs too early) when cleanupBeforeRestore is set to None on the ACM Restore resource. The auto import may also not complete properly when cleanupBeforeRestore is set to CleanupRestore but the cleanup operation finishes before the managed clusters backup is fully restored.

      Workaround:

      1. Create the ACM Restore resource (acm-restore) and wait for all Velero restore resources to show as Completed (cleanupBeforeRestore can be set to either None or CleanupRestore).
      2. Delete the acm-restore resource.
      3. Create a new Restore resource (same name or a different one); since all resources are already restored, the post-restore operation that runs the auto import of the managed clusters can complete for all clusters. A sketch of this sequence follows below.
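
      A minimal sketch of the workaround in Go, assuming a controller-runtime client whose scheme has the Velero API types registered; the function name, polling interval, and use of an unstructured object for the ACM Restore are illustrative only, not an official tool:

      // Sketch of the workaround described above; not the actual operator code.
      package workaround

      import (
          "context"
          "time"

          velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
          "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
          "sigs.k8s.io/controller-runtime/pkg/client"
      )

      const ns = "open-cluster-management-backup"

      // runWorkaround creates the ACM Restore (passed as unstructured, using the
      // GVK from the manifests below), waits for all Velero restores to reach
      // Completed, then deletes and recreates the ACM Restore so the auto import
      // runs with everything already restored.
      func runWorkaround(ctx context.Context, c client.Client, acmRestore *unstructured.Unstructured) error {
          // Step 1: create the ACM Restore resource.
          if err := c.Create(ctx, acmRestore); err != nil {
              return err
          }
          // ...and wait until every Velero restore in the namespace is Completed.
          for {
              restores := velerov1.RestoreList{}
              if err := c.List(ctx, &restores, client.InNamespace(ns)); err != nil {
                  return err
              }
              done := len(restores.Items) > 0
              for i := range restores.Items {
                  done = done && restores.Items[i].Status.Phase == velerov1.RestorePhaseCompleted
              }
              if done {
                  break
              }
              time.Sleep(10 * time.Second)
          }
          // Step 2: delete the ACM Restore resource (in practice, also wait for
          // the deletion to finish before recreating).
          if err := c.Delete(ctx, acmRestore); err != nil {
              return err
          }
          // Step 3: recreate it; the post-restore auto import can now complete.
          fresh := acmRestore.DeepCopy()
          fresh.SetResourceVersion("")
          fresh.SetUID("")
          return c.Create(ctx, fresh)
      }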
      The fix:
      The ACM restore controller must check that all Velero restore resources are completed, and only then run the post-restore operation, which is: cleaning up delta resources, followed by the auto import operation for the managed clusters.
      The list of Velero restores must be refreshed when checking the overall status of the ACM restore; otherwise only the first created Velero restore (the credentials restore) is validated, and the post-restore operation starts as soon as the credentials backup is restored.
      The issue is visible when cleanupBeforeRestore is set to None on the ACM restore. In this case the ACM restore state is set to Finished as soon as the credentials restore completes, and since the post-restore step doesn't call the delta cleanup, the auto import operation (which would otherwise run after the resources cleanup) executes before the managed clusters are restored, so it does nothing.
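
      A minimal sketch of the gate the fix describes, continuing in the same package as the sketch above; the label used to find the child Velero restores is hypothetical, and this is not the actual cluster-backup-operator code:

      // restoreNameLabel is a hypothetical label tying child Velero restores to
      // the ACM Restore that created them; the real operator tracks this
      // association differently.
      const restoreNameLabel = "cluster.open-cluster-management.io/restore-name"

      // readyForPostRestore re-lists the Velero restores on every check, so the
      // status is no longer derived from the first (credentials) restore alone,
      // and reports true only when all of them are Completed.
      func readyForPostRestore(ctx context.Context, c client.Client, ns, acmRestoreName string) (bool, error) {
          restores := velerov1.RestoreList{}
          // Refresh the full list each time; restores created after the
          // credentials one (resources, generic resources, managed clusters)
          // must also be seen.
          if err := c.List(ctx, &restores, client.InNamespace(ns),
              client.MatchingLabels{restoreNameLabel: acmRestoreName}); err != nil {
              return false, err
          }
          if len(restores.Items) == 0 {
              return false, nil // nothing created yet
          }
          for i := range restores.Items {
              if restores.Items[i].Status.Phase != velerov1.RestorePhaseCompleted {
                  return false, nil // e.g. the managed clusters restore still running
              }
          }
          return true, nil // safe to run the delta cleanup and then the auto import
      }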

      How to reproduce the issue:

      1. Have a backup hub with one managed cluster; enable MSA and run a schedule to create backups.
      2. On a new hub, create a restore of everything, with cleanup set to None (see the restore-acm manifest below).
      3. If the issue is reproduced, the status of restore-acm doesn't show the messages info (post restore was executed but no managed clusters were found and processed); that is, status messages such as the following are missing:

      messages:
      - managed cluster amagrawa-c1-28my already available
      - Created auto-import-secret for (amagrawa-c2-my28)
      The restore-acm resource:

      apiVersion: cluster.open-cluster-management.io/v1beta1
      kind: Restore
      metadata:
        name: restore-acm
        namespace: open-cluster-management-backup
      spec:
        cleanupBeforeRestore: None
        veleroManagedClustersBackupName: latest
        veleroCredentialsBackupName: latest
        veleroResourcesBackupName: latest

      A restore where this issue is reproduced (see the restore status missing the import section):

      apiVersion: cluster.open-cluster-management.io/v1beta1
      kind: Restore
      metadata:
        creationTimestamp: "2024-05-30T15:47:04Z"
        generation: 1
        name: restore-acm
        namespace: open-cluster-management-backup
        resourceVersion: "5977827"
        uid: 10f5a1e9-587e-481f-86d5-0c098f4b0950
      spec:
        cleanupBeforeRestore: None
        veleroCredentialsBackupName: latest
        veleroManagedClustersBackupName: latest
        veleroResourcesBackupName: latest
      status:
        lastMessage: All Velero restores have run successfully
        phase: Finished
        veleroCredentialsRestoreName: restore-acm-acm-credentials-schedule-20240530153937-active
        veleroGenericResourcesRestoreName: restore-acm-acm-resources-generic-schedule-20240530153937
        veleroManagedClustersRestoreName: restore-acm-acm-managed-clusters-schedule-20240530153937
        veleroResourcesRestoreName: restore-acm-acm-resources-schedule-20240530153937
      The restore status should be:

      apiVersion: cluster.open-cluster-management.io/v1beta1
      kind: Restore
      metadata:
        creationTimestamp: "2024-05-30T14:01:08Z"
        generation: 1
        name: restore-acm
        namespace: open-cluster-management-backup
        resourceVersion: "2822219"
        uid: 31d13ca1-1af2-4e90-87c3-c95ad756683d
      spec:
        cleanupBeforeRestore: None
        veleroCredentialsBackupName: latest
        veleroManagedClustersBackupName: latest
        veleroResourcesBackupName: latest
      status:
        lastMessage: All Velero restores have run successfully
        messages:
        - managed cluster amagrawa-c1-28my already available
        - Created auto-import-secret for (amagrawa-c2-my28)
        phase: Finished
        veleroCredentialsRestoreName: restore-acm-acm-credentials-schedule-20240530120055-active
        veleroGenericResourcesRestoreName: restore-acm-acm-resources-generic-schedule-20240530120055
        veleroManagedClustersRestoreName: restore-acm-acm-managed-clusters-schedule-20240530120055
        veleroResourcesRestoreName: restore-acm-acm-resources-schedule-20240530120055

        Version-Release number of selected component (if applicable):

      ACM 2.10.0

        How reproducible:

      Always

      Steps to Reproduce:

      1. Have a backup hub with one managed cluster; enable MSA and run a schedule to create backups.
      2. On a new hub, create a restore of everything, with cleanup set to None (see the restore-acm manifest above).
      3. If the issue is reproduced, the status of restore-acm doesn't show the messages info (post restore was executed but no managed clusters were found and processed).

      Actual results:

      The auto import of the managed clusters does not run after the restore; the restore-acm status is missing the auto import messages and the managed clusters remain stuck on the new hub.

      Expected results:

      The restore-acm status shows the auto import messages and all managed clusters are imported on the new hub.

      Additional info:

            Assignee: Valentina Birsan (vbirsan@redhat.com)
            Reporter: Valentina Birsan (vbirsan@redhat.com)
            Thuy Nguyen
            Votes: 0
            Watchers: 4
