Task
Resolution: Done
Create an informative issue (See each section, incomplete templates/issues won't be triaged)
Using the current documentation as a model, please complete the issue template.
Note: Doc team updates the current version and the two previous versions (n-2). For earlier versions, we will address only high-priority, customer-reported issues for releases in support.
Prerequisite: Start with what we have
Always look at the current documentation to describe the change that is needed. Use the source or portal link for Step 4:
- Use the Customer Portal: https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes
- Use the GitHub link to find the staged docs in the repository: https://github.com/stolostron/rhacm-docs
Describe the changes in the doc and link to your dev story
Provide info for the following steps:
1. - [x] Mandatory: Add the required version to the Fix version/s field.
2. - [ ] Mandatory: Choose the type of documentation change.
- [x] New topic in an existing section or new section
- [ ] Update to an existing topic
3. - [ ] Mandatory for GA content:
- [ ] Add steps and/or other important conceptual information here:
- [ ] Add the required access level for the user to complete the task here:
- [ ] Add verification at the end of the task: how does the user verify success (a command to run or a result to see)?
- [ ] Add a link to the dev story here:
4. - [ ] Mandatory for bugs: What is the diff? Clearly define what the problem is, what the change is, and link to the current documentation:
We had a few questions on the hub backup Slack channel that kept repeating.
Instead of searching for the answer in Slack and pointing to it in new threads, it would be easier to have an FAQ section for Business Continuity.
The subjects I see so far for this section:
1. Can we test a hub disaster recovery failover without stopping the primary hub first?
As explained in the Preparing clusters before restoring activation data section, in a disaster scenario the primary hub must be shut down before the managed cluster data is restored on the passive hub. If the primary hub is still running, it tries to reimport the managed clusters as soon as its managed cluster reconciliation finds that those clusters are no longer available.
If this is just a test exercise and you do not want to, or cannot safely, stop the primary cluster, you can create a network policy that restricts the network instead of shutting down the primary ACM hub. This network policy restricts the primary hub cluster's access to and from the managed clusters.
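For illustration only, a minimal sketch of such a network policy, assuming the hub components you want to isolate run in the open-cluster-management namespace; the policy name, namespace, and empty pod selector are assumptions to adapt to your environment:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-primary-hub             # hypothetical name
  namespace: open-cluster-management    # assumption: namespace of the hub components to isolate
spec:
  podSelector: {}                       # select every pod in the namespace
  policyTypes:
  - Ingress
  - Egress
  ingress: []                           # no ingress rules: deny all incoming pod traffic
  egress: []                            # no egress rules: deny all outgoing pod traffic

Note that a NetworkPolicy only controls pod network traffic, so confirm it actually blocks the hub-to-managed-cluster connections in your setup before relying on it for the test.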
2. Can we roll back from the passive hub to the initial primary?
Always use a clean hub cluster for the passive hub. If the passive hub cluster has user data created on it prior to the restore operation, that hub cannot be reliably used as a passive configuration: the data on this hub does not reflect the data available in the backed up resources, and after the restore is completed there could be resources on this hub that were not part of the backup. During the hub restore operation you can require existing resources to be cleaned up if they are not part of the backup data being restored, but this only affects resource types known to the backup; any resource with a CRD not known to the hub backup component is ignored.
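A hedged sketch of that cleanup option on the Restore resource follows; the backup names and the cleanup value shown are illustrative, so verify them against the restore documentation for your version:

apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Restore
metadata:
  name: restore-acm                        # illustrative name
  namespace: open-cluster-management-backup
spec:
  cleanupBeforeRestore: CleanupRestored    # assumption: remove resources created by a previous restore that are not in this backup
  veleroManagedClustersBackupName: latest
  veleroCredentialsBackupName: latest
  veleroResourcesBackupName: latest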
There could be one situation where it is safe to use an existing hub cluster for a restore operation, and that is during a disaster recovery test. In this case you are just testing the hub backup scenario, so you want to roll back to the primary hub after the backup data has been restored on the passive hub. In this situation the initial primary has not been used to create new resources; the backup data has just temporarily changed sides, from the primary hub to the passive hub.
3. Can ACM hub data be stored on multiple object storage locations for redundancy purposes?
The OADP operator is used to back up and restore hub data. This operator supports writing to only one storage location during the backup, and that storage location is defined with the DataProtectionApplication resource. Each customer has their own configuration for the data storage, and the object storage data can be replicated with different tools depending on the type of storage used (https://velero.io/docs/v1.7/supported-providers/#s3-compatible-object-store-providers).
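For context, a hedged sketch of a DataProtectionApplication with its single backup storage location; the provider, bucket, region, and credential values are assumptions for an S3-compatible store and must be replaced with your own configuration:

apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: dpa-hub                          # illustrative name
  namespace: open-cluster-management-backup
spec:
  configuration:
    velero:
      defaultPlugins:
      - openshift
      - aws
  backupLocations:
  - velero:
      provider: aws                      # assumption: S3-compatible provider
      default: true
      objectStorage:
        bucket: my-acm-backup-bucket     # assumption: your bucket name
        prefix: hub
      config:
        region: us-east-1                # assumption: your region
      credential:
        name: cloud-credentials
        key: cloud

Any replication to a second object store happens outside of this resource, with the tooling of the storage provider.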
4. Do we have any concrete metrics for RPO/RTO regarding the failover between the two hub clusters? Do we have estimates of the storage space this backup data needs?
The tests performed on a large-scale environment show the following data for backup and restore.
Using ACM 2.8 and OADP 1.1 with 3673 SNO managed clusters ( >> for doc team: we will have different numbers for 2.9, as OADP 1.2 is used there! I will post those numbers after Oct 16, once OADP 1.2.2 is published; so these numbers are for 2.7 and 2.8 << )
Execution time for backups:
- credentials: 2m5s (18272 resources, 55MiB backup size)
- managed clusters: 3m22s (58655 resources, 38MiB backup size)
- resources: 1m34s (1190 resources, 1.7MiB backup size)
- generic (user backup): 2m56s (0 resources, 16.5KiB backup size)
Total backup time: 10m
Execution time for the passive restore on the same hub:
- credentials: 47m8s (18272 resources)
- resources: 3m10s (1190 resources)
- generic (user backup): 0m (0 resources)
Total restore time: 50m18s
Backup files are pruned using the veleroTtl option, which is set when the BackupSchedule is created. Any backups with a creation time older than the specified TTL (time to live) expire and are automatically deleted from the storage location by Velero. For example:
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: BackupSchedule
metadata:
  name: schedule-acm
  namespace: open-cluster-management-backup
spec:
  veleroSchedule: 0 */1 * * *
  veleroTtl: 120h