Task
Resolution: Unresolved
Normal
ACM 2.14.0, ACM 2.15.0
Quality / Stability / Reliability
7
False
False
None
1. - [x] Mandatory: Add the required version to the Fix version/s field.
All currently supported ACM versions
2. - [x] Mandatory: Choose the type of documentation change or review.
- [x] We need to update an existing topic:
https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.13/html-single/business_continuity
Note: As the feature and documentation are better understood, this recommendation may change. If this is new documentation, link to the section where you think it should be placed.
Customer Portal published version
https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.13
https://github.com/stolostron/rhacm-docs
4. - [ ] Mandatory for GA content:
Add a new section right before or after the following (I prefer before, and reorder the numbering accordingly):
https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.13/html-single/business_continuity/index#backup-config-policy-virt
Proposed new section (1.1.14?): Configuring backup and restore for Observability
The ACM Observability service uses an S3-compatible object store to persist all time-series data collected from managed clusters. Because Observability is a stateful service, it is sensitive to active/passive failover patterns. Special care must be taken to ensure data continuity and integrity during hub cluster migration or failover.
1.1.14.1 Automatically Backed-Up and Restored Resources
The Observability service automatically adds the label `cluster.open-cluster-management.io/backup` to key resources so they are included in standard ACM backup and restore processes. For details, see the Backed-Up Resources documentation: https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.13/html-single/business_continuity/index#backed-up-resources
ConfigMaps:
- observability-metrics-custom-allowlist
- thanos-ruler-custom-rules
- alertmanager-config
- Any ConfigMap labeled with `grafana-custom-dashboard`
Secrets:
- thanos-object-storage
- observability-server-ca-certs
- observability-client-ca-certs
- observability-server-certs
- observability-grafana-certs
- alertmanager-byo-ca
- alertmanager-byo-cert
- proxy-byo-ca
- proxy-byo-cert
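As a quick check on the primary hub, the ConfigMaps and Secrets that already carry the backup label can be listed with a label selector, for example:
```
# List Observability ConfigMaps and Secrets that carry the ACM backup label
oc get configmaps,secrets -n open-cluster-management-observability \
  -l cluster.open-cluster-management.io/backup
```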
1.1.14.2 Resources That Must Be Backed Up Manually
Some Observability resources must be manually backed up and restored to ensure continuity across hub clusters:
1. Observatorium resource
Contains the tenant ID, which must be preserved during restore.
```
oc get observatorium -n open-cluster-management-observability -o yaml > observatorium-backup.yaml
```
2. MultiClusterObservability custom resource (MCO CR)
Defines the overall Observability deployment.
```
oc get mco observability -o yaml > mco-cr-backup.yaml
```
1.1.14.3 Backup and Restore Procedure
This procedure assumes that the same S3-compatible object store is used across both the primary and failover (backup) hub clusters to minimize disruption in metric data collection.
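For illustration only, both hubs are expected to point at the same object store through the thanos-object-storage secret; a minimal sketch of such a secret, with placeholder bucket, endpoint, and credentials, looks like this:
```
apiVersion: v1
kind: Secret
metadata:
  name: thanos-object-storage
  namespace: open-cluster-management-observability
type: Opaque
stringData:
  thanos.yaml: |
    type: s3
    config:
      bucket: <shared-bucket>      # the same bucket must be used on both hubs
      endpoint: <s3-endpoint>
      insecure: false
      access_key: <access-key>
      secret_key: <secret-key>
```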
Step 1: Shut Down the Compactor on the Primary Hub
To prevent write conflicts and deduplication issues while both hubs work against the same object storage, stop the Thanos compactor before starting the restore on the backup hub:
```
oc scale statefulset observability-thanos-compact -n open-cluster-management-observability --replicas=0
```
Verify that the compactor is stopped (with the replicas scaled to 0, the pod should no longer be listed):
```
oc get pods observability-thanos-compact-0 -n open-cluster-management-observability
```
Step 2: Restore Backup Resources
Restore resources from backup and activate them by following the process described in the Restore Backup section (https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.13/html-single/business_continuity/index#restore-backup). This restores the automatically backed-up ConfigMaps and Secrets listed in section 1.1.14.1.
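As an illustrative sketch (based on the restore samples in the linked documentation; the name and options are placeholders and should be adjusted to your environment), a Restore resource that restores and activates the latest backups could look like:
```
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Restore
metadata:
  name: restore-acm-passive-activate   # name is a placeholder
  namespace: open-cluster-management-backup
spec:
  cleanupBeforeRestore: CleanupRestored
  veleroManagedClustersBackupName: latest
  veleroCredentialsBackupName: latest
  veleroResourcesBackupName: latest
```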
Step 3: Restore the Observatorium Custom Resource
Restore the observatorium resource to the backup hub:
```
oc apply -f observatorium-backup.yaml
```
This preserves the tenant ID, which is critical for maintaining continuity in metrics ingestion and querying.
Step 4: Start Observability on the New (Restore) Hub
Apply the backed-up MultiClusterObservability CR:
```
oc apply -f mco-cr-backup.yaml
```
The operator will start the observability stack and detect the existing observatorium resource, reusing the preserved tenant ID instead of creating a new one.
Step 5: Migrate Managed Clusters to the New Hub
Detach managed clusters from the primary hub and re-attach them to the new (restored) hub. Once attached, they will resume sending metrics to the Observability service.
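A minimal sketch of this step, assuming the ManagedCluster API is used to detach clusters from the primary hub and to confirm they are connected to the new hub (the cluster name is a placeholder):
```
# On the primary hub: detach a managed cluster (placeholder name)
oc delete managedcluster <cluster-name>

# On the new hub: confirm the cluster is attached and available
oc get managedcluster <cluster-name>
```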
Step 6: Shut Down Observability on the Primary Hub
After all managed clusters have been migrated:
```
oc delete mco observability
```
This command gracefully shuts down the observability stack on the primary hub and flushes in-memory metrics to S3 before termination.
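To confirm the shutdown, one option is to check that no Observability pods remain on the primary hub:
```
# Expect no remaining Observability pods on the primary hub
oc get pods -n open-cluster-management-observability
```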
- [x] Add steps, the diff, known issue, and/or other important conceptual information in the following space:
1. Dual Write Period
During the migration process, both the primary and backup hub clusters may write to the S3 object store simultaneously (until all clusters are reattached). Because only one compactor is active, duplicate data is eventually deduplicated by the compactor on the new hub.
2. Cluster Identity and Grafana Visualization
Both primary and backup hubs use the default cluster name local-cluster. However, their internal cluster IDs differ, resulting in separate time series in Grafana.
To ensure that hub metrics are collected and visible in Grafana, hub self-management must be enabled in each ACM hub; see the sketch after this list. This allows the observability stack to treat the hub cluster as a managed cluster and include its metrics for collection.
For ACM 2.14 and later: Use the local-cluster renaming feature to assign unique names to each hub. This helps disambiguate metrics for each hub cluster in Grafana.
3. Metric Gaps During Migration
No metrics are collected from a managed cluster between the time it is detached from the primary hub and re-attached to the backup hub. To minimize gaps, consider scripting the cluster migration for large fleets.
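One way to enable hub self-management (referenced from note 2 above), assuming the MultiClusterHub resource uses the default name and namespace, is a patch such as:
```
# Enable hub self-management (default MultiClusterHub name and namespace assumed)
oc patch multiclusterhub multiclusterhub -n open-cluster-management \
  --type merge -p '{"spec":{"disableHubSelfManagement":false}}'
```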
- [ ] Add required access level (for example, Cluster Administrator) for the user to complete the task:
- [x] Add verification at the end of the task; how does the user verify success (a command to run or a result to see)?
1. Access Grafana on the new active hub cluster.
2. Ensure that metrics appear for multiple clusters, including:
- Managed clusters previously reporting to the primary hub
- The hub cluster itself (local-cluster or its renamed variant)
3. Use Grafana on the new active hub to verify that you can query metrics from managed clusters, including historical metrics.
Success Criteria: Queries return both recent and historical metrics for the expected set of clusters, with no gaps during the failover window (beyond the expected detach/reattach delay).
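For example, the Grafana URL on the new active hub can be retrieved from its route (the route name grafana is an assumption; adjust if it differs in your environment):
```
# Retrieve the Grafana URL on the new active hub
oc get route grafana -n open-cluster-management-observability -o jsonpath='{.spec.host}'
```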
- [x] Add link to dev story here:
https://issues.redhat.com/browse/ACM-1001
The doc task specifically clarifies the steps required for migrating Observability as part of the backup and restore procedure.
5. - [ ] Mandatory for bugs: What is the diff? Clearly define what the
problem is, what the change is, and link to the current documentation. Only
use this for a documentation bug.
clones: ACM-20325 ACM Clarify Observability in backup restore scenario (Closed)