  Red Hat Advanced Cluster Management / ACM-20325

ACM Clarify Observability in backup restore scenario


    • MCO Core Sprint 46, MCO Core Sprint 47

      This is the original request:

      Can you confirm how this will work with an Active and Passive hub cluster please?

      ANS: Addressed in the proposed documentation (see doc template)

      I want data to be collected from ALL clusters; it's no good to me having data from the passive hub cluster only available after failover. Is this possible?
      If I connect Thanos in both clusters to the same S3 backing store, will both sets of data be collected from the hub clusters and keyed separately, or will it all be mixed up together under "local-cluster"?

      ANS: Addressed in the proposed documentation (see doc template)
      Both passive and active hub metrics are collected as `local-cluster`; however, they appear as separate time series in Grafana because the two hub clusters have different cluster IDs and can be differentiated using the clusterID attribute.

      (2.14 only) - The best approach is to use the local cluster renaming feature on both active and passive clusters so that hub cluster metrics can be independently queried.
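      For reference, each hub's cluster ID can be read from its ClusterVersion resource; on an OpenShift hub this value corresponds to the clusterID attribute used to tell the two `local-cluster` series apart in Grafana (a quick check, not part of the formal procedure):
      ```
      # Run on each hub; the returned UUID identifies that hub's time series.
      oc get clusterversion version -o jsonpath='{.spec.clusterID}{"\n"}'
      ```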

      When we failover to the new cluster, what happens to the old cluster's data? i.e. if I open Observability in the new cluster, connected to the same S3 store, will it retrieve the data just for the current hub cluster or will it also retrieve the data from the old one and display it mixed with the new one?
      Similarly, will it only start collecting data from the new hub cluster from the point of failover? i.e. historical data is missing / lost

      ANS: The migration steps (see doc template) allow you to view both new and historic data after the failover, as long as the same S3 bucket is used as part of the migration. Metrics are not collected from a managed cluster during the interval between detaching it from the primary hub and re-attaching it to the backup hub. Customers can minimize this interval by scripting the managed cluster migration to the new hub, especially for a large fleet of managed clusters.

      rhn-support-cstark please review and let me know if these answers require further refinement.

      1. - [x] Mandatory: Add the required version to the Fix version/s field.

      All supported ACM versions

      2. - [x] Mandatory: Choose the type of documentation change or review.

      Note: As the feature and doc is understood, this recommendation may
      change. If this is new documentation, link to the section where you think
      it should be placed.

      Customer Portal published version

      https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.13

      https://github.com/stolostron/rhacm-docs

      4. - [ ] Mandatory for GA content:

      Add a new section right before/after (I prefer before, and reorder the numbering)
      https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.13/html-single/business_continuity/index#backup-config-policy-virt

      (1.1.14?). Configuring backup and restore for Observability

      The ACM Observability service uses an S3-compatible object store to persist all time-series data collected from managed clusters. Because Observability is a stateful service, it is sensitive to active/passive failover patterns. Special care must be taken to ensure data continuity and integrity during hub cluster migration or failover.

      1.1.14.1 Automatically Backed-Up and Restored Resources

      The Observability service automatically adds the label `cluster.open-cluster-management.io/backup` to key resources so they are included in standard ACM backup and restore processes. For details, see the Backed-Up Resources documentation: https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.13/html-single/business_continuity/index#backed-up-resources

      ConfigMaps:

      • observability-metrics-custom-allowlist
      • thanos-ruler-custom-rules
      • alertmanager-config
      • Any ConfigMap labeled with `grafana-custom-dashboard`

      Secrets:

      • thanos-object-storage
      • observability-server-ca-certs
      • observability-client-ca-certs
      • observability-server-certs
      • observability-grafana-certs
      • alertmanager-byo-ca
      • alertmanager-byo-cert
      • proxy-byo-ca
      • proxy-byo-cert
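
      To confirm which of these resources currently carry the backup label on a given hub, a check along the following lines can be used (namespace assumed to be open-cluster-management-observability, as in the procedure below):
      ```
      # List ConfigMaps and Secrets already labeled for inclusion in ACM backups.
      oc get configmaps,secrets -n open-cluster-management-observability \
        -l cluster.open-cluster-management.io/backup
      ```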

      1.1.14.2 Resources That Must Be Backed Up Manually

      Some Observability resources must be manually backed up and restored to ensure continuity across hub clusters:

      1. Observatorium resource
      Contains the tenant ID, which must be preserved during restore.
      ```
      oc get observatorium -n open-cluster-management-observability -o yaml > observatorium-backup.yaml
      ```

      2. MultiClusterObservability custom resource (MCO CR)
      Defines the overall Observability deployment.
      ```
      oc get mco observability -o yaml > mco-cr-backup.yaml
      ```

      1.1.14.3 Backup and Restore Procedure

      This procedure assumes that the same S3-compatible object store is used across both the primary and failover (backup) hub clusters to minimize disruption in metric data collection.

      Step 1: Shut Down the Compactor on the Primary Hub
      To prevent write conflicts and deduplication issues while both hubs use the same object storage, stop the Thanos compactor before starting the restore on the backup hub:
      ```
      oc scale statefulset observability-thanos-compact -n open-cluster-management-observability --replicas=0
      ```
      Verify the compactor is stopped:
      ```
      oc get pods observability-thanos-compact-0 -n open-cluster-management-observability
      ```

      Step 2: Restore Backup Resources
      Restore resources from backup and activate them by following the process described in the Restore Backup section (https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.13/html-single/business_continuity/index#restore-backup). This restores the automatically backed-up ConfigMaps and Secrets listed in section 1.1.14.1.
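
      As an illustration only, an activation restore typically looks like the sketch below; the resource name and field values here are assumptions and must be verified against the linked Restore Backup documentation:
      ```
      # Hedged sketch of a Restore resource; confirm field values in the docs.
      cat <<'EOF' | oc apply -f -
      apiVersion: cluster.open-cluster-management.io/v1beta1
      kind: Restore
      metadata:
        name: restore-acm
        namespace: open-cluster-management-backup
      spec:
        cleanupBeforeRestore: CleanupRestored
        veleroManagedClustersBackupName: skip   # managed clusters are migrated in Step 5
        veleroCredentialsBackupName: latest
        veleroResourcesBackupName: latest
      EOF
      ```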

      Step 3: Restore the Observatorium Custom Resource
      Restore the observatorium resource to the backup hub:
      ```
      oc apply -f observatorium-backup.yaml
      ```
      This preserves the tenant ID, which is critical for maintaining continuity in metrics ingestion and querying.
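
      To confirm the tenant ID carried over, compare the restored resource with the backup file (tenant definitions typically sit under spec.api.tenants; the exact path may vary by version):
      ```
      # Compare the tenant id shown here with the value in observatorium-backup.yaml.
      oc get observatorium -n open-cluster-management-observability -o yaml | grep -A 3 "tenants:"
      ```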

      Step 4: Start Observability on the New (Restore) Hub
      Apply the backed-up MultiClusterObservability CR:
      ```
      oc apply -f mco-cr-backup.yaml
      ```
      The operator will start the observability stack and detect the existing observatorium resource, reusing the preserved tenant ID instead of creating a new one.
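
      As a quick check that the stack is coming up on the new hub, confirm that the observability pods reach the Running state:
      ```
      # All observability components should eventually be Running.
      oc get pods -n open-cluster-management-observability
      ```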

      Step 5: Migrate Managed Clusters to the New Hub

      Detach managed clusters from the primary hub and re-attach them to the new (restored) hub. Once attached, they will resume sending metrics to the Observability service.
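
      For a large fleet, the detach step on the primary hub can be scripted; the following is a rough sketch only (re-attaching each cluster to the new hub still follows the normal cluster import procedure):
      ```
      # Detach every managed cluster from the primary hub, except the hub's own
      # local-cluster entry (adjust the filter if local-cluster was renamed).
      for cluster in $(oc get managedclusters -o name | grep -v local-cluster); do
        echo "Detaching ${cluster} from the primary hub"
        oc delete "${cluster}" --wait=false
      done
      ```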

      Step 6: Shut Down Observability on the Primary Hub
      After all managed clusters have been migrated:
      ```
      oc delete mco observability
      ```
      This command gracefully shuts down the observability stack on the primary hub and flushes in-memory metrics to S3 before termination.
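
      Confirm the shutdown completed before decommissioning the primary hub:
      ```
      # No observability pods should remain after the MCO CR is deleted.
      oc get pods -n open-cluster-management-observability
      ```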

      • [x] Add steps, the diff, known issue, and/or other important conceptual information in the following space:

      1. Dual Write Period
      During the migration process, both the primary and backup hub clusters may write to the S3 object store simultaneously (until all clusters are reattached). Because only one compactor is active, duplicate data is eventually deduplicated by the compactor on the new hub.
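
      To confirm that only one compactor is active during this window, check the compactor StatefulSet on each hub; it should show 0 replicas on the primary hub (scaled down in Step 1) and 1 on the new hub:
      ```
      # Run on each hub during the dual-write window.
      oc get statefulset observability-thanos-compact \
        -n open-cluster-management-observability
      ```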

      2. Cluster Identity and Grafana Visualization
      Both primary and backup hubs use the default cluster name local-cluster. However, their internal cluster IDs differ, resulting in separate time series in Grafana.

      To ensure that hub metrics are collected and visible in Grafana, hub self-management must be enabled on each ACM hub. This allows the observability stack to treat the hub cluster as a managed cluster and collect its metrics.

      For ACM 2.14 and later: Use the local-cluster renaming feature to assign unique names to each hub. This helps disambiguate metrics for each hub cluster in Grafana.

      3. Metric Gaps During Migration
      No metrics are collected from a managed cluster between the time it is detached from the primary hub and re-attached to the backup hub. To minimize gaps, consider scripting the cluster migration for large fleets.

      • [ ] Add Required access level (example: Cluster Administrator) for the user to complete the task:
      • [x] Add verification at the end of the task, how does the user verify success (a command to run or a result to see?)
        1. Access Grafana on the new active hub cluster.
        2. Ensure that metrics appear for multiple clusters, including:
      • Managed clusters previously reporting to the primary hub
      • The hub cluster itself (local-cluster or renamed variant)
        Use Grafana on the new active hub and verify that you can query metrics from managed clusters, including historic metrics.
        Success Criteria: Queries return both recent and historical metrics for the expected set of clusters, with no gaps during the failover window (beyond expected detach/attach delay).
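
        In addition to the Grafana check, the observability addon status can be verified from the CLI on the new hub (the addon name observability-controller is the usual default and may differ by version):
        ```
        # Each migrated managed cluster should report an available observability addon.
        oc get managedclusteraddons -A | grep observability-controller
        ```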

      The doc task specifically clarifies the steps required for migrating Observability as part of the backup and restore procedure.

      5. - [ ] Mandatory for bugs: What is the diff? Clearly define what the
      problem is, what the change is, and link to the current documentation. Only
      use this for a documentation bug.
