  Red Hat Advanced Cluster Management / ACM-20325

ACM Clarify Observability in backup restore scenario


    • MCO Core Sprint 46, MCO Core Sprint 47

      This is the original request:

      Can you confirm how this will work with an Active and Passive hub cluster please?

      ANS: Addressed in the proposed documentation (see doc template)

      I want data to be collected from ALL clusters; it's no good to me having data from the passive hub cluster only available after failover. Is this possible?
      If I connect Thanos in both clusters to the same S3 backing store, will both sets of data be collected from the hub clusters and keyed separately, or will it all be mixed up together under "local-cluster"?

      ANS: Addressed in the proposed documentation (see doc template)
      Both passive and active hub metrics are collected as `local-cluster`; however, they appear as separate time series in Grafana because the two hub clusters have different cluster IDs and can be differentiated using the clusterID attribute.

      (2.14 only) - The best approach is to use the local cluster renaming feature on both active and passive clusters so that hub cluster metrics can be independently queried.
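      For reference, each hub's cluster ID can be read from its ClusterVersion resource; on an OpenShift hub this value corresponds to the clusterID attribute used to tell the two `local-cluster` series apart in Grafana (a quick check, not part of the formal procedure):
      ```
      # Run on each hub; the returned UUID identifies that hub's time series.
      oc get clusterversion version -o jsonpath='{.spec.clusterID}{"\n"}'
      ```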

      When we failover to the new cluster, what happens to the old cluster's data? i.e. if I open Observability in the new cluster, connected to the same S3 store, will it retrieve the data just for the current hub cluster or will it also retrieve the data from the old one and display it mixed with the new one?
      Similarly, will it only start collecting data from the new hub cluster from the point of failover? i.e. historical data is missing / lost

      ANS: The migration steps (see doc template) allow you to view both new and historic data after the failover, as long as the same S3 bucket is used as part of the migration. Metrics are not collected from a managed cluster during the interval between detaching it from the primary hub and re-attaching it to the backup hub. Customers can minimize this interval by scripting the managed cluster migration to the new hub, especially for a large fleet of managed clusters.

      rhn-support-cstark please review and let me know if these answers require further refinement.

      1. - [x] Mandatory: Add the required version to the Fix version/s field.

      All supported ACM versions

      2. - [x] Mandatory: Choose the type of documentation change or review.

      Note: As the feature and doc is understood, this recommendation may
      change. If this is new documentation, link to the section where you think
      it should be placed.

      Customer Portal published version

      https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.13

      https://github.com/stolostron/rhacm-docs

      4. - [ ] Mandatory for GA content:

      Add a new section right before/after (I prefer before, and reorder the numbering)
      https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.13/html-single/business_continuity/index#backup-config-policy-virt

      (1.1.14?). Configuring backup and restore for Observability

      The ACM Observability service uses an S3-compatible object store to persist all time-series data collected from managed clusters. Because Observability is a stateful service, it is sensitive to active/passive failover patterns. Special care must be taken to ensure data continuity and integrity during hub cluster migration or failover.

      1.1.14.1 Automatically Backed-Up and Restored Resources

      The Observability service automatically adds the label `cluster.open-cluster-management.io/backup` to key resources so they are included in standard ACM backup and restore processes. For details, see the Backed-Up Resources documentation: https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.13/html-single/business_continuity/index#backed-up-resources

      ConfigMaps:

      • observability-metrics-custom-allowlist
      • thanos-ruler-custom-rules
      • alertmanager-config
      • Any ConfigMap labeled with `grafana-custom-dashboard`

      Secrets:

      • thanos-object-storage
      • observability-server-ca-certs
      • observability-client-ca-certs
      • observability-server-certs
      • observability-grafana-certs
      • alertmanager-byo-ca
      • alertmanager-byo-cert
      • proxy-byo-ca
      • proxy-byo-cert
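
      To confirm which of these resources currently carry the backup label on a given hub, a check along the following lines can be used (namespace assumed to be open-cluster-management-observability, as in the procedure below):
      ```
      # List ConfigMaps and Secrets already labeled for inclusion in ACM backups.
      oc get configmaps,secrets -n open-cluster-management-observability \
        -l cluster.open-cluster-management.io/backup
      ```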

      1.1.14.2 Resources That Must Be Backed Up Manually

      Some Observability resources must be manually backed up and restored to ensure continuity across hub clusters:

      1. Observatorium resource
      Contains the tenant ID, which must be preserved during restore.
      ```
      oc get observatorium -n open-cluster-management-observability -o yaml > observatorium-backup.yaml
      ```

      2. MultiClusterObservability custom resource (MCO CR)
      Defines the overall Observability deployment.
      ```
      oc get mco observability -o yaml > mco-cr-backup.yaml
      ```

      1.1.14.3 Backup and Restore Procedure

      This procedure assumes that the same S3-compatible object store is used across both the primary and failover (backup) hub clusters to minimize disruption in metric data collection.

      Step 1: Shut Down the Compactor on the Primary Hub
      To prevent write conflicts and deduplication issues while both hubs use the same object storage, stop the Thanos compactor before starting the restore on the backup hub:
      ```
      oc scale statefulset observability-thanos-compact -n open-cluster-management-observability --replicas=0
      ```
      Verify the compactor is stopped:
      ```
      oc get pods observability-thanos-compact-0 -n open-cluster-management-observability
      ```

      Step 2: Restore Backup Resources
      Restore resources from backup and activate them by following the process described in the Restore Backup section (https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.13/html-single/business_continuity/index#restore-backup). This restores the automatically backed-up ConfigMaps and Secrets listed in section 1.1.14.1.
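
      As an illustration only, an activation restore typically looks like the sketch below; the resource name and field values here are assumptions and must be verified against the linked Restore Backup documentation:
      ```
      # Hedged sketch of a Restore resource; confirm field values in the docs.
      cat <<'EOF' | oc apply -f -
      apiVersion: cluster.open-cluster-management.io/v1beta1
      kind: Restore
      metadata:
        name: restore-acm
        namespace: open-cluster-management-backup
      spec:
        cleanupBeforeRestore: CleanupRestored
        veleroManagedClustersBackupName: skip   # managed clusters are migrated in Step 5
        veleroCredentialsBackupName: latest
        veleroResourcesBackupName: latest
      EOF
      ```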

      Step 3: Restore the Observatorium Custom Resource
      Restore the observatorium resource to the backup hub:
      ```
      oc apply -f observatorium-backup.yaml
      ```
      This preserves the tenant ID, which is critical for maintaining continuity in metrics ingestion and querying.
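
      To confirm the tenant ID carried over, compare the restored resource with the backup file (tenant definitions typically sit under spec.api.tenants; the exact path may vary by version):
      ```
      # Compare the tenant id shown here with the value in observatorium-backup.yaml.
      oc get observatorium -n open-cluster-management-observability -o yaml | grep -A 3 "tenants:"
      ```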

      Step 4: Start Observability on the New (Restore) Hub
      Apply the backed-up MultiClusterObservability CR:
      ```
      oc apply -f mco-cr-backup.yaml
      ```
      The operator will start the observability stack and detect the existing observatorium resource, reusing the preserved tenant ID instead of creating a new one.
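
      As a quick check that the stack is coming up on the new hub, confirm that the observability pods reach the Running state:
      ```
      # All observability components should eventually be Running.
      oc get pods -n open-cluster-management-observability
      ```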

      Step 5: Migrate Managed Clusters to the New Hub

      Detach managed clusters from the primary hub and re-attach them to the new (restored) hub. Once attached, they will resume sending metrics to the Observability service.
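
      For a large fleet, the detach step on the primary hub can be scripted; the following is a rough sketch only (re-attaching each cluster to the new hub still follows the normal cluster import procedure):
      ```
      # Detach every managed cluster from the primary hub, except the hub's own
      # local-cluster entry (adjust the filter if local-cluster was renamed).
      for cluster in $(oc get managedclusters -o name | grep -v local-cluster); do
        echo "Detaching ${cluster} from the primary hub"
        oc delete "${cluster}" --wait=false
      done
      ```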

      Step 6: Shut Down Observability on the Primary Hub
      After all managed clusters have been migrated:
      ```
      oc delete mco observability
      ```
      This command gracefully shuts down the observability stack on the primary hub and flushes in-memory metrics to S3 before termination.
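
      Confirm the shutdown completed before decommissioning the primary hub:
      ```
      # No observability pods should remain after the MCO CR is deleted.
      oc get pods -n open-cluster-management-observability
      ```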

      • [x] Add steps, the diff, known issue, and/or other important conceptual information in the following space:

      1. Dual Write Period
      During the migration process, both the primary and backup hub clusters may write to the S3 object store simultaneously (until all clusters are reattached). Because only one compactor is active, duplicate data is eventually deduplicated by the compactor on the new hub.
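
      To confirm that only one compactor is active during this window, check the compactor StatefulSet on each hub; it should show 0 replicas on the primary hub (scaled down in Step 1) and 1 on the new hub:
      ```
      # Run on each hub during the dual-write window.
      oc get statefulset observability-thanos-compact \
        -n open-cluster-management-observability
      ```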

      2. Cluster Identity and Grafana Visualization
      Both primary and backup hubs use the default cluster name local-cluster. However, their internal cluster IDs differ, resulting in separate time series in Grafana.

      To ensure that hub metrics are collected and visible in Grafana, hub self-management must be enabled on each ACM hub. This allows the observability stack to treat the hub cluster as a managed cluster and collect its metrics.

      For ACM 2.14 and later: Use the local-cluster renaming feature to assign unique names to each hub. This helps disambiguate metrics for each hub cluster in Grafana.

      3. Metric Gaps During Migration
      No metrics are collected from a managed cluster between the time it is detached from the primary hub and re-attached to the backup hub. To minimize gaps, consider scripting the cluster migration for large fleets.

      • [ ] Add Required access level (example: Cluster Administrator) for the user to complete the task:
      • [x] Add verification at the end of the task, how does the user verify success (a command to run or a result to see?)
        1. Access Grafana on the new active hub cluster.
        2. Ensure that metrics appear for multiple clusters, including:
      • Managed clusters previously reporting to the primary hub
      • The hub cluster itself (local-cluster or renamed variant)
        Use Grafana on the new active hub and verify that you can query metrics from managed clusters, including historic metrics.
        Success Criteria: Queries return both recent and historical metrics for the expected set of clusters, with no gaps during the failover window (beyond expected detach/attach delay).
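
        In addition to the Grafana check, the observability addon status can be verified from the CLI on the new hub (the addon name observability-controller is the usual default and may differ by version):
        ```
        # Each migrated managed cluster should report an available observability addon.
        oc get managedclusteraddons -A | grep observability-controller
        ```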

      The doc task specifically clarifies the steps required for migrating Observability as part of the backup and restore procedure.

      5. - [ ] Mandatory for bugs: What is the diff? Clearly define what the
      problem is, what the change is, and link to the current documentation. Only
      use this for a documentation bug.
