ACM-22475

Document Observability in backup restore scenario

      1. - [x] Mandatory: Add the required version to the Fix version/s field.

      All currently supported ACM versions

      2. - [x] Mandatory: Choose the type of documentation change or review.

      Note: As the feature and the documentation are better understood, this
      recommendation may change. If this is new documentation, link to the
      section where you think it should be placed.

      Customer Portal published version

      https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.13
      https://github.com/stolostron/rhacm-docs

      4. - [ ] Mandatory for GA content:

      Add a new section right before or after the following section (I prefer before, with the numbering reordered accordingly):
      https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.13/html-single/business_continuity/index#backup-config-policy-virt

      (1.1.14?). Configuring backup and restore for Observability

      The ACM Observability service uses an S3-compatible object store to persist all time-series data collected from managed clusters. Because Observability is a stateful service, it is sensitive to active/passive failover patterns. Special care must be taken to ensure data continuity and integrity during hub cluster migration or failover.

      1.1.14.1 Automatically Backed-Up and Restored Resources

      The Observability service automatically adds the label `cluster.open-cluster-management.io/backup` to key resources so they are included in standard ACM backup and restore processes; a command to list the labeled resources follows the lists below. For details, see the Backed-Up Resources documentation: https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.13/html-single/business_continuity/index#backed-up-resources

      ConfigMaps:

      • observability-metrics-custom-allowlist
      • thanos-ruler-custom-rules
      • alertmanager-config
      • Any ConfigMap labeled with `grafana-custom-dashboard`

      Secrets:

      • thanos-object-storage
      • observability-server-ca-certs
      • observability-client-ca-certs
      • observability-server-certs
      • observability-grafana-certs
      • alertmanager-byo-ca
      • alertmanager-byo-cert
      • proxy-byo-ca
      • proxy-byo-cert
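
      To confirm which resources currently carry the backup label, you can list them directly (a quick check, using the same namespace as the rest of this procedure):
      ```
      oc get configmaps,secrets -n open-cluster-management-observability \
        -l cluster.open-cluster-management.io/backup
      ```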

      1.1.14.2 Resources That Must Be Backed Up Manually

      Some Observability resources must be manually backed up and restored to ensure continuity across hub clusters:

      1. Observatorium resource
      Contains the tenant ID, which must be preserved during restore.
      ```
      oc get observatorium -n open-cluster-management-observability -o yaml > observatorium-backup.yaml
      ```

      2. MultiClusterObservability custom resource (MCO CR)
      Defines the overall Observability deployment.
      ```
      oc get mco observability -o yaml > mco-cr-backup.yaml
      ```

      1.1.14.3 Backup and Restore Procedure

      This procedure assumes that the same S3-compatible object store is used across both the primary and failover (backup) hub clusters to minimize disruption in metric data collection.
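
      Both hubs must reference the same bucket. To compare the object store configuration on each hub, decode the thanos-object-storage Secret; this assumes the Secret stores its configuration under the standard thanos.yaml key:
      ```
      oc get secret thanos-object-storage -n open-cluster-management-observability \
        -o jsonpath='{.data.thanos\.yaml}' | base64 -d
      ```
      Run this on both hubs and verify that the decoded configuration points to the same bucket and endpoint.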

      Step 1: Shut Down the Compactor on the Primary Hub
      To prevent write conflicts and deduplication issues while both hubs write to the same object storage, stop the Thanos compactor before starting the restore on the backup hub:
      ```
      oc scale statefulset observability-thanos-compact -n open-cluster-management-observability --replicas=0
      ```
      Verify the compactor is stopped:
      ```
      oc get pods observability-thanos-compact-0 -n open-cluster-management-observability
      ```
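
      As an additional check, confirm that the StatefulSet was scaled down to zero replicas (the expected output is 0):
      ```
      oc get statefulset observability-thanos-compact \
        -n open-cluster-management-observability -o jsonpath='{.spec.replicas}'
      ```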

      Step 2: Restore Backup Resources
      Restore resources from backup and activate them by following the process described in the Restore backup section:
      https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.13/html-single/business_continuity/index#restore-backup
      This restores the automatically backed-up ConfigMaps and Secrets listed in section 1.1.14.1.
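
      As a sketch, a Restore resource that restores and activates the latest backups might look like the following; the resource name is arbitrary, and the namespace and fields follow the cluster backup operator defaults described in the linked section. Save it to a file and apply it with `oc apply -f <file>`:
      ```
      apiVersion: cluster.open-cluster-management.io/v1beta1
      kind: Restore
      metadata:
        name: restore-acm
        namespace: open-cluster-management-backup
      spec:
        cleanupBeforeRestore: CleanupRestored
        veleroManagedClustersBackupName: latest
        veleroCredentialsBackupName: latest
        veleroResourcesBackupName: latest
      ```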

      Step 3: Restore the Observatorium Custom Resource
      Restore the observatorium resource to the backup hub:
      ```
      oc apply -f observatorium-backup.yaml
      ```
      This preserves the tenant ID, which is critical for maintaining continuity in metrics ingestion and querying.
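
      To confirm the tenant ID was preserved, compare the value recorded in the backup file with the value now active on the hub. The jsonpath below assumes the tenant list is stored under spec.api.tenants in the Observatorium resource:
      ```
      # Tenant ID captured from the primary hub (rough check on the backup file)
      grep -A 2 'tenants:' observatorium-backup.yaml

      # Tenant ID active on the new hub
      oc get observatorium -n open-cluster-management-observability \
        -o jsonpath='{.items[0].spec.api.tenants[0].id}'
      ```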

      Step 4: Start Observability on the New (Restore) Hub
      Apply the backed-up MultiClusterObservability CR:
      ```
      oc apply -f mco-cr-backup.yaml
      ```
      The operator will start the observability stack and detect the existing observatorium resource, reusing the preserved tenant ID instead of creating a new one.
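
      You can watch the stack come up; all pods in the namespace should eventually reach the Running state:
      ```
      oc get pods -n open-cluster-management-observability -w
      ```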

      Step 5: Migrate Managed Clusters to the New Hub

      Detach managed clusters from the primary hub and re-attach them to the new (restored) hub. Once attached, they will resume sending metrics to the Observability service.
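
      As a rough sketch, detaching and re-attaching one cluster could look like the following; `prod-cluster-1` is a placeholder name, and automatic import on the new hub additionally requires an auto-import secret containing the cluster's kubeconfig in the cluster namespace:
      ```
      # On the primary hub: detach the managed cluster
      oc delete managedcluster prod-cluster-1
      ```
      On the new hub, re-create the ManagedCluster so the hub accepts the cluster again (save to a file and apply with `oc apply -f <file>`):
      ```
      apiVersion: cluster.open-cluster-management.io/v1
      kind: ManagedCluster
      metadata:
        name: prod-cluster-1
      spec:
        hubAcceptsClient: true
      ```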

      Step 6: Shut Down Observability on the Primary Hub
      After all managed clusters have been migrated:
      ```
      oc delete mco observability
      ```
      This command gracefully shuts down the observability stack on the primary hub and flushes in-memory metrics to S3 before termination.
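
      To confirm that the shutdown completed, check that the observability pods have terminated:
      ```
      oc get pods -n open-cluster-management-observability
      ```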

      • [x] Add steps, the diff, known issue, and/or other important conceptual information in the following space:

      1. Dual Write Period
      During the migration process, both the primary and backup hub clusters may write to the S3 object store simultaneously (until all clusters are reattached). Because only one compactor is active, duplicate data is eventually deduplicated by the compactor on the new hub.

      2. Cluster Identity and Grafana Visualization
      Both primary and backup hubs use the default cluster name local-cluster. However, their internal cluster IDs differ, resulting in separate time series in Grafana.

      To ensure that hub metrics are collected and visible in Grafana, hub self-management must be enabled in each ACM hub (a patch example follows this item). This allows the observability stack to treat the hub cluster as a managed cluster and include its metrics for collection.

      For ACM 2.14 and later: Use the local-cluster renaming feature to assign unique names to each hub. This helps disambiguate metrics for each hub cluster in Grafana.
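
      Hub self-management is controlled by the MultiClusterHub resource. A possible way to enable it, assuming the default resource name and namespace:
      ```
      oc patch multiclusterhub multiclusterhub -n open-cluster-management \
        --type=merge -p '{"spec":{"disableHubSelfManagement":false}}'
      ```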

      3. Metric Gaps During Migration
      No metrics are collected from a managed cluster between the time it is detached from the primary hub and re-attached to the backup hub. To minimize gaps, consider scripting the cluster migration for large fleets.
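
      For example, a hypothetical detach loop on the primary hub might look like this; it skips the hub's own local-cluster entry:
      ```
      for cluster in $(oc get managedclusters -o jsonpath='{.items[*].metadata.name}'); do
        # Skip the hub's self-managed entry
        if [ "$cluster" != "local-cluster" ]; then
          oc delete managedcluster "$cluster" --wait=false
        fi
      done
      ```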

      • [ ] Add required access level (example: Cluster Administrator) for the user to complete the task:
      • [x] Add verification at the end of the task. How does the user verify success (a command to run or a result to see)?
        1. Access Grafana on the new active hub cluster.
        2. Ensure that metrics appear for multiple clusters, including:
      • Managed clusters previously reporting to the primary hub
      • The hub cluster itself (local-cluster or renamed variant)
        Use Grafana on the new active hub and verify that you can query metrics from managed clusters, including historical metrics.
        Success Criteria: Queries return both recent and historical metrics for the expected set of clusters, with no gaps during the failover window (beyond expected detach/attach delay).

      This doc task specifically clarifies the steps required for migrating Observability as part of the backup and restore procedure.

      5. - [ ] Mandatory for bugs: What is the diff? Clearly define what the
      problem is, what the change is, and link to the current documentation. Only
      use this for a documentation bug.
