  1. Red Hat Advanced Cluster Management
  2. ACM-8509

How to ensure observability data is not lost when a hub cluster is imported as a managed hub cluster


      When observability is configured (regardless of whether hub self-management is turned ON or OFF)
       - Metrics from hub cluster are propagated to ACM Thanos
       - Alerts from hub cluster are propagated to ACM Alert Manager
       - Observability add-on is not present on the hub clusters
       - Observability components (endpoint-observability-operator and metrics collector) are deployed in open-cluster-management-observability namespace on the hub
       - The hub node selectors and tolerations are applied for observability components
       - Hub cluster is listed as "local-cluster" in the list of managed clusters in ACM Grafana dashboards
       - Managed cluster labels are available in ACM Overview dashboard
       - mTLS certificates used for propagating metrics from collector to observatorium api are automatically refreshed before they expire.

      The observability stack is not initialized by the MCO operator in a global hub environment, even if an Observability CR is configured
      - A global hub environment is reliably detected by the presence of the Multicluster-global-hub CR or by other means.
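
The rule above can be sketched as a small decision function. This is only an illustration of the stated criterion, not the MCO operator's code: the `HubState` type, its field names, and `should_initialize_observability` are hypothetical, and the detection of the global hub CR itself is assumed to happen elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HubState:
    """Hypothetical snapshot of the custom resources found on a hub."""
    observability_cr_present: bool
    global_hub_cr_present: bool


def should_initialize_observability(state: HubState) -> bool:
    """Return True when the observability stack should be initialized."""
    # In a global hub environment the stack is never initialized,
    # even when an Observability CR is configured.
    if state.global_hub_cr_present:
        return False
    # Otherwise the stack follows the presence of the Observability CR.
    return state.observability_cr_present
```

The global hub check comes first so that a configured Observability CR cannot override it.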

      When Observability is disabled (MCO CR removed), or ACM is uninstalled,
      - observability components are removed from hub cluster

      When ACM is upgraded from prior releases,
      - Observability add-on is removed (which in turn removes endpoint-observability-operator and metrics-collector)
      - Observability components are redeployed on open-cluster-management namespace

      When custom metrics are applied (add/remove metrics, recording rules, dynamic metrics)
      - metrics are automatically propagated to hub cluster, just like any other spoke cluster

      When the metrics collector fails to propagate metrics for 2 scrape cycles (2 x 5 min = 10 min)
      - a local Prometheus alert is generated to indicate that metrics propagation is not working
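
The 2-cycle threshold can be sketched as follows. The 5-minute cycle and the count of 2 come from the criterion above; the function names and the assumption that the failed cycles must be consecutive and most recent are illustrative only, and a real deployment would more likely express this as a Prometheus alerting rule.

```python
from typing import Sequence

SCRAPE_INTERVAL_MINUTES = 5      # one scrape cycle, per the criterion above
FAILED_CYCLES_BEFORE_ALERT = 2   # failed cycles required before alerting


def alert_delay_minutes() -> int:
    """Minutes of failed propagation before the alert is expected."""
    return SCRAPE_INTERVAL_MINUTES * FAILED_CYCLES_BEFORE_ALERT


def should_alert(cycle_succeeded: Sequence[bool]) -> bool:
    """True when the most recent cycles all failed (assumed: consecutive)."""
    recent = cycle_succeeded[-FAILED_CYCLES_BEFORE_ALERT:]
    # Require a full window so a single failed cycle never triggers the alert.
    return len(recent) == FAILED_CYCLES_BEFORE_ALERT and not any(recent)
```

For example, `should_alert([True, False, False])` is true, while a single failure followed by a success does not trigger the alert.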

      The value of "observability" label on "local-cluster" managed cluster object
      - has no effect on the metrics propagation behavior on the hub

      - When alert propagation is disabled via MCO annotation "mco-disable-alerting: true", hub alerts are not propagated to ACM alert manager.
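
As a minimal sketch of the annotation check above: the annotation key and value are taken from the criterion, while the function name, the plain-dictionary representation of the MCO annotations, and the exact-match comparison against "true" are assumptions made for illustration.

```python
from typing import Mapping

DISABLE_ALERTING_ANNOTATION = "mco-disable-alerting"


def hub_alerts_propagated(mco_annotations: Mapping[str, str]) -> bool:
    """Return True when hub alerts should reach the ACM Alert Manager."""
    # Assumption: only the exact string "true" disables propagation;
    # a missing annotation or any other value leaves it enabled.
    return mco_annotations.get(DISABLE_ALERTING_ANNOTATION) != "true"
```

With `{"mco-disable-alerting": "true"}` the function returns false; with an empty mapping it returns true.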

      There is no change in behavior on regular managed spoke clusters.

      Value Statement

      The observability dashboard "ACM Cluster Overview" is empty when a hub cluster is imported as a managed hub cluster. The cause is that self-management is disabled, so the metrics collector does not run in local-cluster, and the data is lost.

      Three potential solutions:

      1. The data relates to the hub cluster itself rather than to local-cluster, so it should still be collected even when self-management is disabled. Run the metrics collector instance in the open-cluster-management-observability namespace, triggered by the MCO CR.
      2. Import the hub cluster in hosted mode. This is the only option we support. We cannot enable other add-ons in the global hub; only the registration and work agents are installed for the managed hub clusters, and they can be installed into a different namespace. In this case, self-management does not need to be disabled. It disables the capability to use policy to deploy ACM.
      3. Enable observability in the global hub cluster so that the related metrics are collected in the global hub Thanos. The ACM Cluster Overview dashboard can then be added to the global hub Grafana (this may raise a permission issue).

      Definition of Done for Engineering Story Owner (Checklist)

      • ...

      Development Complete

      • The code is complete.
      • Functionality is working.
      • Any required downstream Dockerfile changes are made.

      Tests Automated

      • [ ] Unit/function tests have been automated and incorporated into the build.
      • [ ] 100% automated unit/function test coverage for new or changed APIs.

      Secure Design

      • [ ] Security has been assessed and incorporated into your threat model.

      Multidisciplinary Teams Readiness

      Support Readiness

      • [ ] The must-gather script has been updated.

              smeduri1@redhat.com Subbarao Meduri
              clyang82 Chunlin Yang
              Xiang Yin Xiang Yin