Red Hat Advanced Cluster Management / ACM-5058

ACM Resource Optimization dashboard times out in large scale env [2.6]


    • 1
    • False
    • None
    • False
    • Observability Sprint 2023-11
    • Important
    • No

      Description of problem:

      ACM - Resource Optimization, along with several other ACM dashboards, currently uses the query label_values(kube_pod_info{clusterType!="ocp3"},cluster) to obtain the list of clusters that populates the cluster selection lists on the dashboards. The kube_pod_info metric carries information about every pod in every cluster, so the query is very expensive in a large-scale environment: it can easily end up processing a million time series (total time series = number of clusters x number of pods on each cluster), which results in timeouts in Grafana.

      We could instead use the cluster_version metric, which has only 3 time series per cluster, to get the same results. This metric is also supported on all ACM versions 2.5 and above, so the fix should remain fully compatible when we backport it.
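To make the difference in scale concrete, here is a rough back-of-the-envelope sketch of the arithmetic above. The fleet size (2000 clusters, 500 pods each) is a made-up illustration, not a measurement from the scale lab; only the "3 time series per cluster" figure comes from this ticket.

```python
def estimated_series(clusters: int, series_per_cluster: int) -> int:
    """Series a cluster lookup has to scan: clusters x series per cluster."""
    return clusters * series_per_cluster


# Hypothetical fleet size, chosen only for illustration.
clusters = 2000
pods_per_cluster = 500

# kube_pod_info exposes one series per pod on every cluster.
kube_pod_info_series = estimated_series(clusters, pods_per_cluster)

# cluster_version exposes about 3 series per cluster (figure from this ticket).
cluster_version_series = estimated_series(clusters, 3)

print(kube_pod_info_series)    # 1000000
print(cluster_version_series)  # 6000
```

Under these assumed numbers the cluster list would be derived from a few thousand series instead of a million.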

      The metric cluster_version does not exist in OCP 3.11 so this fix will be applied to OCP 4 dashboards only.
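A minimal sketch of how the change might be applied to a dashboard definition, assuming the dashboard JSON keeps its variables under templating.list with the query stored in a "query" and/or "definition" field (as a string, or as an object with a "query" key). The replacement query string and the function name swap_cluster_query are assumptions for illustration; the actual fix may be written differently.

```python
import copy

# Query currently used by the dashboards (from this ticket).
OLD_QUERY = 'label_values(kube_pod_info{clusterType!="ocp3"},cluster)'
# Assumed replacement; the ticket names the metric but not the exact query.
NEW_QUERY = "label_values(cluster_version,cluster)"


def _is_old_query(value) -> bool:
    """True if value is the old query, ignoring whitespace differences."""
    return isinstance(value, str) and value.replace(" ", "") == OLD_QUERY


def swap_cluster_query(dashboard: dict) -> dict:
    """Return a copy of the dashboard with the cluster variable query replaced."""
    updated = copy.deepcopy(dashboard)
    for variable in updated.get("templating", {}).get("list", []):
        for key in ("query", "definition"):
            value = variable.get(key)
            if _is_old_query(value):
                variable[key] = NEW_QUERY
            elif isinstance(value, dict) and _is_old_query(value.get("query")):
                value["query"] = NEW_QUERY
    return updated


# Hypothetical, heavily trimmed dashboard used only to exercise the function.
example_dashboard = {
    "templating": {
        "list": [
            {
                "name": "cluster",
                "query": 'label_values(kube_pod_info{clusterType!="ocp3"}, cluster)',
            },
            {"name": "namespace", "query": "label_values(namespace)"},
        ]
    }
}

patched = swap_cluster_query(example_dashboard)
print(patched["templating"]["list"][0]["query"])
```

Only variables that match the old query are touched, so OCP 3.11 dashboards that must keep using kube_pod_info can simply be left out of the rewrite.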

      dbennett@redhat.com jbanerje@redhat.com akrzos@redhat.com rhn-support-xiyin FYA

      Version-Release number of selected component (if applicable):

      How reproducible: Always

      Steps to Reproduce: Access ACM - Resource Optimization dashboard in scale lab environment


      Actual results: Dashboard times out with 504 error

      Expected results: Dashboard should load quickly with a list of managed clusters

      Additional info:

              dbennett@redhat.com Disaiah Bennett
              smeduri1@redhat.com Subbarao Meduri
