-
Bug
-
Resolution: Done
-
Major
-
ACM 2.6.7
-
1
-
False
-
None
-
False
-
Observability Sprint 2023-11
-
Important
-
No
Description of problem:
ACM - Resource Optimization and several other ACM dashboards currently use the label_values(kube_pod_info{clusterType!="ocp3"},cluster) query to obtain the list of clusters that populates each dashboard's cluster selection list. The kube_pod_info metric has a time series for every pod in every cluster. That makes the query very expensive in a large-scale environment, where it can easily end up processing a million time series (total time series = number of clusters x number of pods per cluster), and this results in timeouts in Grafana.
We could instead use the cluster_version metric, which has only 3 time series per cluster, to get the same results. This metric is also available in all ACM versions 2.5 and later, so the fix should be fully compatible when it is backported.
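As a rough illustration of why the query cost differs so much, the sketch below compares the series counts using the formula above. The cluster and pod counts in the usage block are hypothetical scale-lab figures, not measurements from this environment; the 3 series per cluster for cluster_version is the figure stated in this ticket.

```python
def kube_pod_info_series(clusters: int, pods_per_cluster: int) -> int:
    # kube_pod_info has one series per pod, so cost grows with clusters x pods.
    return clusters * pods_per_cluster


def cluster_version_series(clusters: int, series_per_cluster: int = 3) -> int:
    # cluster_version adds a small fixed number of series per cluster (3 per this ticket).
    return clusters * series_per_cluster


if __name__ == "__main__":
    clusters, pods = 3000, 350  # hypothetical fleet size and pods per cluster
    print(kube_pod_info_series(clusters, pods))  # 1050000
    print(cluster_version_series(clusters))      # 9000
```

The kube_pod_info cost scales with pod density, while the cluster_version cost depends only on the number of clusters.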
The metric cluster_version does not exist in OCP 3.11, so this fix will be applied to OCP 4 dashboards only.
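A minimal sketch of applying the change to an exported OCP 4 dashboard, assuming the cluster list is a Grafana templating variable whose "query" (or "definition") field holds the query as a plain string. The replacement string label_values(cluster_version, cluster) and the function name rewrite_cluster_variable are hypothetical; the ticket names only the cluster_version metric. The clusterType!="ocp3" filter is dropped on the assumption that cluster_version is never reported by OCP 3.11 clusters. Run this only on OCP 4 dashboards.

```python
# Query quoted in this ticket (JSON string escaping removed).
OLD_QUERY = 'label_values(kube_pod_info{clusterType!="ocp3"},cluster)'
# Hypothetical replacement query based on the cluster_version metric.
NEW_QUERY = "label_values(cluster_version, cluster)"


def rewrite_cluster_variable(dashboard: dict) -> int:
    """Swap OLD_QUERY for NEW_QUERY in templating variables.

    Returns the number of variables that were changed. Only plain-string
    query fields are handled; object-style queries are left untouched.
    """
    changed = 0
    for variable in dashboard.get("templating", {}).get("list", []):
        hit = False
        for key in ("query", "definition"):
            if variable.get(key) == OLD_QUERY:
                variable[key] = NEW_QUERY
                hit = True
        if hit:
            changed += 1
    return changed


if __name__ == "__main__":
    dashboard = {"templating": {"list": [{"name": "cluster", "query": OLD_QUERY}]}}
    print(rewrite_cluster_variable(dashboard))            # 1
    print(dashboard["templating"]["list"][0]["query"])    # label_values(cluster_version, cluster)
```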
FYA: dbennett@redhat.com, jbanerje@redhat.com, akrzos@redhat.com, rhn-support-xiyin
Version-Release number of selected component (if applicable):
How reproducible: Always
Steps to Reproduce: Access the ACM - Resource Optimization dashboard in the scale lab environment.
- ...
Actual results: The dashboard times out with a 504 error.
Expected results: The dashboard loads quickly with a list of managed clusters.
Additional info:
- clones: ACM-5057 ACM Resource Optimization dashboard times out in large scale env [2.7] (Closed)