-
Bug
-
Resolution: Done
-
Normal
-
ACM 2.7.0
There appears to be a bug in the RHACM Observatorium dashboard panel "cpu requests as a percentage of allocatable cpu resource within a cluster": in our simple test the metric came out at almost 20 times the allocatable CPU resource, which is not possible.
Please see the attached PDF document for details; must-gathers are also attached.
The PDF also contains a locally created PromQL expression that attempts to give an accurate measure of CPU requests as a percentage of the CPU resource that is actually available and allocatable within a cluster.
Custom promql:
sum((kube_pod_container_resource_requests{resource="cpu"} * on (pod,namespace) group_left (phase) kube_pod_status_phase{phase="Running"}) * on (node) group_left (role) kube_node_role{role="app"} ) / sum(kube_node_status_allocatable{resource="cpu"} * on (node) group_left(role) kube_node_role{role="app"})
The expression attempts to be more accurate by:
- Aggregating CPU requests, kube_pod_container_resource_requests{resource="cpu"}, only for pods in the "Running" phase and only for pods running on "app" nodes.
- Restricting the allocatable CPU resource, kube_node_status_allocatable{resource="cpu"}, to "app" worker nodes only.
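The effect of the two filters above can be checked by hand on a small data set. The following is only an illustrative sketch in Python, not the dashboard's implementation: the pod names, node names and CPU values are invented, and `cpu_request_ratio` is a hypothetical helper that mirrors the joins in the custom expression.

```python
# Toy samples standing in for the Prometheus series; every value is made up.
# Each tuple mimics one kube_pod_container_resource_requests{resource="cpu"} sample.
requests = [
    # (namespace, pod, container, node, cpu_cores_requested)
    ("shop", "web-1", "nginx", "node-a", 0.5),
    ("shop", "web-1", "sidecar", "node-a", 0.25),
    ("shop", "job-1", "worker", "node-a", 2.0),      # pod is not Running
    ("infra", "router-1", "router", "node-b", 1.0),  # node is not an "app" node
    ("shop", "api-1", "api", "node-c", 1.25),
]

# Pods for which kube_pod_status_phase{phase="Running"} is 1.
running_pods = {("shop", "web-1"), ("infra", "router-1"), ("shop", "api-1")}

# Nodes that carry kube_node_role{role="app"}.
app_nodes = {"node-a", "node-c"}

# kube_node_status_allocatable{resource="cpu"} per node, in cores.
allocatable = {"node-a": 4.0, "node-b": 4.0, "node-c": 4.0}


def cpu_request_ratio(requests, running_pods, app_nodes, allocatable):
    """Return requested CPU / allocatable CPU, restricted to Running pods on app nodes."""
    requested = sum(
        cpu
        for namespace, pod, _container, node, cpu in requests
        if (namespace, pod) in running_pods and node in app_nodes
    )
    capacity = sum(cpu for node, cpu in allocatable.items() if node in app_nodes)
    return requested / capacity


ratio = cpu_request_ratio(requests, running_pods, app_nodes, allocatable)
print(f"{ratio:.2%}")  # 2.0 cores requested / 8.0 cores allocatable
```

As in the custom expression, the result is a ratio between 0 and 1 for a cluster that is not overcommitted on requests; it has to be multiplied by 100 to be read as a percentage.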
We would like a query that is accurate for our scenario and can be used for alerting as well as capacity management. I have compared this expression with other metrics available from common dashboards on RHACM or the OCP UI.
This is a key metric for the operational stability and capacity management of our fleet of clusters, so we are looking for guidance.