Type: Bug
Resolution: Done
Priority: Normal
Fix Version/s: ACM 2.7.4
Affects Version/s: ACM 2.7.0
Component/s: Observability
Labels:
- Obs-Core
- QE
- Trian-03

Story Points:
3
Blocked:
False
Blocked Reason:
None
Ready:
False
Git Pull Request:
https://github.com/stolostron/multicluster-observability-operator/pull/1178
Intelligence Requested:
Market:

Test Coverage:

-
Regression:
No

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

There appears to be a bug in the RHACM Observatorium dashboard for the metric "CPU requests as a percentage of allocatable CPU resource within a cluster". In our simple test the metric came out at almost 20 times the allocatable CPU resource, which is not possible.

Please see the attached PDF document for details; must-gathers are also attached.

The PDF also includes a locally created PromQL expression that attempts to give an accurate measure of CPU requests as a percentage of the CPU resource that is actually available and allocatable within a cluster.

Custom PromQL:
sum(
  (
    kube_pod_container_resource_requests{resource="cpu"}
    * on (pod, namespace) group_left (phase) kube_pod_status_phase{phase="Running"}
  )
  * on (node) group_left (role) kube_node_role{role="app"}
)
/
sum(
  kube_node_status_allocatable{resource="cpu"}
  * on (node) group_left (role) kube_node_role{role="app"}
)

The expression attempts to be more accurate by:

- Aggregating CPU requests, kube_pod_container_resource_requests{resource="cpu"}, only for pods in the "Running" state that are running on "app" nodes.
- Restricting the allocatable CPU resource to "app" worker nodes only.
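To make the intent of the two filters concrete, here is a minimal Python sketch of the same arithmetic on made-up data. The Pod and Node classes, the cpu_request_ratio function, and every value in the sample inventory are hypothetical and exist only for illustration; this is not the dashboard's query or the fix in the linked pull request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    namespace: str
    node: str
    phase: str          # e.g. "Running", "Succeeded", "Pending"
    cpu_request: float  # CPU cores requested by the pod's containers


@dataclass(frozen=True)
class Node:
    name: str
    role: str               # e.g. "app", "infra"
    cpu_allocatable: float  # allocatable CPU cores


def cpu_request_ratio(pods, nodes, role="app"):
    """Requested CPU of Running pods on `role` nodes / allocatable CPU of those nodes."""
    selected = {n.name for n in nodes if n.role == role}
    allocatable = sum(n.cpu_allocatable for n in nodes if n.name in selected)
    if allocatable == 0:
        return None  # no matching nodes, so the ratio is undefined
    requested = sum(
        p.cpu_request
        for p in pods
        if p.phase == "Running" and p.node in selected
    )
    return requested / allocatable


# Hypothetical inventory: two "app" nodes and one "infra" node.
NODES = [
    Node("app-1", "app", 8.0),
    Node("app-2", "app", 8.0),
    Node("infra-1", "infra", 4.0),
]
PODS = [
    Pod("web", "shop", "app-1", "Running", 2.0),
    Pod("api", "shop", "app-2", "Running", 2.0),
    Pod("batch", "shop", "app-1", "Succeeded", 4.0),   # excluded: not Running
    Pod("router", "infra", "infra-1", "Running", 1.0),  # excluded: not on an app node
]

ratio = cpu_request_ratio(PODS, NODES)
print(f"CPU requests / allocatable on app nodes: {ratio:.0%}")
```

In this sample the completed batch pod and the pod on the infra node are both left out of the numerator, and the infra node is left out of the denominator, which mirrors what the two joins in the expression are meant to achieve.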

We would like a query that is accurate for our scenario and can be used for alerting as well as capacity management. I have compared this with other metrics available from common dashboards on RHACM or the OCP UI.

This is a key metric for operational stability and capacity management across our fleet of clusters, so we are looking for guidance.

Cluster_Capacity_Issues.pdf
1002 kB
2023/03/09 10:11 PM

Assignee: Subbarao Meduri

Reporter: James Young

QA Contact: Xiang Yin

Votes: 0

Watchers: 6

Created: 2023/03/09 10:08 PM

Updated: 2025/09/13 9:19 AM

Resolved: 2023/10/04 5:13 PM
