
Type: Spike
Resolution: Unresolved
Priority: Normal
Fix Version/s: ACM 2.16.0
Affects Version/s: None
Component/s: Documentation, Observability
Labels:
None

Blocked:
False
Blocked Reason:

None
Ready:
False
Acceptance Criteria:
Provide the required acceptance criteria using this template.

...
Intelligence Requested:
Market:

Regression:
None

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

📊 PromQL Query Analysis & Metrics Gaps
Issue Summary:

The customer is observing gaps in their calculated CPU usage metrics, derived from the following PromQL query:

sum by (pod) ((rate(container_cpu_usage_seconds_total{pod=~"dv-6d66b99b49-7mcq5", cluster="ocp", namespace="vxtest-dc13",} [2m]) * 100) / scalar(count(container_cpu_usage_seconds_total{pod=~"dv-6d66b99b49-7mcq5", cluster="test2", namespace="test-dc13",})))

Technical Finding:
Insufficient Rate Range

The query uses a rate() range of [2m] on a timeseries that is assumed to have a 1-minute scrape interval.

Per Prometheus best practices, the range vector for rate() should be at least 4 times the scrape interval to ensure stability and resilience against missed scrapes.

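Minimum recommended range = 4 × scrape interval
For a 1-minute interval: 4 × 1m = 4m. The current 2m range typically covers only two samples, and rate() returns no value when fewer than two samples fall inside the range, so a single missed or delayed scrape produces a gap. This is the likely cause of the observed data gaps.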

Reference: https://www.robustperception.io/what-range-should-i-use-with-rate/
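
The effect of the range on gaps can be illustrated with a small simulation. This is a minimal sketch, not Prometheus code: it models only the requirement that rate() needs at least two samples inside the range, and the scrape timestamps, evaluation times, and function names are hypothetical.

```python
def rate_has_result(sample_times, eval_time, window):
    """Model only the two-sample requirement of rate():
    at least two samples must fall inside (eval_time - window, eval_time]."""
    in_window = [t for t in sample_times if eval_time - window < t <= eval_time]
    return len(in_window) >= 2


def count_gaps(sample_times, window, step=60, start=300, end=900):
    """Count evaluation steps at which rate() would return no data."""
    return sum(
        not rate_has_result(sample_times, t, window)
        for t in range(start, end + 1, step)
    )


# Hypothetical scrape timestamps (seconds): one scrape every 60s at a 5s
# offset, with the single scrape at t=605 missing.
scrapes = [t + 5 for t in range(0, 901, 60) if t != 600]

gaps_2m = count_gaps(scrapes, window=120)  # range of [2m]
gaps_5m = count_gaps(scrapes, window=300)  # range of [5m]
print(f"[2m] gaps: {gaps_2m}, [5m] gaps: {gaps_5m}")
```

Under these assumptions, one missed scrape leaves two consecutive evaluation steps without data for the [2m] range, while the [5m] range still has enough samples at every step.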

Actionable Recommendations:

Correct the PromQL Query: Advise the customer to increase the rate() range to [5m] (or at least [4m]) in their queries to stabilize the calculation:

sum by (pod) ((rate(container_cpu_usage_seconds_total{...} [5m]) * 100) / scalar(count(container_cpu_usage_seconds_total{...})))

Verify Raw Data Integrity: Ask the customer to confirm whether the stored raw metrics are complete. This helps determine whether the gaps come from the query or from a collection problem.

Question: Are they also seeing missing data points (gaps) when querying the raw timeseries (e.g., container_cpu_usage_seconds_total{...}) without any functions like rate() or sum()?
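
One way to answer this question from exported data is to check the spacing between raw samples. The following is a minimal sketch, assuming the sample timestamps of the raw series have already been exported as Unix seconds; the function name, the tolerance factor, and the sample values are hypothetical.

```python
def find_gaps(timestamps, scrape_interval, tolerance=1.5):
    """Return (previous, next) timestamp pairs whose spacing exceeds
    tolerance * scrape_interval, i.e. where at least one scrape looks missing."""
    ordered = sorted(timestamps)
    limit = tolerance * scrape_interval
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


# Hypothetical raw sample timestamps for a series scraped every 60 seconds.
raw = [0, 60, 120, 240, 300, 360, 540]
missing = find_gaps(raw, scrape_interval=60)
print(missing)
```

If this check reports gaps in the raw samples, the problem is at least partly in collection or storage; if it reports none, the query range is the more likely cause.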

📚 Appendix: PromQL Best Practices
The 4× rule is critical for reliable multi-cluster observability (MCO). Here are general guidelines:

Scrape Interval	Minimum Range (4×)	Recommended Standard Range	Example Usage (Query)
15 seconds	60s	1m	rate(http_requests_total[1m])
30 seconds	120s	2m	rate(container_cpu_usage_seconds_total[2m])
1 minute	240s	5m	rate(node_cpu_seconds_total[5m])

Application Examples

Scenario	Interval	Rule	PromQL/YAML Example
Service Dashboard	15s	4×15s=60s	rate(http_requests_total[1m])
Container Alerting	30s	4×30s=120s	rate(container_cpu_usage_seconds_total[2m])
Recording Rule	10s	4×10s=40s	record: database_queries:rate → expr: rate(database_queries_total[1m])
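
The arithmetic in the tables above can be reproduced with a short helper. This is an illustrative sketch only; the function names are hypothetical and the factor of 4 is the guideline described in this ticket, not a value enforced by Prometheus.

```python
def min_rate_range(scrape_interval_s, factor=4):
    """Minimum rate() range in seconds under the 4x guideline."""
    return factor * scrape_interval_s


def as_promql_duration(seconds):
    """Format whole seconds as a PromQL duration, using minutes when exact."""
    if seconds % 60 == 0:
        return f"{seconds // 60}m"
    return f"{seconds}s"


def is_range_sufficient(range_s, scrape_interval_s):
    """True when the chosen range meets the 4x guideline."""
    return range_s >= min_rate_range(scrape_interval_s)


for interval in (10, 15, 30, 60):
    minimum = as_promql_duration(min_rate_range(interval))
    print(f"scrape interval {interval}s -> minimum range [{minimum}]")
```

For the customer's case, is_range_sufficient(120, 60) is False and is_range_sufficient(300, 60) is True, which matches the recommendation to move from [2m] to [5m].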

[ ] Mandatory: Add the required version to the Fix version/s field.
ACM 2.16

[ ] Mandatory: Choose the type of documentation change or review.

[ ] We need to update an existing topic

[ ] We need to add a new document to an existing section

[ ] We need a whole new section; this is a function not documented before and doesn't belong in any current section

[ ] We need an Operator Advisory review and approval

[ ] We need a z-Stream (Errata) Advisory and Release note for MCE and/or ACM

[ ] Mandatory: Find the link to where the documentation update should go and add it to the recommended changes. You can either use the published doc or the staged repo for this step:
Note: As the feature and doc become better understood, this recommendation may change. If this is new documentation, link to the section where you think it should be placed.
Customer Portal published version
https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.12
Doc staged repo within the ACM Workspace:
https://github.com/stolostron/rhacm-docs

[ ] Mandatory for GA content:

[ ] Add steps, the diff, known issue, and/or other important conceptual information in the following space:

[ ] Add Required access level (example, Cluster Administrator) for the user to complete the task:

[ ] Add verification at the end of the task, how does the user verify success (a command to run or a result to see?)

[ ] Add link to dev story here:

[ ] Mandatory for bugs: What is the diff? Clearly define what the problem is, what the change is, and link to the current documentation. Only use this for a documentation bug.
