Spike
Resolution: Unresolved
Normal
📊 PromQL Query Analysis & Metrics Gaps
Issue Summary:
The customer is observing gaps in their calculated CPU usage metrics, derived from the following PromQL query:
sum by (pod) ((rate(container_cpu_usage_seconds_total{pod=~"dv-6d66b99b49-7mcq5", cluster="ocp", namespace="vxtest-dc13",} [2m]) * 100) scalar(count(container_cpu_usage_seconds_total{pod=~"dv-6d66b99b49-7mcq5", cluster="test2", namespace="test-dc13",})))
Technical Finding:
Insufficient Rate Range
The query uses a rate() range of [2m] on a time series that is assumed to have a 1-minute scrape interval.
Per Prometheus best practices, the range vector for rate() should be at least 4 times the scrape interval to ensure stability and resilience against missed scrapes.
Minimum Recommended Range = 4 × Scrape Interval
For a 1-minute interval: 4 × 1m = 4m. The current 2m range is insufficient and is the likely cause of the observed data gaps.
Reference: https://www.robustperception.io/what-range-should-i-use-with-rate/
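The effect of a short range can be illustrated with a small simulation. This is a minimal sketch, not Prometheus code: the function names are hypothetical, the timestamps are made up, and the window is modelled simply as (eval_time - window, eval_time]. It relies only on the fact that rate() needs at least two samples inside the range to return a value.

```python
def samples_in_window(sample_times, eval_time, window):
    """Return the sample timestamps inside (eval_time - window, eval_time]."""
    return [t for t in sample_times if eval_time - window < t <= eval_time]


def rate_has_data(sample_times, eval_time, window):
    """rate() needs at least two samples in the window to produce a value."""
    return len(samples_in_window(sample_times, eval_time, window)) >= 2


# Hypothetical scrapes every 60 s from t=0 to t=600, with the scrape at t=120 missed.
scrape_interval = 60
scrapes = [t for t in range(0, 601, scrape_interval) if t != 120]

eval_time = 170
print(rate_has_data(scrapes, eval_time, window=120))  # [2m] range -> False
print(rate_has_data(scrapes, eval_time, window=240))  # [4m] range -> True
```

With a single missed scrape, the [2m] window holds only one sample at this evaluation time, so the query returns no point there, while the [4m] window still holds two.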
Actionable Recommendations:
Correct the PromQL Query: Advise the customer to increase the rate() range to [5m] (or at least [4m]) in their queries to stabilize the calculation:
sum by (pod) ((rate(container_cpu_usage_seconds_total{...} [5m]) * 100) scalar(count(container_cpu_usage_seconds_total{...})))
Verify Raw Data Integrity: Ask the customer to confirm whether the stored raw metrics are complete. This helps isolate the issue to the query versus a collection problem.
Question: Are they also seeing missing data points (gaps) when querying the raw time series (e.g., container_cpu_usage_seconds_total{...} without any functions like rate() or sum())?
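One way to check the raw series for gaps is to compare consecutive sample timestamps against the expected scrape interval. The sketch below assumes the timestamps have already been exported (for example, from an instant query with a range selector such as container_cpu_usage_seconds_total{...}[30m], which returns raw samples). The function name, the tolerance value, and the sample data are hypothetical.

```python
def find_gaps(timestamps, scrape_interval, tolerance=1.5):
    """Return (previous, current) timestamp pairs spaced wider than expected.

    A pair counts as a gap when the spacing exceeds
    scrape_interval * tolerance (an assumed threshold to absorb jitter).
    """
    ordered = sorted(timestamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > scrape_interval * tolerance:
            gaps.append((prev, cur))
    return gaps


# Hypothetical raw sample timestamps (seconds) for a 60 s scrape interval.
raw = [0, 60, 120, 240, 300, 480]
print(find_gaps(raw, scrape_interval=60))  # [(120, 240), (300, 480)]
```

If this reports gaps in the raw samples, the problem is likely on the collection side rather than in the query range alone.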
📚 Appendix: PromQL Best Practices
The 4× rule is critical for reliable multi-cluster observability (MCO). Here are general guidelines:

| Scrape Interval | Minimum Range (4×) | Recommended Standard Range | Example Usage (Query) |
|---|---|---|---|
| 15 seconds | 60s | 1m | rate(http_requests_total[1m]) |
| 30 seconds | 120s | 2m | rate(container_cpu_usage_seconds_total[2m]) |
| 1 minute | 240s | 5m | rate(node_cpu_seconds_total[5m]) |
Application Examples

| Scenario | Interval | Rule | PromQL/YAML Example |
|---|---|---|---|
| Service Dashboard | 15s | 4 × 15s = 60s | rate(http_requests_total[1m]) |
| Container Alerting | 30s | 4 × 30s = 120s | rate(container_cpu_usage_seconds_total[2m]) |
| Recording Rule | 10s | 4 × 10s = 40s | record: database_queries:rate → expr: rate(database_queries_total[1m]) |
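
The arithmetic behind these guidelines can be reproduced with a short helper. This is only a sketch of the 4× rule as stated in this ticket. The function names are hypothetical and the duration formatting is simplified to whole minutes or seconds.

```python
def min_rate_range_seconds(scrape_interval_seconds, factor=4):
    """Minimum rate() range in seconds: factor times the scrape interval."""
    return scrape_interval_seconds * factor


def to_promql_duration(seconds):
    """Format seconds as a PromQL duration, using minutes when they divide evenly."""
    if seconds % 60 == 0:
        return f"{seconds // 60}m"
    return f"{seconds}s"


for interval in (10, 15, 30, 60):
    seconds = min_rate_range_seconds(interval)
    print(f"{interval}s scrape -> [{to_promql_duration(seconds)}]")
# 10s scrape -> [40s]
# 15s scrape -> [1m]
# 30s scrape -> [2m]
# 60s scrape -> [4m]
```

These are minimums. Rounding up to a standard range such as [5m] for a 1-minute interval leaves extra headroom for missed scrapes.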
[ ] Mandatory: Add the required version to the Fix version/s field.
ACM 2.16
[ ] Mandatory: Choose the type of documentation change or review.
[ ] We need to update to an existing topic
[ ] We need to add a new document to an existing section
[ ] We need a whole new section; this is a function not documented before and doesn't belong in any current section
[ ] We need an Operator Advisory review and approval
[ ] We need a z-Stream (Errata) Advisory and Release note for MCE and/or ACM
[ ] Mandatory: Find the link to where the documentation update should go and add it to the recommended changes. You can either use the published doc or the staged repo for this step:
Note: As the feature and doc are understood, this recommendation may change. If this is new documentation, link to the section where you think it should be placed.
Customer Portal published version
https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.12
Doc staged repo within the ACM Workspace:
https://github.com/stolostron/rhacm-docs
[ ] Mandatory for GA content:
[ ] Add steps, the diff, known issue, and/or other important conceptual information in the following space:
[ ] Add Required access level (example, Cluster Administrator) for the user to complete the task:
[ ] Add verification at the end of the task, how does the user verify success (a command to run or a result to see?)
[ ] Add link to dev story here:
[ ] Mandatory for bugs: What is the diff? Clearly define what the problem is, what the change is, and link to the current documentation. Only use this for a documentation bug.