  1. Red Hat Advanced Cluster Management
  2. ACM-24724

Document configuration recommendations for scraping metrics


      📊 PromQL Query Analysis & Metrics Gaps
      Issue Summary:

      The customer is observing gaps in their calculated CPU usage metrics, derived from the following PromQL query:

      sum by (pod) ((rate(container_cpu_usage_seconds_total{pod=~"dv-6d66b99b49-7mcq5", cluster="ocp", namespace="vxtest-dc13",} [2m]) * 100) / scalar(count(container_cpu_usage_seconds_total{pod=~"dv-6d66b99b49-7mcq5", cluster="test2", namespace="test-dc13",})))
      

      Technical Finding:
      Insufficient Rate Range

      The query uses a rate() range of [2m] on a timeseries that is assumed to have a 1-minute scrape interval.

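      Per Prometheus best practices, the range vector for rate() should be at least 4 times the scrape interval to ensure stability and resilience against missed scrapes. rate() needs at least two samples inside the range to return a value, so with a 1-minute interval and a [2m] range, a single missed or delayed scrape can leave the window with fewer than two samples and the result is empty for that evaluation.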

      Minimum recommended range = 4 × scrape interval
      For a 1-minute interval: 4 × 1m = 4m. The current 2m range is insufficient and is the likely cause of the observed data gaps.

      Reference: https://www.robustperception.io/what-range-should-i-use-with-rate/
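The effect of a short range can be illustrated with a small simulation. This is only a sketch under stated assumptions: the sample timestamps are hypothetical (a 60-second scrape with the scrape at t=180 missed), the window is treated as left-open, and the helper names are made up for illustration. It counts how many samples fall inside the range at each evaluation time, since rate() needs at least two.

```python
from typing import List


def samples_in_window(sample_times: List[float], eval_time: float, range_s: float) -> int:
    """Count samples with a timestamp in (eval_time - range_s, eval_time].

    Assumption: the window is left-open; the evaluation times below are
    chosen so that the boundary convention does not change the result.
    """
    return sum(1 for t in sample_times if eval_time - range_s < t <= eval_time)


def rate_is_defined(sample_times: List[float], eval_time: float, range_s: float) -> bool:
    """rate() can only produce a value when the window holds two or more samples."""
    return samples_in_window(sample_times, eval_time, range_s) >= 2


# Hypothetical series: scraped every 60s, the scrape at t=180 was missed.
scrape_times = [0, 60, 120, 240, 300, 360]
evaluation_times = [170, 230, 290, 310]

# For each evaluation time, check whether a 2m and a 5m range yield a value.
defined_2m = [rate_is_defined(scrape_times, t, 120) for t in evaluation_times]
defined_5m = [rate_is_defined(scrape_times, t, 300) for t in evaluation_times]
```

With these inputs the 2m range has no value at t=230 and t=290, while the 5m range has a value at every evaluation time.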

      Actionable Recommendations:

      Correct the PromQL query: Advise the customer to increase the rate() range to [5m] (or at least [4m]) in their queries to stabilize the calculation:

      sum by (pod) ((rate(container_cpu_usage_seconds_total{...} [5m]) * 100) / scalar(count(container_cpu_usage_seconds_total{...})))
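A quick way to spot queries with the same problem is to scan them for short range selectors. The following is a minimal sketch, not a PromQL parser: the function name `short_ranges` is hypothetical, it only understands simple selectors such as [2m] or [90s], and it flags every range selector in the query, not only those inside rate().

```python
import re
from typing import List

# Seconds per unit for the simple duration forms handled here.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}

# Matches simple range selectors like [2m]; compound forms such as [1m30s]
# are not handled in this sketch.
_RANGE_SELECTOR = re.compile(r"\[(\d+)([smh])\]")


def short_ranges(query: str, scrape_interval_s: int, factor: int = 4) -> List[str]:
    """Return the range selectors in `query` shorter than factor * scrape interval."""
    minimum_s = factor * scrape_interval_s
    too_short = []
    for match in _RANGE_SELECTOR.finditer(query):
        range_s = int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        if range_s < minimum_s:
            too_short.append(match.group(0))
    return too_short


# Example with a 60-second scrape interval (label values are placeholders).
flagged = short_ranges('rate(container_cpu_usage_seconds_total{pod="example"} [2m])', 60)
```

Here `flagged` contains the [2m] selector, while the same query with [4m] or [5m] would pass the check.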
      

      Verify raw data integrity: Ask the customer to confirm whether the underlying stored metrics are complete. This helps isolate the issue to the query versus a collection problem.

      Question: Are they also seeing missing data points (gaps) when querying the raw timeseries (e.g., container_cpu_usage_seconds_total{...} without any functions like rate() or sum())?
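If the customer can export the raw sample timestamps (for example from an instant query over a range selector), the check for gaps can be automated. The sketch below assumes the timestamps are already available as a list of Unix seconds; the function name `find_gaps` and the 1.5× tolerance are illustrative choices, not part of any Prometheus or ACM tooling.

```python
from typing import List, Tuple


def find_gaps(
    sample_times: List[float],
    scrape_interval_s: float,
    tolerance: float = 1.5,
) -> List[Tuple[float, float]]:
    """Return (previous, next) timestamp pairs that are farther apart than
    tolerance * scrape interval, which suggests at least one missed scrape."""
    ordered = sorted(sample_times)
    limit = tolerance * scrape_interval_s
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


# Hypothetical raw timestamps for a 60-second scrape with one missed sample.
gaps = find_gaps([0, 60, 120, 240, 300], 60)
```

An empty result points toward the query range as the cause; a non-empty result points toward a collection or storage problem.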

      📚 Appendix: PromQL Best Practices
      The 4× rule is critical for reliable multi-cluster observability (MCO). Here are general guidelines:

      Scrape Interval	Minimum Range (4×)	Recommended Standard Range	Example Usage (Query)
      15 seconds	60s	1m	rate(http_requests_total[1m])
      30 seconds	120s	2m	rate(container_cpu_usage_seconds_total[2m])
      1 minute	240s	5m	rate(node_cpu_seconds_total[5m])
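The "Minimum Range" column can be derived mechanically from the scrape interval. This is a small sketch with hypothetical helper names; it formats the result as a simple Prometheus-style duration and does not choose the rounded "recommended standard" value for you.

```python
def min_rate_range_seconds(scrape_interval_s: int, factor: int = 4) -> int:
    """Minimum rate() range in seconds under the 4x rule."""
    return factor * scrape_interval_s


def format_duration(seconds: int) -> str:
    """Format whole seconds as a single-unit duration such as 40s, 2m, or 1h."""
    if seconds % 3600 == 0:
        return f"{seconds // 3600}h"
    if seconds % 60 == 0:
        return f"{seconds // 60}m"
    return f"{seconds}s"


# Minimum ranges for the scrape intervals listed in the table above.
minimum_ranges = {
    interval_s: format_duration(min_rate_range_seconds(interval_s))
    for interval_s in (15, 30, 60)
}
```

For a 60-second interval this yields 4m, which the table rounds up to the more conventional 5m.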
      

      Application Examples

      Scenario	Interval	Rule	PromQL/YAML Example
      Service Dashboard	15s	4×15s=60s	rate(http_requests_total[1m])
      Container Alerting	30s	4×30s=120s	rate(container_cpu_usage_seconds_total[2m])
      Recording Rule	10s	4×10s=40s	record: database_queries:rate → expr: rate(database_queries_total[1m])
      

      [ ] Mandatory: Add the required version to the Fix version/s field.
      ACM 2.16

      [ ] Mandatory: Choose the type of documentation change or review.

      [ ] We need to update an existing topic

      [ ] We need to add a new document to an existing section

      [ ] We need a whole new section; this is a function not documented before and doesn't belong in any current section

      [ ] We need an Operator Advisory review and approval

      [ ] We need a z-Stream (Errata) Advisory and Release note for MCE and/or ACM

      [ ] Mandatory: Find the link to where the documentation update should go and add it to the recommended changes. You can either use the published doc or the staged repo for this step:
      Note: As the feature and doc are understood, this recommendation may change. If this is new documentation, link to the section where you think it should be placed.
      Customer Portal published version
      https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.12
      Doc staged repo within the ACM Workspace:
      https://github.com/stolostron/rhacm-docs

      [ ] Mandatory for GA content:

      [ ] Add steps, the diff, known issue, and/or other important conceptual information in the following space:

      [ ] Add Required access level (example, Cluster Administrator) for the user to complete the task:

      [ ] Add verification at the end of the task, how does the user verify success (a command to run or a result to see?)

      [ ] Add link to dev story here:

      [ ] Mandatory for bugs: What is the diff? Clearly define what the problem is, what the change is, and link to the current documentation. Only use this for a documentation bug.

              rhn-support-cstark Christian Stark