OBSDOCS-65

Update OCP docs to replace sample PromQL query that might time out and cause excessive load on Prometheus


    • OBSDOCS (Mar 25 - Apr 15) #251

      Description of problem:

      In the "Troubleshooting monitoring issues" section, the documentation tells users that they can run the following PromQL query to identify high-cardinality metrics:

      topk(10,count by (job)({__name__=~".+"}))

      This query is expensive because the `{__name__=~".+"}` selector matches every series in the TSDB, which can trigger timeouts or even out-of-memory crashes in Prometheus.

      Instead we can document these queries:

      • "topk(10, max by(namespace, job) (topk by(namespace, job) (1, scrape_samples_post_metric_relabeling)))" => top-10 jobs exposing the highest number of samples.
      • "topk(10, sum by(namespace, job) (sum_over_time(scrape_series_added[1h])))" => top-10 jobs that created most of the series in the last hour (helps to identify series churn).

      The queries can be tuned to return data only for the Platform or UWM Prometheus (e.g. '... scrape_samples_post_metric_relabeling{prometheus="prometheus/k8s"}' or '... scrape_samples_post_metric_relabeling{prometheus="prometheus/user-workload-monitoring"}').
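      As a rough illustration of how the two replacement queries and the optional per-instance filter fit together, here is a minimal Python sketch that only assembles the PromQL strings. The helper names (`scoped`, `top_jobs_by_samples`, `top_jobs_by_churn`) are hypothetical and not part of any OCP or Prometheus tooling; the `prometheus` label values are the ones quoted in this issue and should be checked against the cluster before they are documented.

```python
from typing import Optional


def scoped(metric: str, prometheus: Optional[str] = None) -> str:
    """Return the metric selector, optionally limited to one Prometheus instance.

    The label value is passed through as-is; this issue quotes
    "prometheus/k8s" and "prometheus/user-workload-monitoring".
    """
    if prometheus is None:
        return metric
    return f'{metric}{{prometheus="{prometheus}"}}'


def top_jobs_by_samples(prometheus: Optional[str] = None, k: int = 10) -> str:
    """Top-k jobs exposing the highest number of samples per scrape."""
    selector = scoped("scrape_samples_post_metric_relabeling", prometheus)
    return (
        f"topk({k}, max by(namespace, job) "
        f"(topk by(namespace, job) (1, {selector})))"
    )


def top_jobs_by_churn(
    prometheus: Optional[str] = None, k: int = 10, window: str = "1h"
) -> str:
    """Top-k jobs that added the most series over the given window."""
    selector = scoped("scrape_series_added", prometheus)
    return (
        f"topk({k}, sum by(namespace, job) "
        f"(sum_over_time({selector}[{window}])))"
    )


if __name__ == "__main__":
    print(top_jobs_by_samples())
    print(top_jobs_by_churn(prometheus="prometheus/k8s"))
```

      Both queries read only the per-target scrape metrics, so their cost grows with the number of targets rather than with the total number of series.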

      The documentation also says to:

      • Check the time series database (TSDB) status using the Prometheus HTTP API for more information about which labels are creating the most time series. Doing so requires cluster administrator privileges.

      This check is also a good indicator of high-cardinality metrics, but it should appear first in the list.
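      For reference, the TSDB status check can be sketched in Python against the Prometheus HTTP API endpoint `/api/v1/status/tsdb`. This is a minimal sketch, not the procedure from the OCP docs: `base_url` and `token` are assumptions that the caller must supply (this issue does not say how they are obtained), and the function names are hypothetical.

```python
import json
import ssl
import urllib.request


def fetch_tsdb_status(base_url: str, token: str, timeout: float = 30.0) -> dict:
    """Fetch TSDB cardinality statistics from a Prometheus server.

    base_url and token are assumptions: the URL of a reachable Prometheus
    API and a bearer token with sufficient privileges.
    """
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v1/status/tsdb",
        headers={"Authorization": f"Bearer {token}"},
    )
    context = ssl.create_default_context()
    with urllib.request.urlopen(request, context=context, timeout=timeout) as resp:
        return json.load(resp)


def top_entries(status: dict, field: str = "seriesCountByMetricName", k: int = 10):
    """Return the k largest (name, value) pairs from one statistics list.

    Other fields in the response include "labelValueCountByLabelName" and
    "seriesCountByLabelValuePair".
    """
    entries = status.get("data", {}).get(field, [])
    ranked = sorted(entries, key=lambda entry: entry["value"], reverse=True)
    return [(entry["name"], entry["value"]) for entry in ranked[:k]]


if __name__ == "__main__":
    # Hand-written placeholder payload in the shape of the API response;
    # the names and numbers are illustrative only, not real cluster data.
    sample = {
        "status": "success",
        "data": {
            "seriesCountByMetricName": [
                {"name": "metric_a", "value": 5},
                {"name": "metric_b", "value": 9},
            ]
        },
    }
    print(top_entries(sample, k=1))
```

      Unlike the `{__name__=~".+"}` query, this endpoint reports statistics that Prometheus already tracks for the TSDB head block, so it does not need to evaluate a query over every series.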

       

      In addition, update the CLI steps in the procedure to bring the command format in line with current OCP docs code guidelines.

              rhn-support-bburt Brian Burt