-
Story
-
Resolution: Done
-
Normal
-
None
Description of problem:
In the "Troubleshooting monitoring issues" section, the documentation tells users that they can run the following PromQL query to identify high-cardinality metrics:
topk(10,count by (job)({{}name{}=~".+"}))
This query is expensive which may trigger timeouts or even out-of-memory crashes.
Instead we can document these queries:
- "topk(10, max by(namespace, job) (topk by(namespace, job) (1, scrape_samples_post_metric_relabeling)))" => top-10 jobs exposing the highest number of samples.
- "topk(10, sum by(namespace, job) (sum_over_time(scrape_series_added[1h])))" => top-10 jobs that created most of the series in the last hour (helps to identify series churn).
The queries can be tuned to return data only for the Platform or UWM Prometheus (e.g. '... scrape_samples_post_metric_relabeling
{prometheus="prometheus/k8s"}' or '... scrape_samples_post_metric_relabeling
{prometheus="prometheus/user-workload-monitoring"}').
The documentation also says to:
- Check the time series database (TSDB) status using the Prometheus HTTP API for more information about which labels are creating the most time series. Doing so requires cluster administrator privileges.
which can also be a good indicator of high cardinality metrics, but this should appear first in the list.
In addition, update the CLI steps in the procedure to bring the command format in line with current OCP docs code guidelines.
- links to