Type: Story
Resolution: Done
Priority: Normal
Fix Version/s: OpenShift 4.14 Async, OpenShift 4.11 Async, OpenShift 4.12 Async, OpenShift 4.13 Async, OpenShift 4.15 Async, OpenShift 4.16 Freeze
Affects Version/s: None
Component/s: Monitoring
Labels:
- stretch-goal

Story Points: 5
Blocked: False
Blocked Reason: None
Ready: False
Bugzilla Bug: RHBZ: 2063063
Sprint: OBSDOCS (Mar 25 - Apr 15) #251

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Description of problem:

In the "Troubleshooting monitoring issues" section, the documentation tells users that they can run the following PromQL query to identify high-cardinality metrics:

topk(10,count by (job)({__name__=~".+"}))

This query is expensive because it matches every series in Prometheus, so it can trigger timeouts or even out-of-memory crashes.

Instead, we can document these queries:

"topk(10, max by(namespace, job) (topk by(namespace, job) (1, scrape_samples_post_metric_relabeling)))" => the top 10 jobs exposing the highest number of samples.
"topk(10, sum by(namespace, job) (sum_over_time(scrape_series_added[1h])))" => the top 10 jobs that created the most series in the last hour (helps to identify series churn).

The queries can be tuned to return data only for the Platform or UWM Prometheus (e.g. '... scrape_samples_post_metric_relabeling{prometheus="prometheus/k8s"}' or '... scrape_samples_post_metric_relabeling{prometheus="prometheus/user-workload-monitoring"}').
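To illustrate how the tuned query could be assembled and sent to the Prometheus HTTP API, here is a minimal Python sketch. The function names, the `limit` parameter, and the `example.invalid` base URL are hypothetical and chosen for illustration only; the `prometheus` label values are the ones quoted above and are not verified here. The sketch only builds the query string and the `/api/v1/query` URL; it does not handle authentication or send the request.

```python
from urllib.parse import urlencode


def top_jobs_by_samples(prometheus=None, limit=10):
    """Return the PromQL for the jobs exposing the most samples.

    `prometheus` is an optional value for the `prometheus` label, used to
    scope the query to one Prometheus instance (hypothetical parameter).
    """
    # Doubled braces produce literal "{" and "}" in an f-string.
    selector = f'{{prometheus="{prometheus}"}}' if prometheus else ""
    metric = f"scrape_samples_post_metric_relabeling{selector}"
    return (
        f"topk({limit}, max by(namespace, job) "
        f"(topk by(namespace, job) (1, {metric})))"
    )


def instant_query_url(base_url, promql):
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    query = top_jobs_by_samples(prometheus="prometheus/k8s")
    print(query)
    # Placeholder host; replace with the route of the Prometheus to query.
    print(instant_query_url("https://example.invalid", query))
```

Called without arguments, `top_jobs_by_samples()` returns the unscoped query exactly as proposed in this ticket.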

The documentation also says to:

Check the time series database (TSDB) status using the Prometheus HTTP API for more information about which labels are creating the most time series. Doing so requires cluster administrator privileges.

This check can also be a good indicator of high-cardinality metrics, but it should appear first in the list.
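As a rough sketch of how the TSDB status output could be read, the Python snippet below ranks metric names by series count. It assumes the response of the Prometheus `/api/v1/status/tsdb` endpoint carries a `seriesCountByMetricName` list of `name`/`value` entries under `data`; the function name and the sample payload are made up for illustration and are not real cluster output.

```python
import json


def top_series_by_metric(tsdb_status, limit=5):
    """Return (metric name, series count) pairs, highest count first."""
    # Assumed response shape: {"data": {"seriesCountByMetricName": [...]}}
    entries = tsdb_status.get("data", {}).get("seriesCountByMetricName", [])
    ranked = sorted(entries, key=lambda e: int(e["value"]), reverse=True)
    return [(e["name"], int(e["value"])) for e in ranked[:limit]]


# Fabricated sample payload, for illustration only.
sample = json.loads(
    '{"status": "success", "data": {"seriesCountByMetricName": ['
    '{"name": "metric_a", "value": 120}, '
    '{"name": "metric_b", "value": 4500}]}}'
)

if __name__ == "__main__":
    for name, count in top_series_by_metric(sample):
        print(f"{name}: {count}")
```

Fetching the status itself still requires cluster administrator privileges, as noted above; the sketch only covers interpreting a response that has already been retrieved.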

In addition, update the CLI steps in the procedure to bring the command format in line with current OCP docs code guidelines.

links to

openshift/openshift-docs#74236: OBSDOCS-65: update-monitoring-troubleshooting-sample-code

openshift/openshift-docs#74359: [enterprise-4.13] OBSDOCS-65: update-monitoring-troubleshooting-sample-code

openshift/openshift-docs#74360: [enterprise-4.14] OBSDOCS-65: update-monitoring-troubleshooting-sample-code

openshift/openshift-docs#74361: [enterprise-4.16] OBSDOCS-65: update-monitoring-troubleshooting-sample-code

openshift/openshift-docs#74362: [enterprise-4.15] OBSDOCS-65: update-monitoring-troubleshooting-sample-code

openshift/openshift-docs#74364: [enterprise-4.12] OBSDOCS-65: update-monitoring-troubleshooting-sample-code
