-
Task
-
Resolution: Unresolved
The `PrometheusPossibleNarrowSelectors` alert was added to help identify misuse of label selectors after the Prometheus v3 update (more details below).
Setting the Prometheus/Thanos log level to "debug" (see
https://docs.openshift.com/container-platform/latest/observability/monitoring/configuring-the-monitoring-stack.html#setting-log-levels-for-monitoring-components_configuring-the-monitoring-stack)
should provide insights into the affected queries and relabeling configs.
See attached PR for how to fix.
If assistance is needed, please leave a comment.
—
With Prometheus v3, the values of the classic histogram "le" label and the summary "quantile" label are always formatted as floats (for example, "1" becomes "1.0").
All queries (in alerts, recording rules, dashboards, or interactive ones) whose selectors assume integer-formatted "le"/"quantile" values should be adjusted.
The same applies to relabel configs.
Queries:
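foo_bucket{le="1"} may need to be turned into foo_bucket{le=~"1(\\.0)?"}
foo_bucket{le=~"1|3"} may need to be turned into foo_bucket{le=~"(1|3)(\\.0)?"}
Note that the alternatives must be grouped so that the optional ".0" suffix applies to each of them, and that the backslash is doubled because PromQL double-quoted strings treat "\" as an escape character.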
(same applies to the "quantile" label)
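The effect of grouping the alternatives can be checked outside Prometheus. The following is a minimal sketch, not Prometheus code: it assumes Python's `re.fullmatch` is a close enough stand-in for Prometheus's fully anchored regex matchers for patterns this simple, and the `widen` and `matches` helpers are hypothetical names introduced only for illustration.

```python
import re


def widen(values):
    """Build a pattern that accepts each value with or without a trailing '.0'."""
    alternatives = "|".join(re.escape(v) for v in values)
    # Group the alternatives so the optional suffix applies to all of them.
    return f"({alternatives})(\\.0)?"


def matches(pattern, label_value):
    # Prometheus anchors matcher regexes at both ends; fullmatch mimics that.
    return re.fullmatch(pattern, label_value) is not None


ungrouped = r"1|3(\.0)?"     # optional suffix binds only to "3"
grouped = widen(["1", "3"])  # (1|3)(\.0)?

for value in ["1", "1.0", "3", "3.0", "10"]:
    print(value, matches(ungrouped, value), matches(grouped, value))
```

With the ungrouped pattern, a series labeled `le="1.0"` is silently dropped from the selection, which is the kind of narrow selector the alert is meant to surface.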
Relabel configs:
- action: foo
  regex: foo_bucket;(1|3|5|15.5)
  sourceLabels:
  - __name__
  - le

may need to be adjusted to

- action: foo
  regex: foo_bucket;(1|3|5|15.5)(\.0)?
  sourceLabels:
  - __name__
  - le
(same applies to the "quantile" label)
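To see why the relabel regex needs the optional suffix, recall that relabeling joins the source label values with a separator (";" by default) and matches the regex against the whole joined string. The sketch below imitates that behavior in Python under the assumption that `re.fullmatch` approximates Prometheus's anchored matching; `relabel_matches` is a hypothetical helper, not part of any Prometheus API.

```python
import re


def relabel_matches(regex, source_values, separator=";"):
    """Join the source label values and test the regex against the full string."""
    joined = separator.join(source_values)
    return re.fullmatch(regex, joined) is not None


old_regex = r"foo_bucket;(1|3|5|15.5)"
new_regex = r"foo_bucket;(1|3|5|15.5)(\.0)?"

# Values of (__name__, le) before and after the v3 formatting change.
before = ["foo_bucket", "1"]
after = ["foo_bucket", "1.0"]

print(relabel_matches(old_regex, before))  # old regex, old formatting
print(relabel_matches(old_regex, after))   # old regex, new formatting
print(relabel_matches(new_regex, before))  # adjusted regex, old formatting
print(relabel_matches(new_regex, after))   # adjusted regex, new formatting
```

The adjusted regex accepts both formats, so the same config keeps working for series ingested before and after the update.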
Also, from upstream Prometheus:
Aggregation by the `le` and `quantile` labels for vectors that contain the old and new formatting will lead to unexpected results, and range vectors that span the transition between the different formatting will contain additional series. The most common use case for both is the quantile calculation via `histogram_quantile`, e.g. `histogram_quantile(0.95, sum by (le) (rate(histogram_bucket[10m])))`. The `histogram_quantile` function already tries to mitigate the effects to some extent, but there will be inaccuracies, in particular for shorter ranges that cover only a few samples.
A warning about this should suffice, as adjusting such queries would be difficult, if not impossible, and attempting to do so might complicate things further.
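The aggregation problem comes from label values being compared as strings, so "1" and "1.0" are different groups even though they denote the same bucket boundary. The toy model below illustrates this with made-up samples; `sum_by_le` is a hypothetical stand-in for `sum by (le)` and does not reproduce how Prometheus evaluates queries.

```python
from collections import defaultdict


def sum_by_le(samples):
    """Sum sample values grouped by the raw string value of the 'le' label."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels["le"]] += value
    return dict(totals)


# Made-up samples: one series still carries the old formatting.
samples = [
    ({"le": "1", "instance": "a"}, 4.0),    # old formatting
    ({"le": "1.0", "instance": "b"}, 6.0),  # new formatting
    ({"le": "+Inf", "instance": "a"}, 5.0),
    ({"le": "+Inf", "instance": "b"}, 7.0),
]

result = sum_by_le(samples)
print(result)
```

The bucket with upper bound 1 is split into two groups instead of one, which is why results computed across the formatting transition can be inaccurate.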
See attached PRs for examples.
A downstream check to help surface such misconfigurations was added. An alert will fire for configs that aren't enabled by default and that may need to be adjusted.
For more details, see https://docs.google.com/document/d/11c0Pr2-Zn3u3cjn4qio8gxFnu9dp0p9bO7gM45YKcNo/edit?tab=t.0#bookmark=id.f5p0o1s8vyjf