[MON-4129] adjust queries and relabel_configs to take into account normalization

Type: Task
Resolution: Done
Priority: Undefined
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: None

Blocked: False
Blocked Reason: None
Ready: False
Epic Link: Prometheus 3
Docs QE Status: NEW
QE Status: NEW

Sprint: MON Sprint 266

The `PrometheusPossibleNarrowSelectors` alert was added to help identify misuse of label selectors after the Prometheus v3 update (more details below).

Setting Prometheus/Thanos log level to "debug" (see
https://docs.openshift.com/container-platform/latest/observability/monitoring/configuring-the-monitoring-stack.html#setting-log-levels-for-monitoring-components_configuring-the-monitoring-stack)
should provide insights into the affected queries and relabeling configs.

See attached PR for how to fix.
If assistance is needed, please leave a comment.

—

With Prometheus v3, the values of the classic histogram "le" label and the summary "quantile" label are normalized to floats (for example, "1" becomes "1.0").

All queries (in alerts, recording rules, dashboards, or interactive sessions) whose selectors assume that "le"/"quantile" values are integers should be adjusted.
The same applies to relabel configs.

Queries:

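foo_bucket{le="1"} may need to be turned into foo_bucket{le=~"1(\\.0)?"}
foo_bucket{le=~"1|3"} may need to be turned into foo_bucket{le=~"(1|3)(\\.0)?"}
(the dot is escaped so that it is not treated as a wildcard, and the alternation is grouped so that the suffix applies to every value)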

(same applies to the "quantile" label)
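
As a rough illustration of the selector changes above, the following Python sketch lists which "le" values a given pattern accepts. It assumes that Prometheus regex matchers are fully anchored, which `re.fullmatch` mimics here; Python's `re` module is not the RE2 engine that Prometheus uses, although the two agree on patterns this simple. The helper name `matches` and the sample values are made up for the example.

```python
import re

def matches(pattern: str, values: list[str]) -> list[str]:
    """Return the label values that the fully anchored pattern accepts."""
    return [v for v in values if re.fullmatch(pattern, v)]

# Sample "le" values in both the integer and the float spelling.
values = ["1", "1.0", "3", "3.0", "10", "100"]

# An exact match on "1" only sees the integer spelling.
print(matches(r"1", values))            # ['1']
# Escaped dot: accepts both spellings of 1 and nothing else.
print(matches(r"1(\.0)?", values))      # ['1', '1.0']
# An unescaped dot is a wildcard, so "100" is accepted as well.
print(matches(r"1(.0)?", values))       # ['1', '1.0', '100']
# Without grouping, the optional suffix only applies to the last branch.
print(matches(r"1|3(\.0)?", values))    # ['1', '3', '3.0']
print(matches(r"(1|3)(\.0)?", values))  # ['1', '1.0', '3', '3.0']
```

The patterns are written as Python raw strings; inside a double-quoted PromQL string the backslash has to be doubled.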

Relabel configs:

    - action: foo
      regex: foo_bucket;(1|3|5|15.5)
      sourceLabels:
      - __name__
      - le

may need to be adjusted to:

    - action: foo
      regex: foo_bucket;(1|3|5|15.5)(\.0)?
      sourceLabels:
      - __name__
      - le

(same applies to the "quantile" label)
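
The effect of the relabel change can be sketched in Python. This assumes the usual relabeling behaviour: the source label values are joined with the default ";" separator and the regex must match the whole joined string. The function `relabel_matches` is a hypothetical stand-in for that matching step, not Prometheus code, and `re.fullmatch` only approximates the RE2 engine.

```python
import re

def relabel_matches(regex: str, source_values: list[str], separator: str = ";") -> bool:
    """Join the source label values and test them against a fully anchored regex."""
    joined = separator.join(source_values)
    return re.fullmatch(regex, joined) is not None

old_regex = r"foo_bucket;(1|3|5|15.5)"
new_regex = r"foo_bucket;(1|3|5|15.5)(\.0)?"

# Print, for each "le" value, whether the old and the new regex match.
for le in ["1", "1.0", "15.5", "7.0"]:
    old_hit = relabel_matches(old_regex, ["foo_bucket", le])
    new_hit = relabel_matches(new_regex, ["foo_bucket", le])
    print(f"le={le!r}: old={old_hit} new={new_hit}")
```

With these inputs, only the adjusted regex matches the "1.0" spelling, while "15.5" matches both and "7.0" matches neither.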

Also, from upstream Prometheus:

Aggregation by the `le` and `quantile` labels for vectors that contain the old and
new formatting will lead to unexpected results, and range vectors that span the
transition between the different formatting will contain additional series.
The most common use case for both is the quantile calculation via
`histogram_quantile`, e.g.
`histogram_quantile(0.95, sum by (le) (rate(histogram_bucket[10m])))`.
The `histogram_quantile` function already tries to mitigate the effects to some
extent, but there will be inaccuracies, in particular for shorter ranges that
cover only a few samples.

A warning about this should suffice: adjusting the queries would be difficult, if not impossible, and might complicate things further.
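
To illustrate why mixed formatting affects aggregation by "le", here is a toy Python model of label grouping. It is not how Prometheus evaluates queries; the sample values and both function names are invented for the example. Grouping on the raw label string keeps "1" and "1.0" apart, whereas grouping on the parsed boundary, shown only for contrast, merges them.

```python
from collections import defaultdict

# Hypothetical bucket samples: (le label value, per-second rate).
samples = [("1", 4.0), ("1.0", 6.0), ("+Inf", 10.0)]

def sum_by_label(samples):
    """Group on the raw label string, as an aggregation by label does."""
    groups = defaultdict(float)
    for le, value in samples:
        groups[le] += value
    return dict(groups)

def sum_by_numeric_bound(samples):
    """Group on the parsed bucket boundary instead of its spelling."""
    groups = defaultdict(float)
    for le, value in samples:
        groups[float(le)] += value
    return dict(groups)

print(sum_by_label(samples))          # three groups: "1", "1.0", "+Inf"
print(sum_by_numeric_bound(samples))  # two groups: 1.0 and inf
```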

See attached PRs for examples.

A downstream check was added to help surface such misconfigurations. An alert fires for configs that aren't enabled by default and that may need to be adjusted.

For more details, see https://docs.google.com/document/d/11c0Pr2-Zn3u3cjn4qio8gxFnu9dp0p9bO7gM45YKcNo/edit?tab=t.0#bookmark=id.f5p0o1s8vyjf
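
For local triage, a similar idea can be sketched in Python. This is a hypothetical heuristic, not the implementation of the downstream check or of the alert: it flags a pattern as possibly narrow when it matches the integer spelling of a bucket boundary but not the float spelling. The function name `narrow_for` and the candidate boundaries are assumptions made for the example.

```python
import re

def narrow_for(pattern: str, integer_bounds: list[int]) -> list[int]:
    """Return bounds whose integer spelling matches but whose float spelling does not."""
    narrow = []
    for bound in integer_bounds:
        old_form, new_form = str(bound), f"{bound}.0"
        if re.fullmatch(pattern, old_form) and not re.fullmatch(pattern, new_form):
            narrow.append(bound)
    return narrow

print(narrow_for(r"1|3", [1, 3, 5]))          # [1, 3]
print(narrow_for(r"(1|3)(\.0)?", [1, 3, 5]))  # []
```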

links to

openshift/cluster-authentication-operator#752: MON-4129: adjust Prometheus classic histograms 'le' related selectors in relabel config to accommodate the update to Prometheus v3

openshift/cluster-kube-apiserver-operator#1784: MON-4129: adjust Prometheus classic histograms 'le' related selectors in rules defs and relabel config to accommodate the update to Prometheus v3

openshift/cluster-kube-apiserver-operator#1815: MON-4129: slos: accomodate for Prometheus v3 "le" normalization

openshift/cluster-kube-apiserver-operator#1816: MON-4129: slos: move to float buckets as Prometheus v3 normalized integer->float

openshift/cluster-kube-apiserver-operator#1817: MON-4129: revert https://github.com/openshift/cluster-kube-apiserver-operator/pull/1784

openshift/cluster-openshift-apiserver-operator#611: MON-4129: adjust Prometheus classic histograms 'le' related selectors in relabel config to accommodate the update to Prometheus v3

openshift/hypershift#5508: MON-4129: adjust Prometheus classic histograms 'le' related selectors in relabel config to accommodate the update to Prometheus v3

openshift/microshift#4621: NO-ISSUE: rebase-main-4.19.0-0.nightly-2025-02-28-132549_amd64-2025-02-28_arm64-2025-03-02

openshift/microshift#4625: Rebase main 4.19.0 0.nightly 2025 02 28 132549 amd64 2025 02 28 arm64 2025 03 02
