Bug
Resolution: Done-Errata
Undefined
4.10
This is a clone of issue OCPBUGS-3986. The following is the description of the original issue:
—
Description of problem:
A customer reported that the Thanos querier pods are OOM-killed when loading the "API performance" dashboard with large time ranges (e.g. >= 1 week).
Version-Release number of selected component (if applicable):
4.10
How reproducible:
Always for the customer
Steps to Reproduce:
1. Open the "API performance" dashboard in the admin console.
2. Select a time range of 2 weeks.
Actual results:
The dashboard fails to refresh and the thanos-query pods are killed.
Expected results:
The dashboard loads without error.
Additional info:
The issue arises because the customer runs very large clusters (hundreds of nodes), which generate a large volume of metrics. In practice, the queries executed by the dashboard are costly because they access many series (probably more than tens of thousands). To be more efficient, the "upstream" dashboard from kubernetes-monitoring/kubernetes-mixin uses recording rules [1] instead of raw queries. Although this slightly reduces accuracy (one can only distinguish between read and write API requests), it's the only solution to avoid overloading the Thanos query endpoint.
[1] https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/05a58f765eda05902d4f7dd22098a2b870f7ca1e/dashboards/apiserver.libsonnet#L50-L75
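To give a rough sense of why raw queries over a 2-week range are so much heavier than queries over pre-aggregated recording-rule series, here is a back-of-the-envelope sketch. All inputs are assumptions chosen for illustration, not measurements from the customer's cluster: 50,000 raw series, a 30-second scrape and rule-evaluation interval, and 2 recorded series (one for read, one for write requests). The helper `samples_loaded` is hypothetical and ignores downsampling, caching, and query step.

```python
def samples_loaded(num_series: int, range_seconds: int, interval_seconds: int) -> int:
    """Rough number of stored samples a range query has to read.

    Assumes every series has one sample per interval over the whole range.
    """
    return num_series * (range_seconds // interval_seconds)


TWO_WEEKS = 14 * 24 * 3600  # time range selected in the dashboard, in seconds
INTERVAL = 30               # assumed scrape / rule-evaluation interval, in seconds

# Assumed series counts: tens of thousands of raw series versus
# two pre-aggregated series (read and write) from recording rules.
raw = samples_loaded(50_000, TWO_WEEKS, INTERVAL)
recorded = samples_loaded(2, TWO_WEEKS, INTERVAL)

print(f"raw query:      {raw:,} samples")
print(f"recorded query: {recorded:,} samples")
print(f"reduction:      {raw // recorded:,}x")
```

Under these assumptions the raw query touches about two billion samples, which is consistent with the querier running out of memory, while the recorded series stay in the tens of thousands of samples.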
- blocks
OCPBUGS-32241 PromQL queries of the "API Performance" dashboard can overload Thanos queriers
- Closed
- clones
OCPBUGS-3986 PromQL queries of the "API Performance" dashboard can overload Thanos queriers
- Closed
- is blocked by
OCPBUGS-3986 PromQL queries of the "API Performance" dashboard can overload Thanos queriers
- Closed
- is cloned by
OCPBUGS-32241 PromQL queries of the "API Performance" dashboard can overload Thanos queriers
- Closed
- links to
RHBA-2024:2047 OpenShift Container Platform 4.13.z bug fix update