Type: Bug
Resolution: Done-Errata
Priority: Undefined
Affects Version: 4.10
Description of problem:
A customer has reported that the Thanos querier pods are OOM-killed when loading the "API Performance" dashboard with large time ranges (e.g. >= 1 week).
Version-Release number of selected component (if applicable):
4.10
How reproducible:
Always for the customer
Steps to Reproduce:
1. Open the "API Performance" dashboard in the admin console.
2. Select a time range of 2 weeks.
3.
Actual results:
The dashboard fails to refresh and the Thanos querier pods are OOM-killed.
Expected results:
The dashboard loads without error.
Additional info:
The issue arises for this customer because they run very large clusters (hundreds of nodes) that generate a large volume of metrics. In practice, the queries executed by the dashboard are costly because they touch a very large number of series (probably more than tens of thousands). To make the dashboard more efficient, the "upstream" dashboard from kubernetes-monitoring/kubernetes-mixin uses recording rules [1] instead of raw queries. While this slightly reduces accuracy (one can only distinguish between read and write API requests), it is the only practical way to avoid overloading the Thanos query endpoint. A sketch of this approach follows below.

[1] https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/05a58f765eda05902d4f7dd22098a2b870f7ca1e/dashboards/apiserver.libsonnet#L50-L75
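For illustration, a recording rule of this kind pre-aggregates apiserver_request_total into a small set of pre-computed series per verb class. This is a minimal sketch assuming the standard apiserver_request_total metric exposed by the kube-apiserver; the rule name, label values, and verb regex are illustrative assumptions, not the exact rules shipped by kubernetes-mixin:

  groups:
    - name: apiserver-request-aggregation.rules
      rules:
        # Hypothetical rule: collapse read requests (GET/LIST/WATCH)
        # into a single pre-computed series per HTTP code.
        - record: code:apiserver_request_total:rate5m
          expr: sum by (code) (rate(apiserver_request_total{verb=~"GET|LIST|WATCH"}[5m]))
          labels:
            verb: read
        # Hypothetical rule: everything else counts as a write.
        - record: code:apiserver_request_total:rate5m
          expr: sum by (code) (rate(apiserver_request_total{verb!~"GET|LIST|WATCH"}[5m]))
          labels:
            verb: write

A dashboard panel can then query code:apiserver_request_total:rate5m{verb="read"} directly, touching a handful of pre-computed series instead of tens of thousands of raw ones; this is also why the per-verb detail collapses to read vs. write, as noted above.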
- blocks: OCPBUGS-25922 PromQL queries of the "API Performance" dashboard can overload Thanos queriers (Closed)
- is cloned by: OCPBUGS-25922 PromQL queries of the "API Performance" dashboard can overload Thanos queriers (Closed)
- is duplicated by: OCPBUGS-22441 Thanos Querier high CPU and memory usage till OOM (Closed)
- is related to: OCPBUGS-22441 Thanos Querier high CPU and memory usage till OOM (Closed)
- links to: RHEA-2023:5006 rpm