Details
- Bug
- Resolution: Done-Errata
- Undefined
- 4.10
- False
- Bug Fix
- Done
Description
Description of problem:
A customer has reported that the Thanos querier pods are OOM-killed when loading the API performance dashboard with large time ranges (e.g. >= 1 week).
Version-Release number of selected component (if applicable):
4.10
How reproducible:
Always for the customer
Steps to Reproduce:
1. Open the "API performance" dashboard in the admin console.
2. Select a time range of 2 weeks.
Actual results:
The dashboard fails to refresh and the thanos-query pods are killed.
Expected results:
The dashboard loads without error.
Additional info:
The customer hits this issue because they run very large clusters (hundreds of nodes) that generate a large volume of metrics. In practice, the queries executed by the dashboard are costly because they touch a large number of series (probably more than tens of thousands). To be more efficient, the "upstream" dashboard from kubernetes-monitoring/kubernetes-mixin uses recording rules [1] instead of raw queries. Although this slightly reduces accuracy (one can only distinguish between read and write API requests), it's the only solution to avoid overloading the Thanos query endpoint.
[1] https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/05a58f765eda05902d4f7dd22098a2b870f7ca1e/dashboards/apiserver.libsonnet#L50-L75
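To illustrate why pre-aggregated recording rules are cheaper to query than raw series, here is a minimal Python sketch. It is not the actual rule set: the label names, the read/write verb grouping, and the synthetic series counts are all assumptions chosen for illustration, and real clusters will have very different cardinalities.

```python
from collections import defaultdict
from itertools import product

# Assumed verb grouping for illustration only; the real recording rules
# may classify verbs differently.
READ_VERBS = {"GET", "LIST"}
WRITE_VERBS = {"POST", "PUT", "PATCH", "DELETE"}


def verb_class(verb):
    """Map an API verb to 'read' or 'write'; return None for other verbs."""
    if verb in READ_VERBS:
        return "read"
    if verb in WRITE_VERBS:
        return "write"
    return None


def synthetic_raw_series(instances, resources, verbs, codes, rate=1.0):
    """Build hypothetical raw series keyed by (instance, resource, verb, code)."""
    return {
        labels: rate
        for labels in product(range(instances), resources, verbs, codes)
    }


def pre_aggregate(raw_series):
    """Mimic a recording rule: sum rates by (verb class, code) ahead of time."""
    recorded = defaultdict(float)
    for (_instance, _resource, verb, code), rate in raw_series.items():
        cls = verb_class(verb)
        if cls is None:
            continue  # verbs outside both groups are dropped in this sketch
        recorded[(cls, code)] += rate
    return dict(recorded)


# Hypothetical, deliberately tiny cluster.
raw = synthetic_raw_series(
    instances=3,
    resources=["pods", "nodes", "secrets", "configmaps"],
    verbs=["GET", "LIST", "POST", "PUT", "PATCH", "DELETE", "WATCH"],
    codes=["200", "500"],
)
recorded = pre_aggregate(raw)

print(f"{len(raw)} raw series -> {len(recorded)} recorded series")
```

A dashboard query over the raw data has to load every raw series for the whole time range, whereas a query over the recorded data only loads the handful of pre-aggregated series. The trade-off is the one described above: per-verb and per-resource detail is lost, leaving only the read/write split.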
Issue Links
- is duplicated by: OCPBUGS-22441 Thanos Querier high CPU and memory usage till OOM (Closed)
- is related to: OCPBUGS-22441 Thanos Querier high CPU and memory usage till OOM (Closed)
- links to: RHEA-2023:5006 rpm