  1. OpenShift Bugs
  2. OCPBUGS-3986

PromQL queries of the "API Performance" dashboard can overload Thanos queriers


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Undefined
    • Fix Version: 4.14.0
    • Affects Version: 4.10
    • Component: kube-apiserver
    • Release Note Text:

      * Previously, observability dashboards used expensive queries to show data which caused frequent timeouts on clusters with a large number of nodes. With this release, observability dashboards use recording rules that are precalculated to ensure reliability on clusters with a large number of nodes. (link:https://issues.redhat.com/browse/OCPBUGS-3986[*OCPBUGS-3986*])
    • Release Note Type: Bug Fix
    • Status: Done

      Description of problem:

      A customer has reported that the Thanos querier pods are OOM-killed when loading the "API Performance" dashboard with large time ranges (e.g. >= 1 week).

      Version-Release number of selected component (if applicable):

      4.10

      How reproducible:

      Always for the customer

      Steps to Reproduce:

      1. Open the "API Performance" dashboard in the admin console.
      2. Select a time range of 2 weeks.

      Actual results:

      The dashboard fails to refresh and the thanos-query pods are OOM-killed.

      Expected results:

      The dashboard loads without error.

      Additional info:

      The issue arises for the customer because they have very large clusters (hundreds of nodes) which generate a very large number of metrics.
      In practice, the queries executed by the dashboard are costly because they touch a very large number of series (probably tens of thousands or more). To make the dashboard more efficient, the "upstream" dashboard from kubernetes-monitoring/kubernetes-mixin uses recording rules [1] instead of raw queries. While this slightly decreases accuracy (one can only distinguish between read and write API requests), it is the only practical way to avoid overloading the Thanos query endpoint.
      
      [1] https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/05a58f765eda05902d4f7dd22098a2b870f7ca1e/dashboards/apiserver.libsonnet#L50-L75
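      For illustration, a recording rule in the spirit of the upstream mixin might look like the following (the rule name and label matchers here are illustrative, adapted from the kubernetes-mixin; the exact rules shipped in OpenShift may differ):

      ```yaml
      # Illustrative Prometheus recording rule, modeled on the
      # kubernetes-mixin apiserver rules; names and matchers are assumptions.
      groups:
        - name: apiserver-requests.rules
          rules:
            # Pre-aggregate the read-request rate so the dashboard queries
            # one small, precalculated series set instead of evaluating a
            # rate() over thousands of raw apiserver_request_total series.
            - record: code:apiserver_request_total:rate5m
              expr: |
                sum by (code) (
                  rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[5m])
                )
      ```

      A dashboard panel can then query the precalculated series (e.g. `code:apiserver_request_total:rate5m`) directly, which the Thanos queriers can serve cheaply even over multi-week ranges.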


              Vadim Rutkovsky (vrutkovs@redhat.com)
              Simon Pasquier (spasquie@redhat.com)
              Deepak Punia (Inactive)
              Votes: 2
              Watchers: 14