Type: Bug
Resolution: Done-Errata
Priority: Undefined
Fix Version/s: 4.13.z
Affects Version/s: 4.10
Component/s: kube-apiserver
Labels:
- api

Regression:
No
Blocked:
False
Blocked Reason:
None
Release Note Text:

* Previously, observability dashboards used expensive queries to show data, which caused frequent timeouts on clusters with a large number of nodes. With this release, observability dashboards use precalculated recording rules to ensure reliability on clusters with a large number of nodes.
Release Note Type:
Bug Fix
Release Note Status:
In Progress
Target Version:

4.13.z
Target Backport Versions:

4.12.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:
PX Priority Data:

This is a clone of issue ~~OCPBUGS-3986~~. The following is the description of the original issue:
—
Description of problem:

A customer has reported that the Thanos querier pods are OOM-killed when loading the API performance dashboard with large time ranges (e.g. >= 1 week).

Version-Release number of selected component (if applicable):

4.10

How reproducible:

Always for the customer

Steps to Reproduce:

1. Open the "API performance" dashboard in the admin console.
2. Select a time range of 2 weeks.

Actual results:

The dashboard fails to refresh and the thanos-query pods are killed.

Expected results:

The dashboard loads without error.

Additional info:

The issue arises for the customer because they have very large clusters (hundreds of nodes), which generate a large volume of metrics.
In practice, the queries executed by the dashboard are costly because they access a large number of series (probably tens of thousands or more). To be more efficient, the "upstream" dashboard from kubernetes-monitoring/kubernetes-mixin uses recording rules [1] instead of raw queries. This slightly reduces granularity (one can only distinguish between read and write API requests), but it is the only solution that avoids overloading the Thanos query endpoint.

[1] https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/05a58f765eda05902d4f7dd22098a2b870f7ca1e/dashboards/apiserver.libsonnet#L50-L75
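The trade-off described above can be illustrated with a toy Python model. This is only a sketch, not the actual PromQL or the rules shipped by the operator: the label names, the read/write verb grouping, and all function names are hypothetical. It compares how many samples a dashboard query touches when it aggregates raw series at query time versus when it reads series that a recording rule has already aggregated.

```python
from collections import defaultdict

# Hypothetical verb grouping, for illustration only.
READ_VERBS = {"GET", "LIST"}


def verb_class(verb):
    """Collapse an HTTP verb into the coarse read/write label."""
    return "read" if verb in READ_VERBS else "write"


def make_raw_samples(instances, resources, verbs, steps):
    """Build synthetic raw series: {(instance, resource, verb): [value per step]}."""
    raw = {}
    for i in range(instances):
        for r in range(resources):
            for v in verbs:
                raw[(f"apiserver-{i}", f"resource-{r}", v)] = [1.0] * steps
    return raw


def record(raw, steps):
    """Pre-aggregate by read/write, as a recording rule would at each evaluation."""
    recorded = defaultdict(lambda: [0.0] * steps)
    for (_, _, verb), values in raw.items():
        bucket = recorded[verb_class(verb)]
        for t, value in enumerate(values):
            bucket[t] += value
    return dict(recorded)


def query_raw(raw, steps):
    """Dashboard query on raw series: every sample is read at query time."""
    touched = 0
    totals = defaultdict(lambda: [0.0] * steps)
    for (_, _, verb), values in raw.items():
        bucket = totals[verb_class(verb)]
        for t, value in enumerate(values):
            bucket[t] += value
            touched += 1
    return dict(totals), touched


def query_recorded(recorded):
    """Dashboard query on recorded series: only the aggregated samples are read."""
    touched = sum(len(values) for values in recorded.values())
    return recorded, touched


verbs = ["GET", "LIST", "POST", "PUT", "PATCH", "DELETE"]
steps = 100  # number of points in the selected time range
raw = make_raw_samples(instances=3, resources=50, verbs=verbs, steps=steps)
recorded = record(raw, steps)

raw_result, raw_touched = query_raw(raw, steps)
rec_result, rec_touched = query_recorded(recorded)

print("samples touched (raw):     ", raw_touched)
print("samples touched (recorded):", rec_touched)
print("same read/write totals:    ", raw_result == rec_result)
```

In this model both queries return the same read/write totals, but the recorded variant reads far fewer samples per dashboard load. The aggregation cost is paid once per rule evaluation rather than on every load, and the per-resource and per-verb detail is no longer available to the dashboard.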

blocks

OCPBUGS-32241 PromQL queries of the "API Performance" dashboard can overload Thanos queriers

Closed

clones

OCPBUGS-3986 PromQL queries of the "API Performance" dashboard can overload Thanos queriers

Closed

is blocked by

OCPBUGS-3986 PromQL queries of the "API Performance" dashboard can overload Thanos queriers

Closed

is cloned by

OCPBUGS-32241 PromQL queries of the "API Performance" dashboard can overload Thanos queriers

Closed

links to

KCS 7026659: Thanos querier pods gets OOMKilled when loading the API performance dashboard with large time ranges in RHOCP 4

openshift/cluster-kube-apiserver-operator#1611: OCPBUGS-25922: [release-4.13] dashboard: use recording rules for most metrics

RHBA-2024:2047 OpenShift Container Platform 4.13.z bug fix update


Assignee: Vadim Rutkovsky

Reporter: OpenShift Prow Bot

QA Contact: Deepak Punia (Inactive)

Votes: 0

Watchers: 10

Created: 2024/01/02 9:03 AM

Updated: 2024/05/02 4:37 PM

Resolved: 2024/05/02 4:37 PM
