Type: Story
Resolution: Obsolete
Priority: Normal
Fix Version/s: OpenShift 4.14 Async, OpenShift 4.13 Async, OpenShift 4.15 Freeze
Affects Version/s: None
Component/s: Monitoring
Labels:
- stretch-goal

Story Points:
8
Blocked:
True
Blocked Reason:

Dependent on completion of https://issues.redhat.com/browse/MON-2851
Ready:
False

This is the first effort to solve the linked feature request.

Problem

We have several footguns in our stack: users can create resources that overload it and cause errors or unavailability.

We have seen:

- Queries that are too expensive, i.e. take too long to run
- Alert rules that take very long to evaluate
- Alert rules that are repeated for many namespaces
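
To make the last item concrete, here is a minimal sketch of how repeated rules could be spotted. It assumes rules are available as (namespace, expression) pairs and that copies differ only in a namespace="..." matcher; the function name, the placeholder, and the threshold of three namespaces are all hypothetical choices, not part of any existing tool.

```python
import re
from collections import defaultdict

# Matches an equality matcher on the namespace label, e.g. namespace="team-a".
# The \b keeps labels such as exported_namespace from matching.
_NAMESPACE_MATCHER = re.compile(r'\bnamespace\s*=\s*"[^"]*"')


def repeated_rules(rules, min_namespaces=3):
    """Group rule expressions that are identical apart from the namespace.

    rules: iterable of (namespace, expr) pairs (hypothetical input shape).
    Returns {normalized_expr: sorted list of namespaces} for every
    expression found in at least min_namespaces distinct namespaces.
    """
    seen = defaultdict(set)
    for namespace, expr in rules:
        # Replace the concrete namespace with a placeholder, then
        # collapse whitespace so formatting differences do not matter.
        normalized = _NAMESPACE_MATCHER.sub('namespace="<ns>"', expr)
        normalized = " ".join(normalized.split())
        seen[normalized].add(namespace)
    return {
        expr: sorted(namespaces)
        for expr, namespaces in seen.items()
        if len(namespaces) >= min_namespaces
    }
```

A result with many namespaces under one normalized expression suggests the rule could be written once without the namespace matcher, or aggregated by namespace.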

We have guards in place for some of these. A long-running query will time out (though it will still cause significant load). Other problems surface as strange, seemingly unrelated symptoms. Sometimes we struggle first to pinpoint the cause and then to explain to users why they can't do something when they (seemingly) followed supported procedures.

The first step to improve this is to gain easy insight into these bottlenecks and to be able to alert on them.

https://github.com/thanos-io/thanos/pull/5741 is an example where a new metric is added to thanos-querier that should be useful.

We need to identify and fill gaps in our metric set and then leverage these metrics to notify users before things break.
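
As one illustration of the kind of check such a notification could be built on, the sketch below flags rule groups whose last evaluation consumed most of their evaluation interval. It relies on two metrics Prometheus already exposes, prometheus_rule_group_last_duration_seconds and prometheus_rule_group_interval_seconds. The RuleGroupStats type, the function name, and the 0.8 threshold are assumptions made for the example, not agreed values.

```python
from dataclasses import dataclass

# PromQL form of the same check. Both metrics are exported by Prometheus;
# the 0.8 threshold is an assumption for illustration.
SLOW_GROUP_EXPR = (
    "prometheus_rule_group_last_duration_seconds"
    " / prometheus_rule_group_interval_seconds > 0.8"
)


@dataclass
class RuleGroupStats:
    """Hypothetical container for one rule group's evaluation stats."""
    name: str
    last_duration_seconds: float
    interval_seconds: float


def slow_rule_groups(groups, threshold=0.8):
    """Return names of groups whose last evaluation took more than
    `threshold` of their evaluation interval."""
    return [
        g.name
        for g in groups
        if g.interval_seconds > 0
        and g.last_duration_seconds / g.interval_seconds > threshold
    ]


if __name__ == "__main__":
    sample = [
        RuleGroupStats("tenant-rules", 25.0, 30.0),
        RuleGroupStats("platform-rules", 1.0, 30.0),
    ]
    print(slow_rule_groups(sample))
```

A group that regularly crosses the threshold is at risk of missed evaluations, so warning at that point would reach users before rules start to fail.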

Docs

- Describe (new/improved) alerts relating to scalability
- Document scalability expectations

documents: MON-2851 Insight into scalability bottlenecks (Closed)

Assignee: Brian Burt
Reporter: Brian Burt
Votes: 0
Watchers: 1
Created: 2023/01/09 2:19 PM
Updated: 2024/02/19 7:48 PM
Resolved: 2024/02/19 7:48 PM
