OpenShift Monitoring / MON-2851

Insight into scalability bottlenecks


Details

    • Type: Epic
    • Resolution: Obsolete
    • Priority: Normal
    • scalability insight
    • MON-3159 Technical Debt
    • Status: To Do

    Description

      This is the first effort to solve the linked feature request.

      Problem

      We have several footguns in our stack: users can create resources that over-tax it and cause errors or unavailability.

      We have seen:

      • Queries that are too expensive, i.e. that take too long to execute.
      • Alert rules that take very long to evaluate.
      • Alert rules that are repeated for many namespaces (see the sketch after this list).
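
      As an illustration of the last point, a user-defined PrometheusRule that gets stamped into every application namespace multiplies rule-evaluation load on the user workload monitoring stack. The object below is only a minimal sketch of that pattern; the names, the expression, and the threshold are illustrative and not taken from a real incident.

        apiVersion: monitoring.coreos.com/v1
        kind: PrometheusRule
        metadata:
          name: team-alerts        # illustrative name
          namespace: team-a        # the same object is copied into team-b, team-c, ...
        spec:
          groups:
            - name: team-alerts
              rules:
                - alert: HighErrorRate
                  # Cheap for one namespace, but the cost adds up when the identical
                  # rule is evaluated for dozens of namespaces.
                  expr: sum(rate(http_requests_total{code=~"5.."}[5m])) by (pod) > 0
                  for: 15m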

      We have guards in place for some of these. A long-running query will time out (though it will still cause significant load); a sketch of that guard follows below. Other issues cause strange, seemingly unrelated symptoms. Sometimes we struggle first to pinpoint the issue and then to communicate to users why they can't do something, even though they (seemingly) followed supported procedures.
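
      For context, the timeout guard corresponds to the --query.timeout flag that Prometheus and Thanos Querier expose. The snippet below is a minimal sketch of how it shows up on a querier container, assuming the upstream defaults; the actual flags and values in our deployments may differ.

        # Container args of a querier, sketched with upstream default values.
        containers:
          - name: thanos-query
            args:
              - query
              - --query.timeout=2m         # long-running queries are cancelled after 2 minutes
              - --query.max-concurrent=20  # bounds how many queries run at once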

      The first step to improving this is to gain easy insight into these bottlenecks and to be able to alert on them.

      https://github.com/thanos-io/thanos/pull/5741 is an example: it adds a new metric to thanos-querier that should be useful here.

      We need to identify and fill gaps in our metric set and then leverage these metrics to notify users before things break.
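
      One way to leverage such metrics is an alerting rule that fires when a rule group spends most of its evaluation interval evaluating. The rule below is a minimal sketch built on the existing prometheus_rule_group_last_duration_seconds and prometheus_rule_group_interval_seconds metrics; the 0.8 ratio, the alert name, and the severity are illustrative placeholders, not a proposal for concrete thresholds.

        apiVersion: monitoring.coreos.com/v1
        kind: PrometheusRule
        metadata:
          name: scalability-insight        # illustrative name
          namespace: openshift-monitoring
        spec:
          groups:
            - name: rule-evaluation-capacity
              rules:
                - alert: RuleGroupEvaluationNearInterval
                  # Fires when a rule group spends more than 80% of its interval
                  # evaluating, i.e. before evaluations start to be skipped.
                  expr: |
                    prometheus_rule_group_last_duration_seconds
                      / prometheus_rule_group_interval_seconds > 0.8
                  for: 30m
                  labels:
                    severity: warning
                  annotations:
                    summary: Rule group {{ $labels.rule_group }} spends more than 80% of its interval on evaluation.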


            People

              Assignee: Unassigned
              Reporter: Jan Fajerski (jfajersk@redhat.com)
              Junqi Zhao
              Votes: 0
              Watchers: 5
