MON-3159 Technical Debt
Type: Epic
Status: To Do (NEW)
Resolution: Obsolete
Priority: Normal
Labels: scalability insight
This is the first effort to solve the linked feature request.
Problem
We have several footguns in our stack: users can create resources that over-tax it and cause errors or unavailability.
We have seen:
- queries that are too expensive, i.e. that take too long to run
- alert rules that take a very long time to evaluate
- alert rules that are repeated across many namespaces
We have guards in place for some of these: a long-running query will time out (though it still causes significant load). Other cases surface as strange, seemingly unrelated issues. We sometimes struggle first to pinpoint the cause and then to explain to users why they cannot do something, when they (seemingly) followed supported procedures.
The first step toward improving this is to gain easy insight into these bottlenecks and to be able to alert on them.
https://github.com/thanos-io/thanos/pull/5741 is an example of adding a new metric to thanos-querier that should be useful here.
We need to identify and fill gaps in our metric set and then leverage these metrics to notify users before things break.
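For illustration only, here is a minimal sketch of the kind of metric we could add, using the Prometheus client_golang library. The metric name querier_query_duration_seconds, the handler label, and the bucket layout are assumptions made for this sketch; they are not the actual change from the linked Thanos PR.

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// queryDuration is a hypothetical histogram for query evaluation time;
// the metric introduced in the linked Thanos PR has its own name and labels.
var queryDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "querier_query_duration_seconds",
		Help:    "Time taken to evaluate a query.",
		Buckets: prometheus.ExponentialBuckets(0.1, 2, 10), // 0.1s .. ~51s
	},
	[]string{"handler"},
)

func init() {
	prometheus.MustRegister(queryDuration)
}

// instrumentQuery wraps a query execution and records its duration,
// so that slow or expensive queries become visible and alertable.
func instrumentQuery(handler string, run func() error) error {
	start := time.Now()
	err := run()
	queryDuration.WithLabelValues(handler).Observe(time.Since(start).Seconds())
	return err
}

func main() {
	// Expose the metric for Prometheus to scrape. An alert rule on the
	// high-percentile latency of this histogram (e.g. p99 above 30s for
	// 10 minutes) could then notify users before the stack degrades.
	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":8080", nil)
}
```

With such a histogram exposed, an alert on its high-percentile latency gives us the early warning this epic asks for, instead of discovering expensive queries through seemingly unrelated symptoms.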
is documented by: OBSDOCS-182 Insight into scalability bottlenecks (Closed)