-
Story
-
Resolution: Done
-
Normal
-
ACM 2.9.0
-
5
-
False
-
None
-
False
-
-
ACM-1578 - Maintain the Observability Stack
-
-
-
Observability Sprint 2023-11, Observability Sprint 2023-15
-
No
Value Statement
The ACM compactor can become unhealthy for various reasons and stop running compactions. This could have disastrous consequences for the long-term health of the system. This story adds alerts that notify the OCP administrator when the compactor becomes unhealthy.
The alerts are modeled on how RHOBS monitors compactor health.
Specifically:
- ACMThanosCompactHalted, critical, [5m]: fires if the compactor has halted.
- ACMThanosCompactHighCompactionFailures, warning, [15m]: fires if the compaction failure rate is > 5%.
- ACMThanosCompactBucketHighOperationFailures, warning, [15m]: fires if the bucket operation failure rate is > 5%.
- ACMThanosCompactHasNotRun, warning: fires if the compactor has not uploaded anything in the last 24 hours. The alert also delays its execution by 6 hours the first time the rule is added (4 hours for receivers to create a block + 2 hours for the compactor to run).
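The alert conditions listed above can be sketched as plain threshold checks. This is only an illustrative model, not the actual rule definitions: the function names and inputs are hypothetical, the real alerts would be expressed as Prometheus rules over compactor metrics, and the [5m]/[15m] pending durations are not modeled here.

```python
# Illustrative model of the alert thresholds described in this story.
# All names below are hypothetical; the real alerts are Prometheus rules.

FAILURE_RATE_THRESHOLD_PCT = 5.0  # "> 5%" for compaction and bucket operation failures
UPLOAD_STALENESS_HOURS = 24.0     # "has not uploaded anything in last 24 hours"
INITIAL_DELAY_HOURS = 6.0         # 4 h for receivers to create a block + 2 h for compactor to run


def failure_rate_pct(failures: float, total: float) -> float:
    """Return failures as a percentage of total; 0 when nothing was attempted."""
    if total <= 0:
        return 0.0
    return 100.0 * failures / total


def high_failure_rate(failures: float, total: float) -> bool:
    """Condition shared by the two 'High...Failures' alerts: strictly above 5%."""
    return failure_rate_pct(failures, total) > FAILURE_RATE_THRESHOLD_PCT


def has_not_run(hours_since_last_upload: float, hours_since_rule_added: float) -> bool:
    """Condition for the 'HasNotRun' alert, including the initial 6-hour grace period."""
    if hours_since_rule_added < INITIAL_DELAY_HOURS:
        # Too early: the compactor has not had a chance to upload a block yet.
        return False
    return hours_since_last_upload > UPLOAD_STALENESS_HOURS
```

For example, under this model `high_failure_rate(6, 100)` is true while `high_failure_rate(5, 100)` is not, because the threshold is strict; `has_not_run(30, 3)` stays false because the rule was added less than 6 hours ago.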
jbanerje@redhat.com sberens@redhat.com - please review and comment.
Definition of Done for Engineering Story Owner (Checklist)
- ...
Development Complete
- The code is complete.
- Functionality is working.
- Any required downstream Docker file changes are made.
Tests Automated
- [ ] Unit/function tests have been automated and incorporated into the build.
- [ ] 100% automated unit/function test coverage for new or changed APIs.
Secure Design
- [ ] Security has been assessed and incorporated into your threat model.
Multidisciplinary Teams Readiness
- [ ] Create an informative documentation issue using the [Customer Portal_doc_issue template](https://github.com/stolostron/backlog/issues/new?assignees=&labels=squad%3Adoc&template=doc_issue.md&title=), and ensure doc acceptance criteria is met. Link the development issue to the doc issue.
- [ ] Provide input to the QE team, and ensure QE acceptance criteria (established between story owner and QE focal) are met.
Support Readiness
- [ ] The must-gather script has been updated.
- relates to ACM-8498 New feature: Compactor alerts for Multicluster Observability (Closed)