- Feature
- Resolution: Won't Do
- Blocker
- None
What
Create standard operating procedures (SOPs) for addressing breaches of SLOs
Why
So that SRE can investigate the cause of an Alert without needing extensive Service-specific knowledge, and ultimately get the Service back into a good state before the SLO is breached.
How
An SOP, in the context of RHMI Monitoring & Alerting, is a document that gives a clear set of steps for troubleshooting why an Alert might be firing and for fixing the problem. SOPs should assume the reader has a high level of OpenShift & Kubernetes knowledge, but little, if any, service-specific knowledge. Any service-specific terms or concepts relevant to the Alert should be clearly defined, with an explanation of how they relate to the firing Alert. The SOP should also specify how to verify that the issue is fixed after taking remedial action.
An example SOP can be seen in the Appendix.
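The SOP requirements described under How could be checked mechanically when reviewing a draft. The following is a minimal sketch only: the `Sop` class, its field names, and the `missing_parts` helper are hypothetical and are not part of any RHMI tooling or schema. It treats troubleshooting, resolution, and verification steps as required, and the glossary of service-specific terms as optional, since not every Alert involves such terms.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Sop:
    """Hypothetical model of an SOP; field names are illustrative only."""

    alert_name: str
    # Service-specific term -> explanation of how it relates to the Alert.
    glossary: dict[str, str] = field(default_factory=dict)
    # Steps to find out why the Alert is firing.
    troubleshooting_steps: list[str] = field(default_factory=list)
    # Steps to fix the problem.
    resolution_steps: list[str] = field(default_factory=list)
    # Steps to confirm the issue is fixed after remedial action.
    verification_steps: list[str] = field(default_factory=list)


def missing_parts(sop: Sop) -> list[str]:
    """Return the names of required SOP parts that are still empty."""
    required = {
        "troubleshooting_steps": sop.troubleshooting_steps,
        "resolution_steps": sop.resolution_steps,
        "verification_steps": sop.verification_steps,
    }
    return [name for name, steps in required.items() if not steps]


# Usage: a draft that only has troubleshooting steps so far.
# "ExampleAlert" and the step text are made-up placeholders.
draft = Sop(
    alert_name="ExampleAlert",
    troubleshooting_steps=["Check the pods in the service namespace."],
)
print(missing_parts(draft))
```

A check like this only confirms that each part is present; whether the steps are clear enough for a reader without service-specific knowledge still needs a human review.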
Further Information:
- relates to
- ENTESB-13661 Camel K operator Level 4
- Closed