  1. Red Hat Fuse
  2. ENTESB-14347

[Observability] Provide SOPs for SLOs


Details

    • Enhancement
    • Resolution: Won't Do
    • Blocker
    • 2021-M2
    • None
    • Camel-K

    Description

      What

      Create standard operating procedures (SOPs) for addressing breaches of SLOs.

      Why

      So that SREs can investigate the cause of an Alert without needing extensive service-specific knowledge, and ultimately return the Service to a good state before the SLO is breached.

      How

      A SOP, in the context of RHMI Monitoring & Alerting, is a document with a clear set of steps for troubleshooting why an Alert might be firing and for fixing the problem. SOPs should assume the reader has a high level of OpenShift & Kubernetes knowledge but little, if any, service-specific knowledge. Any service-specific terms or concepts relevant to the Alert should be clearly defined, together with an explanation of how they relate to the firing Alert. The SOP should specify how to verify that the issue is fixed after remedial action has been taken.
      An example SOP can be seen in the Appendix.
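
      As a rough illustration of the requirements above, a SOP could be modelled as a small record whose required parts (troubleshooting, remediation, verification) are checked for completeness. This is only a sketch: the class, field names, and the alert name are hypothetical and do not reflect any actual RHMI template or schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sop:
    """Hypothetical SOP skeleton; field names are illustrative only."""
    alert_name: str
    # Service-specific terms the reader is not assumed to know, with definitions.
    glossary: Dict[str, str] = field(default_factory=dict)
    # Steps to find out why the Alert is firing.
    troubleshooting_steps: List[str] = field(default_factory=list)
    # Steps to fix the problem.
    remediation_steps: List[str] = field(default_factory=list)
    # Steps to confirm the issue is fixed after remediation.
    verification_steps: List[str] = field(default_factory=list)


def missing_sections(sop: Sop) -> List[str]:
    """Return the names of required sections that are still empty."""
    required = {
        "troubleshooting_steps": sop.troubleshooting_steps,
        "remediation_steps": sop.remediation_steps,
        "verification_steps": sop.verification_steps,
    }
    return [name for name, content in required.items() if not content]


# "ExampleAlert" is a placeholder, not a real alert name.
draft = Sop(alert_name="ExampleAlert",
            troubleshooting_steps=["Check the status of the service pods"])
print(missing_sections(draft))
```

      A check like this could run in review to flag a draft SOP that lacks remediation or verification steps before it is linked from an Alert.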

      Further Information:

      Attachments

        Issue Links

          Activity

            People

              astefanu@redhat.com Antonin Stefanutti
              dffrench@redhat.com David Ffrench
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved: