Red Hat Advanced Cluster Management / ACM-25527

Expand runbook with missing alarms of ACM Observability

      Note: The doc team updates the current version of the documentation and the
      two previous versions (n-2), but we address *only high-priority or
      customer-reported issues* for n-2 releases in support.
      Describe the changes in the doc and link to your dev story:

      1. - [ ] Mandatory: Add the required version to the Fix version/s field.

      ACM 2.14, 2.15

      2. - [ ] Mandatory: Choose the type of documentation change or review.

      • [ ] We need to update an existing topic
      • [ ] We need to add a new document to an existing section
      • [ ] We need a whole new section; this is a function not
        documented before and doesn't belong in any current section
      • [ ] We need an Operator Advisory review and approval
      • [ ] We need a z-Stream (Errata) Advisory and Release note for
        MCE and/or ACM

      3. - [ ] Mandatory:

      4. - [ ] Mandatory for GA content:

      • [ ] Add steps, the diff, known issue, and/or other important
        conceptual information in the following space:
      • [ ] Add the *required access level* (for example, *Cluster
        Administrator*) for the user to complete the task:
      • [ ] Add verification at the end of the task: how does the user
        verify success (a command to run or a result to see)?
      • [ ] Add link to dev story here:

      5. - [ ] Mandatory for bugs: 

      Alerts generated by OpenShift and its associated operators are publicly documented in the runbooks repository, but runbooks for several ACM Observability alerts are missing: https://github.com/openshift/runbooks/tree/master/alerts

      Telco partners and customers require these documented procedures as part of their Day-2 operations so that they know how to react to alarms. We therefore need help getting these alarms documented.
      The alarms in question are:

      • ACMMetricsCollectorFederationError
      • ACMMetricsCollectorForwardRemoteWriteError
      • ACMRemoteWriteError
      • ACMThanosCompactHalted
      • ACMUWLMetricsCollectorFederationError
      • ACMUWLMetricsCollectorForwardRemoteWriteError

      It would be great to have a runbook for each of these alerts, similar to: https://github.com/openshift/runbooks/blob/master/alerts/cluster-version-operator/ClusterVersionOperatorDown.md
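
      To track progress on this request, a small script could report which of the listed alerts still lack a runbook in a local checkout of the runbooks repository. The following is only a sketch: it assumes each runbook is a file named <AlertName>.md somewhere under the alerts directory (as in the ClusterVersionOperatorDown.md example linked above), and the function name missing_runbooks and the default path "alerts" are hypothetical.

```python
from pathlib import Path

# Alert names copied from the list in this issue.
ACM_ALERTS = [
    "ACMMetricsCollectorFederationError",
    "ACMMetricsCollectorForwardRemoteWriteError",
    "ACMRemoteWriteError",
    "ACMThanosCompactHalted",
    "ACMUWLMetricsCollectorFederationError",
    "ACMUWLMetricsCollectorForwardRemoteWriteError",
]


def missing_runbooks(alerts_dir, alert_names=ACM_ALERTS):
    """Return the alert names that have no <AlertName>.md under alerts_dir.

    Assumption: runbooks live at alerts/<some-directory>/<AlertName>.md,
    mirroring the example linked in this issue.
    """
    # Collect the base names (without ".md") of every Markdown file found.
    found = {path.stem for path in Path(alerts_dir).rglob("*.md")}
    # Preserve the order of the input list in the result.
    return [name for name in alert_names if name not in found]


if __name__ == "__main__":
    import sys

    # Hypothetical default: run from the root of a runbooks checkout.
    target = sys.argv[1] if len(sys.argv) > 1 else "alerts"
    for name in missing_runbooks(target):
        print(name)
```

      Run from the root of a checkout, the script prints one alert name per line for every runbook that is still missing; an empty output would mean all six alerts are covered.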

              Unassigned Unassigned
              skoksal@redhat.com Sarp Koksal