Uploaded image for project: 'Red Hat Advanced Cluster Management'
  1. Red Hat Advanced Cluster Management
  2. ACM-27950

Add an alert when there is a spoke cluster that does not report the watchdog alert

XMLWordPrintable

    • Icon: Epic Epic
    • Resolution: Unresolved
    • Icon: Undefined Undefined
    • None
    • None
    • None
    • missing-cluster-metric-alert
    • Quality / Stability / Reliability
    • False
    • Hide

      None

      Show
      None
    • False
    • Not Selected
    • To Do

      OCP/Telco Definition of Done
      https://docs.google.com/document/d/1TP2Av7zHXz4_fmeX4q9HB0m9cqSZ4F6Jd4AiVoaF_2s/edit#heading=h.gaa58bzbvwde
      Epic Template descriptions and documentation.
      https://docs.google.com/document/d/14CUCEg6hQ_jpsFzJtWo29GfFVWmun2Uivrxq3_Fkgdg/edit
      ACM-wide Product Requirements (Top-level Epics)
      https://docs.google.com/document/d/1uIp6nS2QZ766UFuZBaC9USs8dW_I5wVdtYF9sUObYKg/edit

      *<--- Cut-n-Paste the entire contents of this description into your new
      Epic --->*

      Epic Goal

      Add an alert to identify clusters that are not reporting observability data,
      which can indicate either an issue with the observability stack or the connection to the cluster.

      This can be achieved by adding a metric that reports the list of clusters managed by ACM.
      Then creating an alert that check for each cluster if it also reports the `watchdog` alert, that indicates the health of the cluster alerting.

      Why is this important?

      This will allow us to let the Admin know in a short period of time if there is an issue with getting data from a managed cluster.
      It is a health check.

      Scenarios

      A cluster network is having issue  - An admin managing the cluster with ACM will be notified that the cluster has not reported data for the last X minutes.

      Acceptance Criteria

      ...

      Dependencies (internal and external)

      1. ...

      Previous Work (Optional):

      1. ...

      Open questions:

      Done Checklist

      • CI - CI is running, tests are automated and merged.
      • Release Enablement <link to Feature Enablement Presentation>
      • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub
        Issue>
      • DEV - Upstream documentation merged: <link to meaningful PR or GitHub
        Issue>
      • DEV - Downstream build attached to advisory: <link to errata>
      • QE - Test plans in Polarion: <link or reference to Polarion>
      • QE - Automated tests merged: <link or reference to automated tests>
      • DOC - Doc issue opened with a completed template. Separate doc issue
        opened for any deprecation, removal, or any current known
        issue/troubleshooting removal from the doc, if applicable.
      • Considerations were made for Extended Update Support (EUS)

              rh-ee-vmartini Vanessa Martini
              sradco Shirly Radco
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

                Created:
                Updated: