  1. Red Hat Advanced Cluster Management
  2. ACM-2164

Service monitoring and alerting for Hypershift addon


      Epic Goal

      • Define SLIs for the components that will be used by Service Delivery.
      • Instrument the code for the agreed-upon SLIs and expose them as metrics.
      • Define alerting rules for the SLIs.
      • Determine a starting SLO based on an aggregation of our SLIs.
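
      The instrumentation goal could look roughly like the sketch below, which renders binary health gauges in the Prometheus text exposition format using only the Python standard library. The metric name hypershift_addon_component_unhealthy and the component label are hypothetical, chosen for illustration only. A real implementation would more likely use a Prometheus client library and metric names agreed with Service Delivery.

```python
def render_gauge(name, help_text, samples):
    """Render one gauge family in Prometheus text exposition format.

    `samples` maps a component label value to a numeric sample value.
    Label values are assumed to need no escaping in this sketch.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for component, value in sorted(samples.items()):
        lines.append(f'{name}{{component="{component}"}} {value}')
    return "\n".join(lines) + "\n"


# Hypothetical sample values: 0 = healthy, 1 = unhealthy.
health = {"addon-controller": 0, "hypershift-operator": 0, "external-dns": 1}

print(render_gauge(
    "hypershift_addon_component_unhealthy",  # hypothetical metric name
    "1 if the component pod is unhealthy, otherwise 0.",
    health,
))
```

      Keeping every signal as a plain gauge would let alerting rules be written against the metric values directly, which may reduce how much code has to change.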

      Why is this important?

      • Meet the SLA requirements that will be established with Service Delivery (SD).
      • Service monitoring and alerting are essential for quick root-cause analysis (RCA) and resolution of service disruptions across environments.

      Scenarios

      Metric type   | PagerDuty | Name                   | Description                                                         | PagerDuty gate
      Bool (binary) | YES       | Addon controller       | Returns 0 if the pod is healthy, otherwise 1                        | Skip the PagerDuty call if Installation is 1
      Bool (binary) | YES       | Hypershift-operator    | Returns 0 if the pod is healthy, otherwise 1                        | Skip the PagerDuty call if Installation is 1
      Bool (binary) | YES       | External DNS           | Returns 0 if the pod is healthy, otherwise 1                        | Skip the PagerDuty call if Installation is 1
      Count         | NO        | Restart count (24hrs)  | Number of restarts in the last 24 hours                             | Threshold to be determined
      Bool (binary) | NO        | Installation / Upgrade | Returns 1 if an installation or upgrade is in progress, otherwise 0 | -
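
      The gating described in the scenarios can be sketched as follows: a component pages only when its health signal is 1 and no installation or upgrade is in progress. This is a minimal illustration in Python, not the addon's implementation. The Signals class and the components_to_page function are hypothetical names. The restart count is left out of the paging decision because the ticket marks it as non-paging and has not yet chosen a threshold.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical snapshot of the binary signals from the scenarios."""
    addon_controller: int     # 1 = pod unhealthy, 0 = healthy
    hypershift_operator: int  # 1 = pod unhealthy, 0 = healthy
    external_dns: int         # 1 = pod unhealthy, 0 = healthy
    installation: int         # 1 = installation/upgrade in progress


def components_to_page(s):
    """Return the names of the components that should trigger a page."""
    # Gate: skip every PagerDuty call while an installation is running.
    if s.installation == 1:
        return []
    candidates = [
        ("Addon controller", s.addon_controller),
        ("Hypershift-operator", s.hypershift_operator),
        ("External DNS", s.external_dns),
    ]
    return [name for name, value in candidates if value == 1]


# External DNS is unhealthy and no installation is running.
print(components_to_page(Signals(0, 0, 1, 0)))
# The same failure during an upgrade is suppressed.
print(components_to_page(Signals(0, 0, 1, 1)))
```

      In a Prometheus-based setup, the same gate would probably live in the alerting rule expression rather than in application code.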

      Acceptance Criteria

      • CI - MUST be running successfully with tests automated
      • Release Technical Enablement - Provide necessary release enablement details and documents.

      Dependencies (internal and external)

      1. Prometheus

      Previous Work (Optional):

      1. Hypershift addon document

      Open questions:

      1. Is there a set of SLI signals that Service Delivery requires or suggests?
      2. How many of the signals can be implemented as rules alone (no code change required)?

      Done Checklist

      • CI - CI is running, tests are automated and merged.
      • Release Enablement <link to Feature Enablement Presentation>
      • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
      • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
      • DEV - Downstream build

              rokejungrh Roke Jung
              jpacker@redhat.com Joshua Packer
              Juliana Hsu Juliana Hsu (Inactive)
              Roke Jung Roke Jung
              David Huynh David Huynh
              Joydeep Banerjee Joydeep Banerjee