  1. Operator Runtime
  2. OPRUN-2364

Top-level OLM metrics (and alerts) for overall operator health and operator upgrade status


    • Epic
    • Resolution: Unresolved
    • Major
    • top-level-health-and-status-metrics
    • To Do

      Epic Goal

      • Provide two top-level Prometheus metrics: one for aggregated operator health and one for aggregated operator upgrade status. Together they should answer the questions: Is my cluster upgrade complete, and are my operators healthy?
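
The aggregation described in the goal could be sketched as follows. This is a minimal illustration, not the OLM implementation: the `OperatorStatus` type, the `UpgradeState` values, the operator names, and the "all operators must pass" rule are assumptions made for the example, since the epic does not define them.

```python
from dataclasses import dataclass
from enum import Enum


class UpgradeState(Enum):
    # Hypothetical states; the epic does not list the real upgrade phases.
    COMPLETE = "complete"
    IN_PROGRESS = "in_progress"
    FAILED = "failed"


@dataclass(frozen=True)
class OperatorStatus:
    name: str
    healthy: bool
    upgrade: UpgradeState


def aggregate_health(operators):
    """Return 1 if every operator is healthy, else 0 (an empty list yields 1)."""
    return 1 if all(op.healthy for op in operators) else 0


def aggregate_upgrade_complete(operators):
    """Return 1 only if every operator has finished upgrading, else 0."""
    return 1 if all(op.upgrade is UpgradeState.COMPLETE for op in operators) else 0


# Hypothetical operators, used only to exercise the two functions.
operators = [
    OperatorStatus("example-operator-a", healthy=True, upgrade=UpgradeState.COMPLETE),
    OperatorStatus("example-operator-b", healthy=True, upgrade=UpgradeState.IN_PROGRESS),
]

print(aggregate_health(operators), aggregate_upgrade_complete(operators))
```

With this rule a single unhealthy or still-upgrading operator flips the corresponding top-level value to 0, so alerting rules would not need to change as operators are added or removed.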

      Why is this important?

      • As a customer who has installed multiple operators, I want to monitor the cluster as a whole without monitoring the individual operators, so that I do not need to continuously change my alerting rules for second-level operators.

      Scenarios

      1. As a customer with OCP, ODF/OCS, and CNV deployed, I'd like to monitor only the base cluster (CVO) plus one additional metric that measures the health and upgrade status of all second-level operators.

      Acceptance Criteria

      • CI - MUST be running successfully with tests automated
      • Release Technical Enablement - Provide necessary release enablement details and documents.
      • ...

      Dependencies (internal and external)

      1. ...

      Previous Work (Optional):

      1. Operators already have conditions that expose their health, and metrics about upgrade phases already exist. The delta of this epic is to expose both elements as standard Prometheus metrics, feeding more data into our consistent Prometheus-based alerting.
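
Turning operator conditions into a standard Prometheus metric might look like the sketch below. It is illustrative only: the `Degraded` condition type, the operator names, and the metric name `example_operators_healthy` are hypothetical, and the output is written by hand in the Prometheus text exposition format rather than through a client library.

```python
def is_healthy(conditions):
    """Assumption: an operator is unhealthy when a 'Degraded' condition is 'True'."""
    return not any(
        c["type"] == "Degraded" and c["status"] == "True" for c in conditions
    )


def render_gauge(name, help_text, value):
    """Render one gauge sample in the Prometheus text exposition format."""
    return f"# HELP {name} {help_text}\n# TYPE {name} gauge\n{name} {value}\n"


# Hypothetical per-operator conditions, keyed by operator name.
conditions_by_operator = {
    "example-operator-a": [{"type": "Degraded", "status": "False"}],
    "example-operator-b": [{"type": "Degraded", "status": "True"}],
}

all_healthy = all(is_healthy(c) for c in conditions_by_operator.values())
output = render_gauge(
    "example_operators_healthy",  # hypothetical metric name
    "1 if all second-level operators are healthy, 0 otherwise.",
    int(all_healthy),
)
print(output)
```

A single gauge of this shape is enough for one alerting rule to cover every second-level operator.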

      Open questions:

      1. ...

      Done Checklist

      • CI - CI is running, tests are automated and merged.
      • Release Enablement <link to Feature Enablement Presentation>
      • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
      • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
      • DEV - Downstream build attached to advisory: <link to errata>
      • QE - Test plans in Polarion: <link or reference to Polarion>
      • QE - Automated tests merged: <link or reference to automated tests>
      • DOC - Downstream documentation merged: <link to meaningful PR>

              Unassigned
              fdeutsch@redhat.com Fabian Deutsch