Epic Goal
- Provide two top-level Prometheus metrics: one for aggregated operator health and one for aggregated operator upgrade status. Together they answer two questions: is my cluster upgrade completed, and are my operators healthy?
Why is this important?
- As a customer with multiple operators installed, I want to monitor the cluster as a whole without monitoring the individual operators, so that I do not have to continuously change my alerting rules for second-level operators.
Scenarios
- As a customer with OCP, ODF/OCS, and CNV deployed, I'd like to monitor only the base cluster (CVO) plus one additional metric that measures the health and upgrade status of all second-level operators.
Acceptance Criteria
- CI - MUST be running successfully with tests automated
- Release Technical Enablement - Provide necessary release enablement details and documents.
- ...
Dependencies (internal and external)
- ...
Previous Work (Optional):
- Operators have conditions that expose their health, and we have metrics about upgrade phases. The delta of this epic is to expose both as standard Prometheus metrics, in order to feed more data into our consistent Prometheus-based alerting.
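A minimal sketch of the aggregation described above, for discussion only. It assumes each operator's state has already been reduced to two booleans. The `OperatorStatus` type, its fields, and the metric names `example_operators_healthy` and `example_operators_upgrade_completed` are hypothetical and not part of any existing API. The output follows the Prometheus text exposition format.

```python
from dataclasses import dataclass


@dataclass
class OperatorStatus:
    # Hypothetical per-operator summary; field names are illustrative only.
    name: str
    healthy: bool    # assumed to be derived from the operator's health condition
    upgrading: bool  # assumed to be derived from the operator's upgrade phase


def aggregate(operators):
    """Collapse per-operator state into two gauge values (1 or 0).

    With an empty list both gauges are 1, because there is nothing
    unhealthy and nothing upgrading.
    """
    all_healthy = all(op.healthy for op in operators)
    upgrade_done = not any(op.upgrading for op in operators)
    return {
        "example_operators_healthy": int(all_healthy),
        "example_operators_upgrade_completed": int(upgrade_done),
    }


def render(metrics):
    """Render gauges in the Prometheus text exposition format."""
    lines = []
    for name, value in metrics.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    operators = [
        OperatorStatus("odf", healthy=True, upgrading=False),
        OperatorStatus("cnv", healthy=False, upgrading=True),
    ]
    print(render(aggregate(operators)), end="")
```

With gauges of this shape, an alerting rule can watch two series instead of one rule per second-level operator. How health and upgrade state are actually derived per operator is left open here.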
Open questions:
- ...
Done Checklist
- CI - CI is running, tests are automated and merged.
- Release Enablement <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
- DEV - Downstream build attached to advisory: <link to errata>
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Downstream documentation merged: <link to meaningful PR>
Relates to
- OPRUN-2376 Operator status condition for operator health (New)
- RFE-1585 Operators Managed by OLM can report that they are healthy (Accepted)