-
Epic
-
Resolution: Unresolved
-
Normal
-
None
-
None
-
Operator status condition for operator health
-
False
-
False
-
To Do
Epic Goal
- Give operators a more precise and controllable way of communicating their health
Why is this important?
- Currently, operator health is defined by the health of OLM's top-level object, the ClusterServiceVersion, which in turn derives its health / readiness state from the health of all encapsulated components (ServiceAccounts, CRDs, Deployments)
- The above definition is too coarse and cannot reflect operator-specific health states that cannot be expressed by low-level Kubernetes component health at the operator deployment level (see scenarios)
- Expressing the health of complex operators via liveness and readiness probes will lead to undesired side effects, such as operator pods being restarted by Kubernetes
Scenarios
- An operator may create and track several resources post-deployment that aren't part of its own controller setup but constitute a larger add-on control plane. The overall health of the offering provided by the operator needs to take these resources into account
- An operator may depend on resources outside of the cluster to provide reliable service. The overall health of the offering provided by the operator needs to take the availability of these resources into account
- Cluster administrators expect operators to provide a reasonable abstraction for complex, multi-component services like CNV or ODF, and thus expect a single, overall health condition that reports the health of an entire add-on control plane
- Cluster fleet administrators expect OpenShift to be able to report an overall health status that includes the aggregate health status of all installed cluster extensions
Acceptance Criteria
- An operator author must be able to employ custom logic to denote overall operator health and readiness that is not tied to the health of the operator controller pods alone
Dependencies (internal and external)
- ...
Previous Work (Optional):
- Operator Status Condition for Upgrade readiness (OLM-1809)
Open questions:
- …
Done Checklist
- CI - CI is running, tests are automated and merged.
- Release Enablement <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
- DEV - Downstream build attached to advisory: <link to errata>
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Downstream documentation merged: <link to meaningful PR>
- is related to: OPRUN-2364 Top-level OLM metrics (and alerts) for over-all operator health and operator upgrade status (New)
- relates to: OPRUN-3197 [UPSTREAM] Extension health #390 (To Do)