-
Epic
-
Resolution: Unresolved
-
Add Argo CD Health checks for alerts
-
To Do
-
Epic Goal
- We currently have a PrometheusRule that alerts when an Argo CD application is out of sync. The goal of this Epic is to expand it to also alert on health status (Degraded, etc.).
Why is this important?
- Customers rely on OpenShift Monitoring to notify them proactively via alerts when resources are in a bad state. While alerting when applications are out of sync is useful, alerting on health state provides much-needed additional coverage.
- An alert on health status lets OpenShift Monitoring leverage existing and custom Argo CD health checks to alert when resources are in a bad state, even when those resources do not have their own PrometheusRules defined. This gives customers additional alerting with no additional effort.
Scenarios
- Application is in a Degraded state: raise a critical alert
- Application is Progressing for more than 10 minutes: raise a warning alert
- Application is in any other non-healthy state (not Healthy, Suspended, Degraded or Progressing): raise a warning alert. Degraded and Progressing are covered by the first two scenarios and are therefore excluded here.
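The three scenarios could be expressed as PrometheusRule alerts along the following lines. This is only a sketch, not the operator's actual rule: it assumes the `argocd_app_info` metric exported by Argo CD with its `health_status`, `name` and `namespace` labels. The rule name, namespace, group name, alert names and annotation text are hypothetical. The `for` durations on the first and third alerts are placeholders, since the scenarios only specify a duration (10 minutes) for Progressing.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argocd-health-alerts      # hypothetical name
  namespace: openshift-gitops     # assumed namespace of the Argo CD instance
spec:
  groups:
    - name: ArgoCDHealthAlerts    # hypothetical group name
      rules:
        # Scenario 1: Degraded raises a critical alert
        - alert: ArgoCDAppDegraded
          expr: argocd_app_info{health_status="Degraded"} > 0
          for: 5m                 # placeholder, not specified in the Epic
          labels:
            severity: critical
          annotations:
            summary: "Argo CD application {{ $labels.name }} is Degraded"
        # Scenario 2: Progressing for more than 10 minutes raises a warning
        - alert: ArgoCDAppProgressingTooLong
          expr: argocd_app_info{health_status="Progressing"} > 0
          for: 10m                # value from the scenario; flagged as arbitrary in the open questions
          labels:
            severity: warning
          annotations:
            summary: "Argo CD application {{ $labels.name }} has been Progressing for more than 10 minutes"
        # Scenario 3: any state other than Healthy, Suspended, Degraded or Progressing
        - alert: ArgoCDAppNotHealthy
          expr: argocd_app_info{health_status!~"Healthy|Suspended|Degraded|Progressing"} > 0
          for: 5m                 # placeholder, not specified in the Epic
          labels:
            severity: warning
          annotations:
            summary: "Argo CD application {{ $labels.name }} is in health state {{ $labels.health_status }}"
```

With Argo CD's built-in health statuses, the third expression would match applications reported as Missing or Unknown.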
Acceptance Criteria (Mandatory)
- CI - MUST be running successfully with tests automated
- Release Technical Enablement - Provide necessary release enablement details and documents.
- Alerts raised as per scenarios
Dependencies (internal and external)
- None
Previous Work (Optional):
- Work done on Argo CD Out-Of-Sync alert
Open questions:
- It would be good to solicit feedback from customers and the field about alert levels and the timing for Progressing (10 minutes was picked, but that is arbitrary).
- Do alert levels and timing need to be configurable in the operator?
Done Checklist
- Acceptance criteria are met
- Non-functional properties of the Feature have been validated (such as performance, resource, UX, security or privacy aspects)
- User Journey automation is delivered
- Support and SRE teams are provided with enough skills to support the feature in production environment