OpenShift Console alerting triggered by various Quay events
Type: Epic
Resolution: Unresolved
Priority: Major
Status: To Do
Progress: 40% To Do, 0% In Progress, 60% Done
Goal: Provide alerting capabilities for Operator-managed Quay and Clair deployments so that alerts are triggered within the OpenShift console and at the Kubernetes level.
Problem:
- As of today, Quay has no built-in monitoring and alerting capabilities
- The AppSRE team might want further alerting to trigger operational changes
- OpenShift's built-in alerting capabilities do not include Quay-specific alerts, including alerts for the 2 OCP-related operators running on-cluster
Why is this important:
- Aligned with two major initiatives on the Quay side: a) deeper integration into OCP and b) a focus on day-2 ops to address our key target persona
Dependencies (internal and external):
- OCP monitoring and alerting stack
- OCP console alerting capabilities and requirements
- Quay Operator driving integration with the two items above
Prioritized epics + deliverables (in scope / not in scope):
- As a Quay admin I'm getting alerted if thresholds are reached for:
  - Outstanding builds
  - Outstanding security scans
  - Pull / push, authentication, build times, etc. (all from AppSRE / what is a 'healthy' Quay?)
- As an OpenShift cluster admin I can configure thresholds for various metrics of Quay
- Kubernetes events for Operator-related Kubernetes-level actions
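The threshold-based alerting described in the deliverables above could be sketched roughly as follows. This is a minimal illustration, not the Operator's actual rule format: the metric names (`outstanding_builds`, `outstanding_security_scans`), the limits, and the `Threshold` type are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str                # hypothetical metric name, not a real Quay metric
    limit: float               # alert fires when the observed value exceeds this
    severity: str = "warning"


def firing_alerts(samples, thresholds):
    """Return (metric, value, severity) for every threshold that is exceeded."""
    alerts = []
    for t in thresholds:
        value = samples.get(t.metric)
        # Metrics without a sample are skipped rather than treated as zero.
        if value is not None and value > t.limit:
            alerts.append((t.metric, value, t.severity))
    return alerts


# Assumed example limits; real values would come from AppSRE experience.
thresholds = [
    Threshold("outstanding_builds", 50),
    Threshold("outstanding_security_scans", 1000, "critical"),
]
samples = {"outstanding_builds": 72, "outstanding_security_scans": 310}

alerts = firing_alerts(samples, thresholds)
print(alerts)
```

In an actual deployment this comparison would more likely be expressed as alerting rules evaluated by the OCP monitoring stack; the sketch only shows the shape of the check.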
The following RFE has been requested by a customer:
• As a Quay admin I can configure thresholds for various metrics of Quay
• As a Quay admin I'm getting alerted if thresholds for DB, CPU, mem, storage, etc. have been reached
• As an OpenShift cluster admin I can configure thresholds for various metrics of Quay
• As an OpenShift cluster admin I can configure thresholds for various metrics of Quay operators running on my cluster
• As an OpenShift cluster admin I'm getting alerted if the thresholds I've configured have been reached
• As an OpenShift console user I'm getting alerted for metrics which have an impact on my project, pod, etc.
• Quay should export metrics regarding DB, CPU, mem, storage, etc.
• Quay should export metrics for error rates: push errors, pull errors, tag errors, authentication errors, write errors, read errors, etc.
• Clustered Quay should export information about the whole cluster: node status, dead nodes, HA status, etc.
• Quay should export metrics about repository mirror failures, error rates and status
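To make the "export metrics" items above more concrete, here is a rough sketch of how an error counter might be rendered in the Prometheus text exposition format, using only the standard library. The metric name `quay_registry_errors_total` and its `operation` label are assumptions for illustration; they are not taken from Quay's existing Prometheus endpoint.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs. Label values are not
    escaped here, so this sketch assumes they contain no quotes, backslashes
    or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counter covering push and pull errors.
text = render_metric(
    "quay_registry_errors_total",
    "Registry errors by operation (hypothetical metric).",
    "counter",
    [({"operation": "push"}, 3), ({"operation": "pull"}, 1)],
)
print(text, end="")
```

A real exporter would normally use a Prometheus client library rather than formatting lines by hand; the point here is only what the scraped output for such a metric could look like.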
Previous Work:
- Quay Prometheus endpoint
- Quay operators (3 of them)
- AppSRE alerting efforts (QRE-26, QRE-127, QRE-48, etc.)
Open questions:
- Do we need / want to allow configuration of thresholds, or would they be hardcoded by us?
- If the latter, how can we figure out thresholds which are applicable to all the different sizes of Quay and OCP clusters?
- Identify the most noteworthy metrics (talk to AppSRE)?
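One possible middle ground for the first open question is to ship hardcoded defaults that an admin can selectively override. The sketch below illustrates that idea only; the metric names and default values are made up, and nothing here reflects a decision recorded in this epic.

```python
# Hypothetical defaults that would ship with the Operator.
DEFAULT_THRESHOLDS = {
    "outstanding_builds": 50.0,
    "outstanding_security_scans": 1000.0,
}


def effective_thresholds(overrides=None):
    """Merge admin-supplied overrides on top of the shipped defaults."""
    merged = dict(DEFAULT_THRESHOLDS)
    for metric, limit in (overrides or {}).items():
        # Reject unknown names so that a typo does not silently disable an alert.
        if metric not in DEFAULT_THRESHOLDS:
            raise KeyError(f"unknown metric: {metric}")
        merged[metric] = float(limit)
    return merged


# A larger cluster might raise only the build backlog limit.
print(effective_thresholds({"outstanding_builds": 200}))
```

With this shape, small installations get usable alerts without any configuration, while larger Quay or OCP clusters can adjust only the limits that do not fit them.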
- is blocked by
  - PROJQUAY-280 Quay leverages the OCP built-in monitoring capabilities (Closed)
  - PROJQUAY-508 Quay Build Service CPU abuse mitigation (Closed)
  - PROJQUAY-217 Configure PagerDuty alert for unhealthy Clair workers (Closed)
- is documented by
  - PROJQUAY-1840 Console Alerting docs (Closed)
- is related to
  - PROJQUAY-1484 Quay/Clair Default Alerting Rules (New)