Type: Epic
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: quay
Labels:

Epic Name:
OpenShift Console alerting triggered by various Quay events
Epic Status:
To Do
Hierarchy Progress Bar:

40% To Do, 0% In Progress, 60% Done
Git Pull Request:
https://github.com/quay/quay-operator/pull/401

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

PX Impact Score:

Goal: Provide alerting capabilities for Operator-managed Quay and Clair deployments so that alerts are triggered within the OpenShift console and at the Kubernetes level.

Problem:

As of today, Quay does not offer built-in monitoring and alerting capabilities
The AppSRE team might want further alerting to trigger operational changes
OpenShift's built-in alerting capabilities do not include Quay-specific alerts, including alerts for the two OCP-related operators running on-cluster

Why is this important:

Aligned with two major initiatives on the Quay side: a) deeper integration into OCP and b) a focus on day-2 operations to address our key target persona

Dependencies (internal and external):

OCP monitoring and alerting stack
OCP console alerting capabilities and requirements
Quay Operator driving integration with the two items above

Prioritized epics + deliverables (in scope / not in scope):

As a Quay admin, I am alerted when thresholds are reached for:
- Outstanding builds
- Outstanding security scans
- Pull / push, authentication, build times, etc. (all from AppSRE: what is a 'healthy' Quay?)
As an OpenShift cluster admin, I can configure thresholds for various Quay metrics
Kubernetes events for Operator-related Kubernetes-level actions
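
One way the thresholds above could surface as OpenShift console alerts is through a PrometheusRule object picked up by the cluster monitoring stack. The following is a minimal sketch under stated assumptions, not the operator's actual implementation: the metric names (`quay_outstanding_builds`, `quay_outstanding_security_scans`), the default values, the `10m` duration, and the helper `build_prometheus_rule` are all hypothetical. Only the resource shape (`monitoring.coreos.com/v1`, kind `PrometheusRule`) comes from the Prometheus Operator.

```python
# Hypothetical defaults; real values would need input from AppSRE.
DEFAULT_THRESHOLDS = {
    "quay_outstanding_builds": 50,
    "quay_outstanding_security_scans": 1000,
}


def build_prometheus_rule(namespace, overrides=None):
    """Return a PrometheusRule manifest (as a dict) with one alert per metric."""
    # Admin-supplied overrides take precedence over the built-in defaults.
    thresholds = {**DEFAULT_THRESHOLDS, **(overrides or {})}
    rules = []
    for metric, limit in sorted(thresholds.items()):
        # quay_outstanding_builds -> QuayOutstandingBuildsHigh
        alert_name = "".join(part.capitalize() for part in metric.split("_")) + "High"
        rules.append({
            "alert": alert_name,
            "expr": f"{metric} > {limit}",
            "for": "10m",
            "labels": {"severity": "warning"},
            "annotations": {"message": f"{metric} has exceeded {limit} for 10 minutes."},
        })
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": "quay-alerts", "namespace": namespace},
        "spec": {"groups": [{"name": "quay.rules", "rules": rules}]},
    }
```

Calling `build_prometheus_rule("quay", {"quay_outstanding_builds": 20})` would lower only the build threshold and leave the other default in place, which is one way to support both hardcoded defaults and admin-configured values.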

The following RFE has been requested by a customer:

• As a Quay admin, I can configure thresholds for various Quay metrics
• As a Quay admin, I am alerted when thresholds for DB, CPU, memory, storage, etc. have been reached
• As an OpenShift cluster admin, I can configure thresholds for various Quay metrics
• As an OpenShift cluster admin, I can configure thresholds for various metrics of the Quay operators running on my cluster
• As an OpenShift cluster admin, I am alerted when the thresholds I have configured have been reached
• As an OpenShift console user, I am alerted about metrics that have an impact on my project, pod, etc.
• Quay should export metrics regarding DB, CPU, memory, storage, etc.
• Quay should export metrics for error rates: push errors, pull errors, tag errors, authentication errors, write errors, read errors, etc.
• Clustered Quay should export information about the whole cluster: node status, dead nodes, HA status, etc.
• Quay should export metrics about repository mirror failures, error rates and status
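
To illustrate what exporting error-rate metrics could look like on the wire, here is a small standard-library sketch that renders counters in the Prometheus text exposition format. The metric name `quay_errors_total`, the `operation` label, and the helpers `record_error` and `render_metrics` are assumptions made for this example; they are not taken from Quay's existing Prometheus endpoint.

```python
from collections import Counter

# Hypothetical operation names, taken from the error categories listed above.
ERROR_KINDS = ("push", "pull", "tag", "authentication")

# In-memory counters; a Counter returns 0 for kinds that were never recorded.
error_counts = Counter()


def record_error(kind):
    """Increment the error counter for one operation kind."""
    if kind not in ERROR_KINDS:
        raise ValueError(f"unknown error kind: {kind}")
    error_counts[kind] += 1


def render_metrics():
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP quay_errors_total Errors observed, by operation (hypothetical metric).",
        "# TYPE quay_errors_total counter",
    ]
    for kind in ERROR_KINDS:
        lines.append(f'quay_errors_total{{operation="{kind}"}} {error_counts[kind]}')
    return "\n".join(lines) + "\n"
```

A single counter with an `operation` label keeps error rates comparable across push, pull, tag, and authentication paths, so an alert expression can target one operation or all of them.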

Previous Work:

Quay Prometheus endpoint
Quay operators (3 of them)
AppSRE alerting efforts (QRE-26, QRE-127, ~~QRE-48~~, etc.)

Open questions:

Do we need or want to allow configuration of thresholds, or would they be hardcoded by us?
If the latter, how can we determine thresholds that are applicable to all different sizes of Quay and OCP clusters?
Which metrics are the most noteworthy (talk to AppSRE)?
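
As an illustration of the trade-off in the first two questions (not a proposed decision), one option is to ship defaults expressed relative to deployment size and let an admin override them with absolute values. The function names, the per-worker unit, and the example numbers below are hypothetical.

```python
def scaled_threshold(per_worker_limit, worker_count, minimum=1):
    """Scale a per-worker limit by deployment size, never going below `minimum`."""
    if worker_count < 0:
        raise ValueError("worker_count must be non-negative")
    return max(minimum, per_worker_limit * worker_count)


def resolve_threshold(name, worker_count, per_worker_defaults, admin_overrides=None):
    """Prefer an admin-configured absolute value; otherwise scale the built-in default."""
    overrides = admin_overrides or {}
    if name in overrides:
        return overrides[name]
    return scaled_threshold(per_worker_defaults[name], worker_count)
```

With a hypothetical default of 5 outstanding builds per worker, a deployment with 4 build workers would alert above 20, while an admin override of 7 would replace the scaled value entirely.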

is blocked by

PROJQUAY-280 Quay leverages the OCP built-in monitoring capabilities

Closed

PROJQUAY-508 Quay Build Service CPU abuse mitigation

Closed

PROJQUAY-217 Configure PagerDuty alert for unhealthy Clair workers

Closed

is documented by

PROJQUAY-1840 Console Alerting docs

Closed

is related to

PROJQUAY-1484 Quay/Clair Default Alerting Rules

Assignee: Syed Ahmed

Reporter: Dirk Herrmann (Inactive)

QA Contact: Dongbo Yan

Votes: 1

Watchers: 8

Created: 2020/02/13 5:35 AM

Updated: 2025/06/14 3:07 PM
