  1. Project Quay
  2. PROJQUAY-288

OpenShift Console alerting triggered by various Quay events


    • To Do
    • 40% To Do, 0% In Progress, 60% Done

      Goal: Provide alerting capabilities for Operator-managed Quay and Clair deployments so that alerts are triggered within the OpenShift console and at the Kubernetes level.

      Problem:

      • as of today Quay doesn't feature built-in monitoring and alerting capabilities
      • AppSRE team might want further alerting to trigger operational changes
      • OpenShift built-in alerting capabilities do not include Quay-specific alerts, nor alerts for the 2 OCP-related operators running on-cluster

      Why is this important:

      • aligned to two major initiatives on the Quay side: a) deeper integration into OCP and b) focus on day-2 operations to address our key target persona

      Dependencies (internal and external):

      • OCP monitoring and alerting stack
      • OCP console alerting capabilities and requirements
      • Quay Operator driving integration with the two items above

      Prioritized epics + deliverables (in scope / not in scope):

      • As a Quay admin I'm getting alerted if thresholds have been reached for:
        • Outstanding Builds
        • Outstanding security scans
        • Pull / push, authentication, build times, etc. (all from AppSRE; what is a 'healthy' Quay?)
      • As an OpenShift cluster admin I can configure thresholds for various metrics of Quay
      • Kubernetes events for Operator-related Kubernetes-level actions
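To make the threshold-based stories above concrete, here is a minimal sketch of the check they imply: compare sampled values against configured limits and emit one alert per breach. The metric names (`outstanding_builds`, `outstanding_security_scans`), the limits, and the severities are placeholders chosen for illustration, not names or values exported by Quay. In an actual deployment this logic would more likely be expressed as alerting rules in the OCP monitoring stack than as application code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str    # hypothetical metric name, not an actual Quay metric
    limit: float   # alert when the sampled value exceeds this limit
    severity: str  # free-form label, e.g. "warning" or "critical"


def evaluate(samples: Dict[str, float], thresholds: List[Threshold]) -> List[str]:
    """Return one alert message per threshold whose limit is exceeded."""
    alerts = []
    for t in thresholds:
        value = samples.get(t.metric)
        # Metrics that are missing from the sample set are skipped, not alerted on.
        if value is not None and value > t.limit:
            alerts.append(f"[{t.severity}] {t.metric}={value} exceeds {t.limit}")
    return alerts


# Placeholder thresholds and sample values for illustration only.
thresholds = [
    Threshold("outstanding_builds", 20, "warning"),
    Threshold("outstanding_security_scans", 500, "warning"),
]
samples = {"outstanding_builds": 35, "outstanding_security_scans": 120}
alerts = evaluate(samples, thresholds)
```

With these sample values only the build backlog breaches its limit, so `alerts` holds a single message.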

      The following RFEs have been requested by a customer:

      • As a Quay admin I can configure thresholds for various metrics of Quay
      • As a Quay admin I'm getting alerted if thresholds for DB, CPU, mem, storage, etc. have been reached
      • As an OpenShift cluster admin I can configure thresholds for various metrics of Quay
      • As an OpenShift cluster admin I can configure thresholds for various metrics of Quay operators running on my cluster
      • As an OpenShift cluster admin I'm getting alerted if the thresholds I've configured have been reached
      • As an OpenShift console user I'm getting alerted for metrics which have an impact on my project, pod, etc.
      • Quay should export metrics regarding DB, CPU, mem, storage etc
      • Quay should export metrics for error rates - push errors, pull errors, tag errors, authentication errors, write errors, read errors etc
      • Clustered Quay should export information about the whole cluster -> node status, dead nodes, HA status etc
      • Quay should export metrics about repository mirror failures, error rates and status
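As an illustration of how exported error metrics could feed an alert, the sketch below reads a scrape in the Prometheus text exposition format and derives a push error ratio from two counters. The metric names (`example_quay_push_total`, `example_quay_push_errors_total`, `example_quay_pull_total`) and the scrape contents are made up for this example; the names Quay actually exports may differ. The parser is deliberately minimal and assumes label values contain no `}` character.

```python
import re
from typing import Dict

# Matches `name value` or `name{labels} value`; the label set stays part of the key.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*(?:\{[^}]*\})?)\s+(\S+)')


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse sample lines from a text-format scrape, skipping comments and blanks."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match:
            samples[match.group(1)] = float(match.group(2))
    return samples


def error_ratio(samples: Dict[str, float], errors_key: str, total_key: str) -> float:
    """Errors divided by total; 0.0 when the total is missing or zero."""
    total = samples.get(total_key, 0.0)
    return samples.get(errors_key, 0.0) / total if total else 0.0


# Hypothetical scrape output; none of these metric names come from Quay itself.
SCRAPE = """\
# HELP example_quay_push_total Hypothetical counter of push attempts.
# TYPE example_quay_push_total counter
example_quay_push_total 200
example_quay_push_errors_total 5
example_quay_pull_total{repo="a"} 1000
"""

samples = parse_metrics(SCRAPE)
ratio = error_ratio(samples, "example_quay_push_errors_total", "example_quay_push_total")
```

A ratio is used here instead of a raw error count so that the same limit could be applied to registries with very different traffic volumes.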
       
      Previous Work:

      • Quay Prometheus endpoint
      • Quay operators (3 of them)
      • AppSRE alerting efforts (QRE-26, QRE-127, QRE-48, etc.)

      Open questions:

      • Do we need / want to allow configuration of thresholds or would it be hardcoded by us?
      • If the latter, how can we figure out thresholds that are applicable to all the different sizes of Quay and OCP clusters?
      • Identify the most noteworthy metrics (talk to AppSRE)?
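The first open question does not have to be answered with one extreme or the other. One possible middle ground, sketched below purely as an assumption, is to ship hardcoded defaults and let an admin override individual values. The threshold names and numbers are placeholders, and `effective_thresholds` is a hypothetical helper, not part of any Quay Operator.

```python
from typing import Dict

# Hypothetical defaults an Operator could ship; names and numbers are placeholders.
DEFAULT_THRESHOLDS: Dict[str, float] = {
    "outstanding_builds": 20,
    "outstanding_security_scans": 500,
}


def effective_thresholds(overrides: Dict[str, float]) -> Dict[str, float]:
    """Merge admin-supplied overrides onto the shipped defaults."""
    unknown = set(overrides) - set(DEFAULT_THRESHOLDS)
    if unknown:
        # Reject misspelled keys instead of silently ignoring them.
        raise ValueError(f"unknown threshold(s): {sorted(unknown)}")
    return {**DEFAULT_THRESHOLDS, **overrides}


merged = effective_thresholds({"outstanding_builds": 50})
```

Here `merged` keeps the default for security scans and uses the overridden value for builds. Whether such defaults can be chosen sensibly across cluster sizes remains the open question raised above.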

              syahmed@redhat.com Syed Ahmed
              dirk.herrmann Dirk Herrmann (Inactive)
              Dongbo Yan Dongbo Yan