  1. OpenShift GitOps
  2. GITOPS-4873

Add Argo CD Health checks for alerts


    • Type: Epic
    • Resolution: Unresolved
    • Operator
    • To Do

      Epic Goal

      • We currently have a PrometheusRule that alerts when an Argo CD application is out of sync. The goal of this Epic is to expand it to also support alerts on health status (Degraded, etc.).

      Why is this important?

      • Customers rely on OpenShift Monitoring to proactively notify them via alerts when resources are in a bad state. While alerting when applications are out of sync is useful, alerting on health state provides much-needed additional capabilities.
      • Having an alert on health status enables OpenShift Monitoring to leverage existing, as well as custom, Argo CD health checks to alert when resources are in a bad state, even when those resources do not have their own PrometheusRules defined. This essentially provides additional alerting to customers with no additional effort on their part.

      Scenarios

      1. Application is in a Degraded state, raise a critical alert
      2. Application is Progressing for more than 10 minutes, raise a warning alert
      3. Application is in any other non-healthy state (not Healthy, Suspended, Degraded or Progressing), raise a warning alert. Note: Degraded and Progressing are covered by #1 and #2 and are therefore excluded here.
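
      The three scenarios above could be expressed roughly as the PrometheusRule sketched below. This is an illustration only, not the operator's actual rule: it assumes the health state is read from the `health_status` label of Argo CD's `argocd_app_info` metric, and the rule name, group name, alert names, namespace and annotation text are all hypothetical. The 10 minute window for Progressing comes from scenario 2 and is still subject to the open questions below.

```yaml
# Sketch only: resource, group and alert names are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argocd-health-alerts        # hypothetical name
  namespace: openshift-gitops       # assumed namespace of the Argo CD instance
spec:
  groups:
    - name: ArgoCDHealthAlerts      # hypothetical group name
      rules:
        # Scenario 1: Degraded raises a critical alert.
        # No "for" clause is set because the scenario gives no duration.
        - alert: ArgoCDAppDegraded
          expr: argocd_app_info{health_status="Degraded"} > 0
          labels:
            severity: critical
          annotations:
            summary: "Argo CD application {{ $labels.name }} is Degraded"

        # Scenario 2: Progressing for more than 10 minutes raises a warning.
        - alert: ArgoCDAppProgressingTooLong
          expr: argocd_app_info{health_status="Progressing"} > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Argo CD application {{ $labels.name }} has been Progressing for more than 10 minutes"

        # Scenario 3: any remaining non-healthy state raises a warning.
        # Degraded and Progressing are excluded because rules 1 and 2 cover them.
        - alert: ArgoCDAppNotHealthy
          expr: argocd_app_info{health_status!~"Healthy|Suspended|Degraded|Progressing"} > 0
          labels:
            severity: warning
          annotations:
            summary: "Argo CD application {{ $labels.name }} is in state {{ $labels.health_status }}"
```

      With the negative matcher in the third rule, any health status outside the four listed ones (for example Missing or Unknown) would raise the warning, so custom health checks that report one of those states would be picked up without further rule changes.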

      Acceptance Criteria (Mandatory)

      • CI - MUST be running successfully with tests automated
      • Release Technical Enablement - Provide necessary release enablement details and documents.
      • Alerts raised as per scenarios

      Dependencies (internal and external)

      1. None

      Previous Work (Optional):

      1. Work done on Argo CD Out-Of-Sync alert

      Open questions:

      1. It would be good to solicit feedback from customers and the field about alert levels as well as the timing for Progressing (I picked 10 minutes, but that's arbitrary)
      2. Do alert levels and timing need to be configurable in the operator?

      Done Checklist

      • Acceptance criteria are met
      • Non-functional properties of the Feature have been validated (such as performance, resource, UX, security or privacy aspects)
      • User Journey automation is delivered
      • Support and SRE teams are provided with enough skills to support the feature in a production environment

              Assignee: Unassigned
              Reporter: Gerald Nunn (gnunn@redhat.com)