Subscription Watch / SWATCH-1638

Create a prometheus alert for non-progressing deployments

      If kube_deployment_status_replicas_updated{namespace=~"rhsm.*"} remains at 0 for any time series, the deployment is not progressing (a stale deployment) and should be investigated. We can start with 30 minutes as the `for` attribute, so that the alert fires when any deployment stays in that state for 30 consecutive minutes.
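
      A minimal sketch of what such a rule could look like, assuming a standard Prometheus rule file. The group name, alert name, severity value, and annotation text are hypothetical placeholders, not values taken from app-interface; only the expression and the 30 minute `for` come from the description above.

```yaml
groups:
  - name: swatch-deployments            # hypothetical group name
    rules:
      - alert: SwatchDeploymentNotProgressing   # hypothetical alert name
        # Matches every deployment in an rhsm* namespace that reports 0 updated replicas.
        expr: kube_deployment_status_replicas_updated{namespace=~"rhsm.*"} == 0
        # The condition must hold for 30 consecutive minutes before the alert fires.
        for: 30m
        labels:
          severity: medium              # assumption: use whatever severity the team routes on
        annotations:
          message: "Deployment {{ $labels.deployment }} in {{ $labels.namespace }} has had 0 updated replicas for 30 minutes."
```

      Note that a deployment intentionally scaled to 0 replicas would also match this expression; whether that needs to be excluded (for example by also checking kube_deployment_spec_replicas) is left open here.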

      This may be due to a quota issue, in which case the quota should be increased as needed. If it is due to a LimitRange ratio, the failing deployment should have its memory limit or CPU limit increased to satisfy the ratio via CPU_LIMIT or MEMORY_LIMIT.

      The `deployment` and `namespace` labels can be used to generate a URL directly to the deployment's events on the OpenShift console:

      $CONSOLE_BASE_URL/k8s/ns/{{ $labels.namespace }}/deployments/{{ $labels.deployment }}/events

      (replace $CONSOLE_BASE_URL)
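
      As a sketch, the URL could be carried on the alert as an annotation. The annotation key `link` is an assumption; use whichever key the alert routing already expects. Prometheus expands the `{{ $labels.* }}` templates but does not expand `$CONSOLE_BASE_URL`, so that part has to be replaced by hand with the real console address.

```yaml
annotations:
  # "link" is a hypothetical annotation key.
  # $CONSOLE_BASE_URL is a literal placeholder and must be replaced before use.
  link: "$CONSOLE_BASE_URL/k8s/ns/{{ $labels.namespace }}/deployments/{{ $labels.deployment }}/events"
```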

      Done:

      • playbook added to app-interface for remediation per above.
      • alert added to app-interface for both stage and prod.
      • alert tests added to app-interface for both stage and prod.
      • for stage, #swatch-alerts (Slack) should be the channel that is notified.
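
      The alert tests mentioned above could be sketched as a promtool unit test (run with `promtool test rules <file>`). This assumes a rule file named alerts.yaml containing an alert called SwatchDeploymentNotProgressing with `for: 30m`, a single extra label `severity: medium`, and no annotations; all of these names and values are hypothetical, and `exp_labels` / `exp_annotations` must be adjusted to match the real rule exactly.

```yaml
rule_files:
  - alerts.yaml                 # hypothetical rule file name

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # A deployment that reports 0 updated replicas from minute 0 through minute 40.
      - series: 'kube_deployment_status_replicas_updated{namespace="rhsm-stage", deployment="example"}'
        values: '0+0x40'
    alert_rule_test:
      # Before 30 minutes have elapsed the alert is only pending, so nothing is firing.
      - eval_time: 29m
        alertname: SwatchDeploymentNotProgressing
        exp_alerts: []
      # After 30 consecutive minutes at 0 the alert is expected to fire.
      - eval_time: 35m
        alertname: SwatchDeploymentNotProgressing
        exp_alerts:
          - exp_labels:
              severity: medium
              namespace: rhsm-stage
              deployment: example
```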

              awood1@redhat.com Alex Wood
              khowell@redhat.com Kevin Howell