-
Story
-
Resolution: Done
-
Normal
If any timeseries of kube_deployment_status_replicas_updated{namespace=~"rhsm.*"} remains at 0, the corresponding deployment is not progressing (a stale deployment) and should be investigated. We can start with 30 minutes as the `for` attribute, so that we are alerted when a deployment stays in this state for 30 consecutive minutes.
This may be due to a quota issue, in which case the quota should be increased as needed. If it is due to a LimitRange ratio, the failing deployment should have its CPU or memory limit increased via CPU_LIMIT or MEMORY_LIMIT so that the ratio is satisfied.
The `deployment` and `namespace` labels can be used to generate a URL that links directly to the deployment's events on the OpenShift console:
$CONSOLE_BASE_URL/k8s/ns/{{ $labels.namespace }}/deployments/{{ $labels.deployment }}/events
(replace $CONSOLE_BASE_URL with the base URL of the console for the target cluster)
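
A minimal sketch of how the alert described above might look as a Prometheus alerting rule. The group name, alert name, severity value, and annotation keys are assumptions made for illustration; only the expression, the 30 minute `for` duration, and the console URL pattern come from this story. The app-interface conventions for these fields may differ.

```yaml
groups:
  - name: swatch-stale-deployment            # hypothetical group name
    rules:
      - alert: SwatchDeploymentNotProgressing   # hypothetical alert name
        # Fires for every deployment in an rhsm* namespace whose updated replica count is 0.
        expr: kube_deployment_status_replicas_updated{namespace=~"rhsm.*"} == 0
        # The condition must hold for 30 consecutive minutes before the alert fires.
        for: 30m
        labels:
          severity: medium                   # assumption: use whatever severity the team agrees on
        annotations:
          message: "Deployment {{ $labels.deployment }} in {{ $labels.namespace }} has had 0 updated replicas for 30 minutes."
          # $CONSOLE_BASE_URL is a placeholder and must be replaced per environment.
          console_url: "$CONSOLE_BASE_URL/k8s/ns/{{ $labels.namespace }}/deployments/{{ $labels.deployment }}/events"
```

Note that a deployment intentionally scaled to zero replicas also reports 0 updated replicas, so it would match this expression as written.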
Done:
- Playbook added to app-interface for remediation per above.
- Alert added to app-interface for both stage and prod.
- Alert tests added to app-interface for both stage and prod.
- For stage, #swatch-alerts (Slack) should be the channel that is notified.
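
The alert tests could be sketched as a promtool unit test along the following lines. This assumes a rule file named `swatch-stale-deployment.rules.yaml` that defines an alert called `SwatchDeploymentNotProgressing` on the expression above with `for: 30m`, a `severity: medium` label, and `message` and `console_url` annotations. All of those names and values are hypothetical, as are the `rhsm-stage` namespace and `example` deployment used as input.

```yaml
rule_files:
  - swatch-stale-deployment.rules.yaml       # hypothetical rule file name

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # A deployment that reports 0 updated replicas for 41 consecutive samples.
      - series: 'kube_deployment_status_replicas_updated{namespace="rhsm-stage", deployment="example"}'
        values: '0x40'
    alert_rule_test:
      # Before the 30 minute `for` window has elapsed, the alert is only pending.
      - eval_time: 29m
        alertname: SwatchDeploymentNotProgressing
        exp_alerts: []
      # After the window has elapsed, the alert is firing.
      - eval_time: 31m
        alertname: SwatchDeploymentNotProgressing
        exp_alerts:
          - exp_labels:
              severity: medium               # must match the label set in the rule
              namespace: rhsm-stage
              deployment: example
            exp_annotations:                 # must match the annotations set in the rule
              message: "Deployment example in rhsm-stage has had 0 updated replicas for 30 minutes."
              console_url: "$CONSOLE_BASE_URL/k8s/ns/rhsm-stage/deployments/example/events"
```

A file like this can be run with `promtool test rules <test-file>`. The expected labels and annotations have to be kept in sync with whatever the real rule defines.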