Type: Story
Resolution: Done
Priority: Normal
Fix Version/s: 2023-09-18 - API
Affects Version/s: None
Component/s: None
Labels:
- refined
- short-list

Blocked:
False
Blocked Reason:
None
Ready:
False

If kube_deployment_status_replicas_updated{namespace=~"rhsm.*"} stays at 0 for any time series, the deployment is not progressing (a stale deployment) and should be investigated. We can start with 30 minutes as the `for` attribute, so that an alert fires when any deployment stays in this state for 30 consecutive minutes.
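
The condition above could be expressed as a Prometheus alerting rule along the following lines. This is a minimal sketch in the plain Prometheus rule-file format; the group name and alert name are hypothetical, and the rule actually added to app-interface may be structured differently.

```yaml
groups:
  - name: swatch-deployments            # hypothetical group name
    rules:
      - alert: SwatchDeploymentStale    # hypothetical alert name
        # Keep only the time series whose updated-replica count is 0.
        expr: kube_deployment_status_replicas_updated{namespace=~"rhsm.*"} == 0
        # Fire only after the condition has held for 30 consecutive minutes.
        for: 30m
```

A deployment that is intentionally scaled to zero replicas would also match this expression, so such deployments may need to be excluded.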

This may be due to a quota issue, in which case the quota should be increased as needed. If it is due to a LimitRange ratio, the failing deployment should have its memory limit or CPU limit increased via MEMORY_LIMIT or CPU_LIMIT to satisfy the ratio.

The `deployment` and `namespace` labels can be used to generate a URL that links directly to the deployment's events on the OpenShift console:

$CONSOLE_BASE_URL/k8s/ns/{{ $labels.namespace }}/deployments/{{ $labels.deployment }}/events

(Replace $CONSOLE_BASE_URL with the base URL of the OpenShift console.)
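
As an illustration, the URL could be attached to the alert as an annotation. This fragment is only a sketch: the annotation key `link` and the host `console.example.com` are placeholders, not values taken from this issue. Prometheus expands the `{{ $labels.* }}` templates but does not substitute `$CONSOLE_BASE_URL`, so the base URL has to be written out literally.

```yaml
annotations:
  # "link" is a hypothetical annotation key.
  # https://console.example.com stands in for $CONSOLE_BASE_URL.
  link: "https://console.example.com/k8s/ns/{{ $labels.namespace }}/deployments/{{ $labels.deployment }}/events"
```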

Done:

A playbook is added to app-interface covering the remediation described above.
The alert is added to app-interface for both stage and prod.
Alert tests are added to app-interface for both stage and prod.
For stage, notifications should go to #swatch-alerts (Slack).
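
An alert test for this rule might look like the following promtool unit test, runnable with `promtool test rules <file>`. It is a sketch under stated assumptions: the rule file name, the alert name `SwatchDeploymentStale`, the namespace `rhsm-stage`, and the deployment name `example` are all hypothetical, and the rule is assumed to define no extra labels or annotations.

```yaml
rule_files:
  - deployment-alerts.yaml   # hypothetical file containing the alerting rule

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # The updated-replica count stays at 0 for the first 40 minutes.
      - series: 'kube_deployment_status_replicas_updated{namespace="rhsm-stage", deployment="example"}'
        values: '0+0x40'
    alert_rule_test:
      # Before 30 minutes have elapsed the alert is still pending, so nothing fires.
      - eval_time: 29m
        alertname: SwatchDeploymentStale
      # After 30 consecutive minutes at 0 the alert is expected to fire.
      - eval_time: 35m
        alertname: SwatchDeploymentStale
        exp_alerts:
          - exp_labels:
              namespace: rhsm-stage
              deployment: example
            # If the rule defines labels or annotations, mirror them here
            # (exp_annotations); promtool compares them as well.
```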

mentioned on

Merge request - SWATCH-1638: Add alerting for deadlocked deployments

Solved by commit d661c1b7913cccc38847ff3d61b45d3f8a7f165d.

Assignee: Alex Wood

Reporter: Kevin Howell

Votes: 0

Watchers: 2

Created: 2023/08/28 3:44 PM

Updated: 2023/09/18 3:42 PM

Resolved: 2023/09/13 7:25 PM
