Type: Task
Resolution: Done
Priority: Undefined
Fix Version/s: None
Affects Version/s: None
Component/s: Provisioning
Labels:
None

Sprint:
EnVision Sprint 32, EnVision Sprint 33
Epic Link:
HMS-1712

I am struggling with our major SLO: reservation success rate. As you can see on our production cluster, the success rate is currently below 70%. About half of these failures come from GCP, which is still in development; the other half are AWS failures.
Our SLO alert currently fires "randomly" because we use a Prometheus counter, which resets every now and then. I created a new metric that is calculated by the stats process in the background (a gauge metric). That will work, but it will give the SLO a very long recovery time because our traffic is pretty low.
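To illustrate one way a counter reset can distort a rate, here is a minimal sketch. The sample values and function names are made up for illustration; this is not our exporter code, only the general idea of a naive difference versus a reset-aware increase.

```python
def naive_increase(samples):
    """Difference between the last and first sample; breaks when the counter resets."""
    return samples[-1] - samples[0]


def reset_aware_increase(samples):
    """Sum the deltas between scrapes; a drop is treated as a restart from zero."""
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


# Hypothetical counter samples; the process restarts after the third scrape.
samples = [100, 110, 120, 3, 8]
print(naive_increase(samples))        # -92
print(reset_aware_increase(samples))  # 28
```

A gauge computed by the stats job avoids the problem entirely, because it reports an absolute value for the window rather than a running total.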
Therefore I propose changing our major SLO from "success rate" to "processing rate", similar to what we do with sources availability checks. We cannot do anything if a customer fires 10 unsuccessful launches because they revoked a permission. An alert would notify us, but there is nothing we could act on. On the other hand, reservations not being processed is an important signal and an alert is appropriate: workers are perhaps down, Kafka is down, jobs are getting killed (OOM), or something like that.
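A rough sketch of the difference between the two rates, assuming each reservation ends up in one of three states (succeeded, failed, pending) and that pending reservations count in both denominators. The function names and the treatment of an empty window are assumptions, not existing code.

```python
def success_rate(succeeded, failed, pending):
    """Share of reservations that finished successfully."""
    total = succeeded + failed + pending
    return succeeded / total if total else 1.0


def processed_rate(succeeded, failed, pending):
    """Share of reservations that reached any final state, success or failure."""
    total = succeeded + failed + pending
    return (succeeded + failed) / total if total else 1.0


# A customer revokes a permission and fires 10 failing launches:
print(success_rate(10, 10, 0))    # 0.5
print(processed_rate(10, 10, 0))  # 1.0

# Workers are down and half of the reservations are stuck:
print(success_rate(5, 0, 5))      # 0.5
print(processed_rate(5, 0, 5))    # 0.5
```

Only the second scenario is something we can act on, and only the processed rate tells the two apart.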
If you agree, I would prepare three changes:

Update our stats job to also provide "pending" reservations in the last 24h/28d windows.
Update our dashboard to show both success rate and pending rate (the new SLO).
Update our SLO documents and alert to reflect the change.
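The first change could look roughly like the sketch below. It assumes reservations are available as `(created_at, status)` pairs and that unprocessed ones carry the status `"pending"`; both the data shape and the status string are hypothetical, and the real stats job would more likely run a database query.

```python
from datetime import datetime, timedelta

# The two windows named in the proposal.
WINDOWS = {"24h": timedelta(hours=24), "28d": timedelta(days=28)}


def pending_counts(reservations, now):
    """Count pending reservations created within each window ending at `now`."""
    counts = {}
    for name, span in WINDOWS.items():
        cutoff = now - span
        counts[name] = sum(
            1
            for created_at, status in reservations
            if status == "pending" and created_at >= cutoff
        )
    return counts


now = datetime(2023, 7, 14, 12, 0)
reservations = [
    (now - timedelta(hours=1), "pending"),
    (now - timedelta(days=2), "pending"),
    (now - timedelta(days=3), "success"),
    (now - timedelta(days=30), "pending"),
]
print(pending_counts(reservations, now))  # {'24h': 1, '28d': 2}
```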

Instead of the "success rate > 70%" alert, I suggest "processed rate > 90%". We can afford to raise the alert threshold in this case, since almost all of our reservations are processed correctly.

mentioned on

Merge request - Drop SLO Processing rate

Assignee: Lukáš Zapletal

Reporter: Lukáš Zapletal

QA Contact: None

Doc Contact: None

Votes: 0

Watchers: 3

Created: 2023/07/14 11:17 AM

Updated: 2023/09/12 10:01 AM

Resolved: 2023/08/01 8:38 AM
