-
Task
-
Resolution: Done
-
EnVision Sprint 32, EnVision Sprint 33
I am struggling with our major SLO: reservation success rate. As you can see on our production cluster, the success rate is currently below 70%. About half of these failures come from GCP, which is still in development, while the other half come from AWS.
Our SLO alert currently fires "randomly" because we use a Prometheus counter, which resets every now and then. I created a new metric, a gauge calculated by the stats process in the background, and that will work, but it will give the SLO a very long recovery time because our traffic is pretty low.
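To illustrate the slow recovery, here is a rough sketch of how many consecutive successful reservations it takes to bring a windowed success rate back to a target. It assumes no old reservations drop out of the window in the meantime; the function name and parameters are hypothetical and not part of our stats job.

```python
import math


def successes_needed(total: int, succeeded: int, target: float) -> int:
    """Consecutive successes needed to reach `target` (0 < target < 1).

    Simplifying assumption: no older reservations leave the window while
    the new ones arrive.
    """
    if total > 0 and succeeded / total >= target:
        return 0
    # Solve (succeeded + k) / (total + k) >= target for k.
    k = (target * total - succeeded) / (1.0 - target)
    return max(0, math.ceil(k))
```

Under these assumptions, a window holding 100 reservations with 60 successes needs 34 further successes in a row to get back to 70%. At low traffic that can take a long time.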
Therefore I propose changing our major SLO from "success rate" to "processing rate", similar to what we do with sources availability checks. If a customer fires 10 unsuccessful launches because they revoked a permission, an alert will notify us, but there is nothing we can do about it. On the other hand, reservations not being processed is an important signal and an alert is appropriate: workers are perhaps down, Kafka is down, jobs are getting killed (OOM), or something like that.
If you agree, I would prepare three changes:
- Update our stats job to also provide "pending" reservations in the last 24h/28d windows.
- Update our dashboard to show both success rate and pending rate (the new SLO).
- Update our SLO documents and alert to reflect the change.
Instead of alerting on "success rate > 70%", I suggest "processed rate > 90%". We can afford to raise the alert threshold in this case since almost all of our reservations are being processed correctly.
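A minimal sketch of how the two rates could be computed over a window and checked against the proposed threshold. It assumes a reservation counts as "processed" once it has either succeeded or failed, and as "pending" otherwise. The `Reservation` type, the status strings, and the function names are hypothetical and do not reflect the actual stats job.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass
class Reservation:
    created_at: datetime
    status: str  # hypothetical values: "success", "failure", "pending"


def window_rates(reservations: Iterable[Reservation],
                 now: datetime,
                 window: timedelta) -> dict:
    """Return success and processed rates for reservations inside the window."""
    recent = [r for r in reservations if now - r.created_at <= window]
    total = len(recent)
    if total == 0:
        # No traffic in the window: the rates are undefined.
        return {"total": 0, "success_rate": None, "processed_rate": None}
    succeeded = sum(r.status == "success" for r in recent)
    pending = sum(r.status == "pending" for r in recent)
    return {
        "total": total,
        "success_rate": succeeded / total,
        # Failed reservations still count as processed.
        "processed_rate": (total - pending) / total,
    }


PROCESSED_RATE_THRESHOLD = 0.90  # proposed SLO: processed rate > 90%


def should_alert(processed_rate: Optional[float]) -> bool:
    """Alert only when the rate is defined and not above the threshold."""
    return processed_rate is not None and processed_rate <= PROCESSED_RATE_THRESHOLD
```

With this definition, a burst of customer-side failures lowers the success rate but leaves the processed rate untouched, so the alert would only fire when reservations are stuck.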