-
Task
-
Resolution: Won't Do
-
Major
WHAT
We don't have SRE alerts covering scenarios where users hit service limits, because there's no action that our SRE would perform in such cases.
We expose metrics that customers can use to monitor their own usage against the limits, so we should give them a clear way to tell for themselves when they are hitting those limits.
WHY
So that customers can figure out why they're seeing degraded performance (or crashes, etc.) for some of their client applications.
HOW
- Ensure sufficient metrics are exposed to monitor usage against the service limits
- Review the Grafana dashboard template for additions that would better highlight usage against the service limits (our internal usage limits dashboards in Observatorium might be a good source of inspiration).
- Consider a follow-up blog post to this one, focused more on alerting
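As a rough illustration of what "usage against the service limits" could mean for a customer-side check, here is a minimal Python sketch. It assumes each usage metric has a matching limit metric and flags anything at or above a chosen fraction of its limit. The metric names, the `LimitSample` type, the sample values, and the 0.8 threshold are all hypothetical and are not taken from the actual exposed metrics.

```python
from dataclasses import dataclass


@dataclass
class LimitSample:
    name: str     # hypothetical metric name, for illustration only
    usage: float  # current value reported by the usage metric
    limit: float  # value reported by the matching limit metric


def usage_ratio(sample: LimitSample) -> float:
    """Return usage as a fraction of the limit (1.0 means the limit is reached)."""
    if sample.limit <= 0:
        raise ValueError(f"limit for {sample.name} must be positive")
    return sample.usage / sample.limit


def near_limit(samples, threshold=0.8):
    """Return the names of metrics at or above `threshold` of their limit."""
    return [s.name for s in samples if usage_ratio(s) >= threshold]


# Made-up sample values: one metric close to its limit, one well below it.
samples = [
    LimitSample("broker_connection_count", 950, 1000),
    LimitSample("broker_connection_creation_rate", 20, 100),
]

print(near_limit(samples))  # ['broker_connection_count']
```

The same ratio (usage divided by limit) is what a dashboard panel or an alert rule would plot or threshold on; the threshold itself is a judgment call for each customer.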
DONE
Users have a clear path to observe their usage against the limits, and don't need to ask us for help.
- is blocked by: MGDSTRM-10148 Publish per-broker connection count/creation rate and limits metrics
- Tasking and Estimation