-
Task
-
Resolution: Obsolete
-
Improvement
-
To alert on the availability of the various proxies in front of the apiserver, and so detect network issues between clients and the apiserver, we need more insight into the current latencies and error rates of these proxies.
This use case is a good fit for SLO-based alerting (https://sre.google/workbook/alerting-on-slos/), since we want to know the current availability of the proxies and detect periods of unavailability that could be caused by network issues.
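As a rough illustration of the burn-rate idea from the linked SRE workbook chapter, the sketch below computes how fast an error budget is being consumed and applies a multi-window check. The 99.9% target is a hypothetical placeholder (the proxies' SLOs are not yet determined), and the function names are made up for this example; the 14.4 threshold is the workbook's example value for a 1-hour window on a 30-day SLO.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than 'exactly on budget' errors occur.

    error_ratio: fraction of failed requests in the window (0.0 to 1.0).
    slo_target:  availability objective, e.g. 0.999 (hypothetical value).
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Multi-window check: both windows must exceed the burn-rate threshold.

    The short window keeps the alert from firing long after the
    problem has already stopped.
    """
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # 2% errors against a hypothetical 99.9% SLO burns budget about 20x too fast.
    print(should_page(0.02, 0.02, slo_target=0.999))
```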
To create SLO-based alerts we first need to determine each proxy's SLO, which requires gathering data on its latency and error rate. For latency, we would ideally want the 99th percentile, since tail latency surfaces failures that an average would hide.
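To make the two signals concrete, here is a minimal sketch of how an error rate and a 99th-percentile latency could be derived from raw request observations. It is not how the recording rules compute them; the `Request` type, the "5xx counts as an error" rule, and the nearest-rank percentile method are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: int    # observed request duration in milliseconds
    status_code: int   # HTTP status returned through the proxy


def error_rate(requests: list[Request]) -> float:
    """Fraction of requests that failed (assumed here: any 5xx status)."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if r.status_code >= 500)
    return errors / len(requests)


def percentile(values: list[int], percent: int) -> int:
    """Nearest-rank percentile; percent is an integer from 1 to 100."""
    if not values:
        raise ValueError("no values to summarize")
    ordered = sorted(values)
    # Ceiling division in integer arithmetic avoids float rounding issues.
    rank = -(-percent * len(ordered) // 100)
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    sample = [Request(latency_ms=i, status_code=200) for i in range(1, 98)]
    sample += [Request(latency_ms=i, status_code=503) for i in range(98, 101)]
    print(error_rate(sample))
    print(percentile([r.latency_ms for r in sample], 99))
```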
The different types of connections to the apiserver that we want to get insights on are:
- Direct connection
- APIServer Service
- Internal load balancer
- Everything else
Since we have 4 connection types and 2 signals to gather for each (latency and error rate), we should theoretically only need to send 8 series (4 × 2) to Telemetry.
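The series count can be checked by enumerating the combinations. The label values and signal names below are hypothetical stand-ins, not the names used by the actual recording rules.

```python
from itertools import product

# Hypothetical label values for the four connection types listed above.
CONNECTION_TYPES = ["direct", "apiserver-service", "internal-load-balancer", "other"]
# Hypothetical names for the two signals gathered per connection type.
SIGNALS = ["latency_p99", "error_rate"]


def telemetry_series() -> list[dict[str, str]]:
    """One series per (connection type, signal) pair."""
    return [
        {"connection": connection, "signal": signal}
        for connection, signal in product(CONNECTION_TYPES, SIGNALS)
    ]


if __name__ == "__main__":
    for series in telemetry_series():
        print(series)
```

Keeping the label set this small is what bounds the cardinality: adding any extra label to these series would multiply the count sent to Telemetry.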
A WIP document for the telemetry request is available here: https://docs.google.com/document/d/1AJ8H2K4h3FVPfPLVzivIFYQxmJLvl3vZjXn1_9jkjqk/edit?usp=sharing
This PR creates recording rules that will then be used to send data via Telemetry: https://github.com/openshift/cluster-kube-apiserver-operator/pull/1272