Type: Task
Resolution: Obsolete
Priority: Undefined
Fix Version/s: openshift-4.11
Affects Version/s: None
Component/s: None
Labels: None
Blocked: False
Ready: False
Epic Link: observability: operators should use component-base metrics package

To alert on the availability of the various proxies in front of the apiserver, and thereby detect network issues between clients and the apiserver, we need more insight into the current latencies and error rates of these proxies.

This use case is a good fit for SLO-based alerting (https://sre.google/workbook/alerting-on-slos/), since we want to know the current availability of the proxies and detect outages that could be caused by network issues.
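As a rough illustration of what SLO-based alerting means here, the sketch below computes an error-budget burn rate and applies the multi-window check described in the linked SRE workbook chapter. The 99.9% target and the function names are assumptions made for the example, not values agreed for these proxies. The 14.4 threshold is the workbook's figure: sustained for one hour, it consumes 2% of a 30-day error budget.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    A burn rate of 1.0 means the budget would be exhausted exactly at the
    end of the SLO period; higher values exhaust it sooner.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target  # fraction of requests allowed to fail
    return error_ratio / error_budget


def should_page(long_window_ratio: float,
                short_window_ratio: float,
                slo_target: float = 0.999,   # assumed target, not from the ticket
                threshold: float = 14.4) -> bool:
    """Multi-window check: both windows must exceed the burn-rate threshold.

    The long window (e.g. 1h) shows the problem is significant; the short
    window (e.g. 5m) shows it is still happening.
    """
    return (burn_rate(long_window_ratio, slo_target) > threshold
            and burn_rate(short_window_ratio, slo_target) > threshold)
```

With the assumed 99.9% target, a proxy failing 2% of requests over the long window and 3% over the short window would page, while one that has already recovered in the short window would not.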

To create SLO-based alerts, we first need to determine an SLO for each proxy, which requires data on its latency and error rate. For latency, we would ideally use the 99th percentile, since tail latency exposes failures that a mean or median would hide.
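To make the two signals concrete, here is a minimal sketch that derives a 99th-percentile latency and an error rate from raw request samples. It uses the nearest-rank percentile definition on in-memory data purely for illustration; the actual recording rules would work on aggregated metrics instead, and treating HTTP 5xx responses as errors is an assumption of this example.

```python
import math
from statistics import mean


def percentile_nearest_rank(values, q):
    """Return the q-th percentile (0 < q <= 100) using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def error_rate(status_codes):
    """Return the fraction of responses with a 5xx status (assumed error definition)."""
    if not status_codes:
        raise ValueError("status_codes must not be empty")
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes)


# Hypothetical sample: 98 fast requests and 2 that hit a timeout.
latencies_ms = [10] * 98 + [5000] * 2
average_ms = mean(latencies_ms)
p99_ms = percentile_nearest_rank(latencies_ms, 99)
```

In this made-up sample the mean is about 110 ms, which looks healthy, while the 99th percentile is 5000 ms and makes the two timed-out requests visible.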

The types of connections to the apiserver that we want insight into are:

Direct connection
APIServer Service
Internal load balancer
Everything else

Since we have 4 connection types and 2 signals (latency and error rate) for each, we should in theory only need to send 4 × 2 = 8 series to Telemetry.
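The series count can be checked by enumerating every combination of connection type and signal. The label values and series names below are hypothetical placeholders chosen for this sketch; they are not the names used by the recording rules in the PR.

```python
from itertools import product

# Hypothetical label values for the four connection types listed above.
CONNECTION_TYPES = ("direct", "service", "internal-lb", "other")

# Hypothetical names for the two signals gathered per connection type.
SIGNALS = ("latency_p99", "error_rate")


def telemetry_series():
    """Return one series identifier per (connection type, signal) pair."""
    return [
        f'{signal}{{connection="{connection}"}}'
        for connection, signal in product(CONNECTION_TYPES, SIGNALS)
    ]


series = telemetry_series()
```

Because each pair produces exactly one pre-aggregated series, the total stays at 8 regardless of how many clients or proxy instances a cluster has.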

A WIP document for the telemetry request is available here: https://docs.google.com/document/d/1AJ8H2K4h3FVPfPLVzivIFYQxmJLvl3vZjXn1_9jkjqk/edit?usp=sharing

This PR creates recording rules that will then be used to send data via Telemetry: https://github.com/openshift/cluster-kube-apiserver-operator/pull/1272

Assignee: Antonio Ojea (Inactive)

Reporter: Damien Grisonnet

Created: 2021/12/17 5:42 PM

Updated: 2022/08/26 1:57 PM

Resolved: 2022/03/04 11:00 AM
