Type: Story
Resolution: Unresolved
Sprints: NetObserv - Sprint 276, NetObserv - Sprint 277, NetObserv - Sprint 278
Review other (non-netobserv) metrics available out there, and see if we can leverage our alerting+health mechanism on them too.
E.g. (a sketch of a possible alert rule follows the list):
- ingress errors (haproxy_server_http_responses_total)
- ingress performance degrading (? haproxy_server_http_average_response_latency_milliseconds)
- ingress connections coming close to capacity
- apiserver errors (code:apiserver_request_total:rate5m{apiserver="kube-apiserver"})
- apiserver tls handshake errors (cluster:apiserver_tls_handshake_errors_total:rate5m{apiserver="kube-apiserver"})
- apiserver performance degrading (??)
- OVN errors (ovnkube_node_cni_request_duration_seconds_count{err!="false"})
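As a minimal sketch of how one of these metrics could plug into the alerting mechanism, here is a hypothetical PrometheusRule on the ingress error metric. The `code="5xx"` label filter, the 10% threshold and the rule/alert names are illustrative assumptions, not a confirmed design:

```yaml
# Hypothetical rule: alert when the ingress 5xx error ratio exceeds 10% for
# 10 minutes. Labels, threshold and names are assumptions for illustration.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ingress-error-health
spec:
  groups:
    - name: ingress-health
      rules:
        - alert: IngressHighErrorRate
          expr: |
            sum(rate(haproxy_server_http_responses_total{code="5xx"}[5m]))
              /
            sum(rate(haproxy_server_http_responses_total[5m])) > 0.10
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Ingress is returning a high ratio of 5xx responses"
```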
See also these RFEs: RFE-2935, RFE-8004.
The latter provides some examples of PromQL to alert on. It also makes an interesting point about relying on node-exporter metrics as well, since they don't overlap with netobserv's own drop metrics (kernel drops vs device drops).
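To illustrate that last point, node-exporter's per-device drop counters (e.g. node_network_receive_drop_total / node_network_transmit_drop_total) could feed a similar rule, complementary to the kernel-level drops netobserv already reports. The device filter and threshold below are assumptions for illustration only:

```yaml
# Hypothetical rule on node-exporter device drop counters; threshold and
# device filter are illustrative assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-device-drops
spec:
  groups:
    - name: node-network-health
      rules:
        - alert: NodeDeviceDropsIncreasing
          expr: |
            sum by (instance, device) (
                rate(node_network_receive_drop_total{device!="lo"}[5m])
              + rate(node_network_transmit_drop_total{device!="lo"}[5m])
            ) > 10
          for: 15m
          labels:
            severity: info
          annotations:
            summary: "Device {{ $labels.device }} on {{ $labels.instance }} is dropping packets"
```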