Network Observability / NETOBSERV-2356

Extend health to non-netobserv metrics

    • Type: Story
    • Resolution: Unresolved
    • Priority: Undefined
    • Sprint: NetObserv - Sprint 276, NetObserv - Sprint 277, NetObserv - Sprint 278

      Review other (non-netobserv) metrics available in the cluster, and see whether we can apply our alerting and health mechanism to them too.

      E.g.:

      • ingress errors (haproxy_server_http_responses_total)
      • ingress performance degrading (possibly haproxy_server_http_average_response_latency_milliseconds)
      • ingress connections coming close to capacity
      • apiserver errors (code:apiserver_request_total:rate5m{apiserver="kube-apiserver"})
      • apiserver TLS handshake errors (cluster:apiserver_tls_handshake_errors_total:rate5m{apiserver="kube-apiserver"})
      • apiserver performance degrading (metric still to be identified)
      • OVN errors (ovnkube_node_cni_request_duration_seconds_count{err!="false"})
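
To make the candidates above more concrete, here is a rough sketch of what alert expressions on some of them might look like. It is only an illustration, not an agreed design. The thresholds, the grouping labels (`namespace`, `route`, `instance`) and the assumed values of the `code` labels ("5xx" for the router, numeric status codes for the apiserver recording rule) are assumptions to verify against a live cluster. Each expression is a separate query.

```promql
# Ingress errors: share of 5xx responses per route over the last 5 minutes.
# Assumption: the router exposes a `code` label with values such as "5xx",
# and `namespace` / `route` labels to group by.
sum by (namespace, route) (rate(haproxy_server_http_responses_total{code="5xx"}[5m]))
  /
sum by (namespace, route) (rate(haproxy_server_http_responses_total[5m]))
  > 0.05   # hypothetical threshold (5%)

# API server errors: share of 5xx codes, built on the recording rule named above.
# Assumption: its `code` label holds numeric HTTP status codes.
sum(code:apiserver_request_total:rate5m{apiserver="kube-apiserver", code=~"5.."})
  /
sum(code:apiserver_request_total:rate5m{apiserver="kube-apiserver"})
  > 0.05   # hypothetical threshold (5%)

# OVN errors: rate of CNI requests whose `err` label is not "false".
# Assumption: grouping by `instance` is meaningful for this metric.
sum by (instance) (rate(ovnkube_node_cni_request_duration_seconds_count{err!="false"}[5m]))
  > 0
```

Expressing errors as a ratio rather than an absolute rate keeps a single threshold usable across routes and clusters with very different traffic volumes.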

       

      See also these RFEs: RFE-2935, RFE-8004 

      The latter provides some example PromQL queries to alert on. It also raises an interesting point: relying on node-exporter metrics as well makes sense, as they don't overlap with netobserv's own drop metrics (kernel drops vs device drops).
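
As an illustration of the node-exporter idea, a device-level drop alert could look roughly like the sketch below. It is not taken from RFE-8004; the device filter and the threshold are hypothetical and would need tuning.

```promql
# Receive-side drops as a share of received packets, per node and interface.
# Assumption: loopback and veth devices are excluded to reduce noise.
rate(node_network_receive_drop_total{device!~"lo|veth.*"}[5m])
  /
rate(node_network_receive_packets_total{device!~"lo|veth.*"}[5m])
  > 0.01   # hypothetical threshold (1%)
```

The same shape applies to the transmit side with node_network_transmit_drop_total and node_network_transmit_packets_total.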

              Assignee: Unassigned
              Reporter: Joel Takvorian (jtakvori)
              Mehul Modi