OpenShift SDN / SDN-3848

OVN health monitoring using Grafana


    • Epic
    • Resolution: Unresolved
    • OVN Kubernetes, SDN Core
    • OVN health monitoring using Grafana
    • To Do

      Epic Goal

      Customers need a proactive way to view the current state of OVN, along with indicators that distinguish a healthy network overlay from an unhealthy one. These indicators should be based on attributes that can be inspected in a lightweight manner and exported to Prometheus.

      The metrics and alerts needed to support this goal are unknown and untested. We will therefore "dogfood" a new Grafana dashboard with support and engineering in order to get feedback and iterate. Following that feedback, an epic in a later release will cover delivery to customers (CUs).

      Steps by support/engineering to investigate a CU's cluster:

      1. Gather a slice of the Prometheus DBs from now back to some time in the past.
        1. This can be accomplished by must-gather or a separate tool in the network-tooling repo.
      2. Upload this data to a Prometheus instance and show the current and past functional and performance health of OVN-K.
        1. Need to find out how we can have a long-running instance of Prometheus that support and engineering can consume.
      3. Determine what other data can be pulled from a must-gather into the Grafana dashboard to be displayed alongside the Prometheus metrics.
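
      As a rough illustration of the "slice from now to some time in the past" idea in step 1, the sketch below builds a range query against the standard Prometheus HTTP API (`/api/v1/query_range`) and flattens the JSON response. It is not the must-gather or network-tooling implementation. The base URL, the `up` query, and the function names are assumptions chosen for illustration only.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def build_range_query_url(base_url, promql, end, lookback, step_seconds=60):
    """Build a /api/v1/query_range URL covering [end - lookback, end]."""
    start = end - lookback
    params = {
        "query": promql,
        "start": f"{start.timestamp():.0f}",  # Unix seconds
        "end": f"{end.timestamp():.0f}",
        "step": f"{step_seconds}s",           # resolution of the slice
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


def parse_matrix(response):
    """Flatten a query_range JSON body into {labels: [(timestamp, value), ...]}."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    series = {}
    for item in response["data"]["result"]:
        # Sorted label pairs give a stable, hashable key per time series.
        key = tuple(sorted(item["metric"].items()))
        series[key] = [(float(ts), float(value)) for ts, value in item["values"]]
    return series


if __name__ == "__main__":
    # Hypothetical endpoint and placeholder query; substitute real ones.
    now = datetime.now(timezone.utc)
    print(build_range_query_url("http://localhost:9090", "up", now, timedelta(hours=6)))
```

      The URL can be fetched with any HTTP client, and the decoded JSON body passed to `parse_matrix`. Which OVN-K queries are worth slicing this way is exactly what this epic sets out to discover.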

      Steps by CU to investigate their cluster (to be accomplished in a following epic):

      1. Install Grafana
      2. Install our dashboard from Grafana's dashboard "store"

      Why is this important?

      • We need a systematic method for determining OVN-K health. First, this will aid engineering and support in troubleshooting issues. Second, it will let customers better understand the state of their cluster and avoid potential issues.

      Scenarios

      Indicators of sub-par OVN health to inform proactive management.
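
      One way to think about such indicators is as a set of thresholds checked against the latest sample of each metric. The sketch below is purely illustrative: the indicator names and limits are hypothetical placeholders, since the actual metrics and alerts are still to be determined by this epic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    indicator: str      # hypothetical indicator name, not a real OVN-K metric
    max_healthy: float  # values above this are treated as unhealthy


def unhealthy_indicators(latest, thresholds):
    """Return (indicator, reason) pairs for every indicator outside its limit.

    `latest` maps indicator name -> most recent sample value. A missing
    sample is reported as well, since absent data is itself a signal.
    """
    flagged = []
    for t in thresholds:
        value = latest.get(t.indicator)
        if value is None:
            flagged.append((t.indicator, "no data"))
        elif value > t.max_healthy:
            flagged.append((t.indicator, f"{value} > {t.max_healthy}"))
    return flagged


if __name__ == "__main__":
    # Placeholder thresholds and samples, for illustration only.
    limits = [Threshold("indicator_a", 1.0), Threshold("indicator_b", 5.0)]
    print(unhealthy_indicators({"indicator_a": 0.5, "indicator_b": 7.0}, limits))
```

      In practice the same logic would more likely live in Prometheus alerting rules or Grafana panel thresholds; the Python form is used here only to make the idea concrete.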

      Acceptance Criteria

      • CI - MUST be running successfully with tests automated
      • Release Technical Enablement - Provide necessary release enablement details and documents.
      • ...

      Dependencies (internal and external)

      1. ...

      Previous Work (Optional):

      POC in 4.13:

      1. https://issues.redhat.com/browse/SDN-3622
      2. https://issues.redhat.com/browse/SDN-3820

      Open questions:


      Done Checklist

      • CI - CI is running, tests are automated and merged.
      • Release Enablement <link to Feature Enablement Presentation>
      • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
      • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
      • DEV - Downstream build attached to advisory: <link to errata>
      • QE - Test plans in Polarion: <link or reference to Polarion>
      • QE - Automated tests merged: <link or reference to automated tests>
      • DOC - Downstream documentation merged: <link to meaningful PR>

              Unassigned Unassigned
              mkennell@redhat.com Martin Kennelly