Epic
Resolution: Unresolved
OVN health monitoring using Grafana
Epic Goal
Customers require a proactive way to view the current state of OVN, and want indicators of whether the network overlay is healthy or unhealthy. These indicators should come from signals that can be collected in a lightweight manner and exported to Prometheus.
The metrics and alerts needed to support this goal are unknown and untested, so we are going to "dogfood" a new Grafana dashboard with support and engineering in order to get feedback and iterate. Following that feedback, an epic in another release will cover delivery to CUs.
Steps by support/engineering to investigate a CU's cluster:
- Gather a slice of the Prometheus DBs from now back to some time in the past.
- This can be accomplished by must-gather or by a separate tool in the network-tooling repo.
- Upload this data to a Prometheus instance and show the current and past functional and performance health of OVN-K.
- Find out how we can have a long-running Prometheus instance that support and engineering can consume.
- Determine what other data can be pulled from a must-gather into the Grafana dashboard and displayed alongside the Prometheus metrics.
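Once a slice of data is available in a Prometheus instance, the "now to some time in the past" view could be read back through Prometheus' standard HTTP API (`/api/v1/query_range`). The following is a minimal sketch under stated assumptions, not a chosen design: the helper names, the base URL, and any PromQL expression passed in are placeholders, and no specific OVN-K metric names are implied.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode
from urllib.request import urlopen


def build_range_query_url(base_url, promql, start, end, step_seconds):
    """Build a URL for the Prometheus /api/v1/query_range endpoint."""
    params = urlencode({
        "query": promql,             # PromQL expression (placeholder)
        "start": start.timestamp(),  # Unix timestamps are accepted by the API
        "end": end.timestamp(),
        "step": step_seconds,        # resolution step in seconds
    })
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def parse_matrix(payload):
    """Convert a query_range response into {labels: [(datetime, float), ...]}."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    series = {}
    for item in payload["data"]["result"]:
        # Use the sorted label set as a readable key for each time series.
        key = ",".join(f"{k}={v}" for k, v in sorted(item["metric"].items()))
        series[key] = [
            (datetime.fromtimestamp(float(ts), tz=timezone.utc), float(val))
            for ts, val in item["values"]
        ]
    return series


def fetch_range(base_url, promql, hours_back=6, step_seconds=60):
    """Query the last `hours_back` hours; base_url is an assumed instance address."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours_back)
    url = build_range_query_url(base_url, promql, start, end, step_seconds)
    with urlopen(url) as resp:
        return parse_matrix(json.load(resp))
```

A call such as `fetch_range("http://localhost:9090", "up", hours_back=24)` would return one list of timestamped samples per series. The address and the `up` expression are only illustrative. A Grafana panel pointed at the same instance would issue equivalent range queries.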
Steps by CU to investigate their cluster (accomplished in following epic):
1. Install Grafana
2. Install our dashboard from Grafana's dashboard "store"
Why is this important?
- We need a systematic method for determining OVN-K health. First, this aids engineering and support in troubleshooting issues. Second, it lets customers better understand the state of their cluster and avoid potential issues.
Scenarios
Indicators of sub-par OVN health to inform proactive management.
Acceptance Criteria
- CI - MUST be running successfully with tests automated
- Release Technical Enablement - Provide necessary release enablement details and documents.
- ...
Dependencies (internal and external)
- ...
Previous Work (Optional):
POC in 4.13:
Open questions:
Done Checklist
- CI - CI is running, tests are automated and merged.
- Release Enablement <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
- DEV - Downstream build attached to advisory: <link to errata>
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Downstream documentation merged: <link to meaningful PR>