Type: Epic
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: None
Component/s: OVN Kubernetes, SDN Core
Labels:
- migrated-from-sdn

Epic Name:
OVN health monitoring using Grafana
Epic Status:
To Do
Activity Type:
None
Hierarchy Progress Bar:

0% To Do, 0% In Progress, 100% Done
Blocked:
False
Blocked Reason:
None
Ready:
False
Color Status:
Not Selected
Size:
None

Target Version:
None
Release Blocker:
None

WSJF:
0

Epic Goal

Customers need a proactive way to view the current state of OVN. They would like indicators of whether the network overlay is healthy or unhealthy, based on signals from attributes that can be inspected in a lightweight manner and exported to Prometheus.

The metrics and alerts needed to support this goal are unknown and untested, so we are going to "dogfood" a new Grafana dashboard with support and engineering in order to get feedback and iterate. Following that feedback, an epic in a later release will cover delivery to customers (CUs).

Steps by support/engineering to investigate a CU's cluster:

1. Gather a slice of the Prometheus DBs from now to some time in the past.
   - This can be accomplished by must-gather or a separate tool in the network-tooling repo.
2. Upload this data to a Prometheus instance and show the current and past functional and performance health of ovn-k.
   - We need to find out how we can have a long-running Prometheus instance that support and engineering can consume.
3. Determine what other data can be pulled from a must-gather into the Grafana dashboard to be displayed alongside the Prometheus metrics.
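
To make "a slice from now to some time in the past" concrete, here is a minimal sketch of how such a time window could be expressed as a range query against the Prometheus HTTP API (`/api/v1/query_range`). This is an illustration only: the epic leaves the gathering mechanism to must-gather or a tool in the network-tooling repo, and the helper name `build_query_range_url`, the endpoint `http://prometheus.example:9090`, and the metric name `example_ovnk_metric` are hypothetical placeholders, not names taken from this epic.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def build_query_range_url(base_url, promql, end, lookback, step_seconds=60):
    """Build a Prometheus range-query URL covering [end - lookback, end].

    Hypothetical helper for illustration; it only builds the URL and
    does not contact any server.
    """
    start = end - lookback
    params = {
        "query": promql,
        # Prometheus accepts Unix timestamps (seconds) for start and end.
        "start": f"{start.timestamp():.0f}",
        "end": f"{end.timestamp():.0f}",
        # Resolution of the returned series, as a duration string.
        "step": f"{step_seconds}s",
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder endpoint and metric name (assumptions, not from the epic).
    now = datetime.now(timezone.utc)
    print(
        build_query_range_url(
            "http://prometheus.example:9090",
            "example_ovnk_metric",
            end=now,
            lookback=timedelta(hours=6),
        )
    )
```

The same start/end window could equally drive whichever collection tool is eventually chosen; the point of the sketch is only that the slice is fully described by an end time, a lookback duration, and a step.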

Steps by CU to investigate their cluster (to be accomplished in a following epic):
1. Install Grafana
2. Install our dashboard from Grafana's dashboard "store"

Why is this important?

We need a systematic method for determining OVN-K health. First, this will aid engineering and support in troubleshooting issues. Second, it will help customers better understand the state of their cluster and avoid potential issues.

Scenarios

Indicators of sub-par OVN health to inform proactive management.

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

POC in 4.13:

Open questions:

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Assignee: Unassigned

Reporter: Martin Kennelly

Need Info From: None

Contributors: None

QA Contact: None

Doc Contact: None

Votes: 0

Watchers: 1

Created: 2023/03/20 1:24 PM

Updated: 2025/06/12 7:45 AM
