Type: Epic
Resolution: Done
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Component/s: SNO
Labels:
- 4.17-candidate
- no-qe

Epic Name:
Reproducible Measurement Methods and KPIs for OpenShift control-plane components
Epic Status:
Done
Activity Type:
Future Sustainability
Hierarchy Progress Bar:

0% To Do, 0% In Progress, 100% Done
Blocked:
False
Blocked Reason:
None
Ready:
False
Color Status:
Not Selected
Size:
M

Target Version:

openshift-4.18
Release Blocker:
None

Story Points:
18

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

Epic Goal

Create an easy-to-use, easy-to-interpret metrics collection that efficiently gathers performance data for the control plane of an OpenShift cluster
The tooling should incorporate CPU metrics covering running speed and utilization, both at peak and over time
- The CPU metrics need to be aggregated separately for the non-control-plane (isolated) and control-plane (management) coresets if a Performance Profile is found
- The CPU metrics need to clearly indicate which components are causing CPU / latency spikes
The tooling should incorporate memory metrics covering utilization, both at peak and over time
- The memory metrics need to be aggregated separately for the non-control-plane (isolated) and control-plane (management) coresets if a Performance Profile is found
- The memory metrics need to clearly indicate which components are causing allocation / heap spikes
The tooling should incorporate per-component metrics for the control plane
- Each valid control-plane component should be measurable individually
The tooling should generate useful data from existing collection agents (leveraging Prometheus data sources wherever possible)
The data is ideally available as a cluster dashboard in the OpenShift console, but another intermediate human-readable format is also fine
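
The coreset aggregation described in the goal could be sketched roughly as follows. This is a minimal illustration, not the epic's tooling: it assumes the Performance Profile provides its reserved (management) and isolated CPU sets as Linux cpuset strings, and that a per-CPU busy fraction has already been derived from a Prometheus source. The function names and the sample values are hypothetical.

```python
from statistics import mean


def parse_cpuset(spec: str) -> set[int]:
    """Expand a Linux cpuset string such as "0-1,32-33" into CPU indices."""
    cpus: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def aggregate_by_coreset(per_cpu_busy: dict[int, float], reserved: str, isolated: str) -> dict:
    """Return mean and peak busy fraction for the management and isolated coresets."""
    coresets = {
        "management": parse_cpuset(reserved),
        "isolated": parse_cpuset(isolated),
    }
    summary = {}
    for name, cpus in coresets.items():
        # Ignore CPUs that have no sample instead of failing on them.
        values = [per_cpu_busy[cpu] for cpu in sorted(cpus) if cpu in per_cpu_busy]
        summary[name] = {"mean": mean(values), "max": max(values)} if values else None
    return summary


# Hypothetical sample: busy fraction per logical CPU at one point in time.
sample = {0: 0.80, 1: 0.60, 2: 0.10, 3: 0.30}
print(aggregate_by_coreset(sample, reserved="0-1", isolated="2-3"))
```

Running the same aggregation over a series of samples would give the "over time" view, and taking the maximum across those samples would give the peak.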

Why is this important?

The 1C target (control-plane components consuming at most 1 physical core, i.e. 2 virtual threads with hyperthreading) will likely have to be tracked over multiple releases. We need a reproducible environment together with a set of measurements that can be retriggered for every release to continuously drive analysis and reporting for the Initiative
When we plan our next moves for the Initiative of coming closer to 1C, we need numbers that clearly show whether our work is succeeding and give us insight into what we need to tackle next
We need to continuously verify that we observe no regressions against the 1C target.
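
As an illustration of how the 1C target and the regression tracking could be expressed in numbers, here is a small sketch. It assumes that usage is measured in logical-CPU cores (the unit that a rate over CPU seconds yields) and that 1C therefore corresponds to a budget of 2.0 logical CPUs. The component names, values, and the 0.05-core tolerance are made up for the example.

```python
# Assumption: 1C = 1 physical core = 2 logical CPUs with hyperthreading.
BUDGET_LOGICAL_CPUS = 2.0


def check_1c(usage_by_component: dict[str, float], budget: float = BUDGET_LOGICAL_CPUS) -> dict:
    """Sum per-component CPU usage and compare it against the 1C budget."""
    total = sum(usage_by_component.values())
    ranked = sorted(usage_by_component.items(), key=lambda item: item[1], reverse=True)
    return {
        "total": total,
        "within_budget": total <= budget,
        "top_consumers": ranked[:3],  # candidates to tackle next
    }


def regressions(previous: dict[str, float], current: dict[str, float], tolerance: float = 0.05) -> dict:
    """Return components whose usage grew by more than `tolerance` cores between releases."""
    grown = {}
    for name, value in current.items():
        delta = value - previous.get(name, 0.0)
        if delta > tolerance:
            grown[name] = delta
    return grown


# Hypothetical per-component usage (logical-CPU cores) for two releases.
previous = {"kube-apiserver": 1.0, "etcd": 0.5, "kubelet": 0.25}
current = {"kube-apiserver": 1.5, "etcd": 0.5, "kubelet": 0.25}

print(check_1c(current))
print(regressions(previous, current))
```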

Scenarios

For further guidelines or existing telco-oriented measurements, please refer to the Test Plan at https://docs.google.com/document/d/1wSbxdFpmqqwDSHDjLmdCuQy1DjWGuQaMuO0Gp2SJQl0/edit?usp=sharing

Acceptance Criteria

Environment KPIs and Guide must be created and sustainably maintained in an Edge Enablement knowledge source that is shareable with other OCP teams
Release Technical Enablement - Documentation for the included KPIs / metrics must be present, and an upskilling session on them must have taken place in Edge Enablement
The measurements should, as far as possible, be based on Prometheus metric reports, ideally accompanied by a custom process metrics recorder such as https://github.com/Avielyo10/prom/, to make later integration easier
The metrics / KPIs should follow best practices and need to be explained, so that it is easier to justify them when they are used as action indicators for other core teams in OpenShift.
The metrics / KPIs should be collectable in one sweep of Prometheus so that we can integrate them into scripts and CI later

Dependencies (internal and external)

Open questions::

Which existing Perf&Scale tooling can be run as is and might solve parts of this epic?

Done Checklist

DEV - Environment KPIs and Guide: <link to meaningful PR or GitHub Issue>
DEV - Measurement Methodology / Test Plan: <link to meaningful PR or GitHub Issue>
DEV - Edge Enablement Upskilling: <link to Presentation / Recording>

is depended on by

OCPEDGE-795 Confirmation and Prioritization of Resource Saving Potentials in SNO vDUs

Closed

OCPEDGE-1006 Create Measurement Profile for vDU Metal Control Plane Usage Tests

Closed

Assignee:: Egli Hila

Reporter:: Jakob Moeller (Inactive)

QA Contact:: Pedro Jose Amoedo Martinez

Doc Contact:: Unassigned

Votes:: 0

Watchers:: 3

Created:: 2024/01/12 3:18 PM

Updated:: 2025/09/16 11:23 AM

Resolved:: 2024/09/23 4:54 AM
