-
Epic
-
Resolution: Done
-
Normal
-
Reproducible Measurement Methods and KPIs for OpenShift control-plane components
-
Future Sustainability
-
0% To Do, 0% In Progress, 100% Done
OCP/Telco Definition of Done
Epic Goal
- Create an easy-to-use, easy-to-interpret metrics collection that efficiently gathers performance data for the control plane of an OpenShift cluster
- The tooling should incorporate CPU metrics: running speed, peak utilization, and utilization over time
- The CPU metrics need to be aggregated separately for the non-control-plane (isolated) and control-plane (management) core sets if a Performance Profile is found
- The CPU metrics need to clearly indicate which components are causing CPU / latency spikes
- The tooling should incorporate memory metrics: peak utilization and utilization over time
- The memory metrics need to be aggregated separately for the non-control-plane (isolated) and control-plane (management) core sets if a Performance Profile is found
- The memory metrics need to clearly indicate which components are causing allocation / heap spikes
- The tooling should incorporate per-component metrics for the control plane
- Each valid control-plane component should be measurable individually
- The tooling should generate useful data from existing collection agents (leveraging Prometheus data sources wherever possible)
- Ideally the data is available as a cluster dashboard in the OpenShift console, but another intermediary human-readable format is also fine
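The core-set aggregation described above could be sketched roughly as follows. This is only an illustration, not the epic's agreed implementation: it assumes the isolated and management core sets are available as Linux cpuset strings (for example from the `spec.cpu.isolated` and `spec.cpu.reserved` fields of a Performance Profile) and that per-CPU utilization has already been collected as a fraction between 0 and 1. The function names `parse_cpuset` and `aggregate_by_coreset` are hypothetical.

```python
from statistics import mean


def parse_cpuset(spec: str) -> set[int]:
    """Expand a cpuset list such as "0-1,32-33" into a set of CPU ids."""
    cpus: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def aggregate_by_coreset(per_cpu_util: dict[int, float],
                         reserved: str,
                         isolated: str) -> dict[str, dict[str, float]]:
    """Summarize per-CPU utilization for the management and isolated core sets.

    `reserved` and `isolated` are cpuset strings (assumed to come from the
    Performance Profile). CPUs without a sample are skipped.
    """
    groups = {
        "management": parse_cpuset(reserved),
        "isolated": parse_cpuset(isolated),
    }
    summary: dict[str, dict[str, float]] = {}
    for name, cpus in groups.items():
        samples = [per_cpu_util[c] for c in sorted(cpus) if c in per_cpu_util]
        summary[name] = {
            "mean": mean(samples) if samples else 0.0,
            "peak": max(samples, default=0.0),
        }
    return summary


# Illustrative values only, not measured data.
example = aggregate_by_coreset({0: 0.5, 1: 0.7, 2: 0.1, 3: 0.3},
                               reserved="0-1", isolated="2-3")
```

If no Performance Profile is found, the same summary could be computed once over all CPUs instead of per core set.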
Why is this important?
- The 1C target (the control-plane components consuming at most 1 physical core, i.e. 2 virtual threads with hyperthreading) will likely have to be tracked over multiple releases. We need a reproducible environment and a set of measurements that can be retriggered for every release to continuously drive analysis and reporting for the Initiative
- When we plan our next moves toward 1C, we need numbers that clearly show whether our work is succeeding and give us insight into what we need to tackle next
- We need to continuously verify that there are no regressions against the 1C target.
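As a rough illustration of how the 1C target and the per-release regression check might be expressed, the sketch below treats the budget as 2.0 logical CPUs when hyperthreading is enabled (1 physical core = 2 virtual threads) and 1.0 otherwise. It assumes per-component CPU usage is already available in logical CPUs (CPU-seconds per second). The function names, the 5% tolerance, and all numbers are hypothetical and only serve the example.

```python
def check_1c_budget(usage_by_component: dict[str, float],
                    hyperthreading: bool = True) -> dict:
    """Compare summed control-plane CPU usage against the 1C budget.

    Usage values are assumed to be in logical CPUs. With hyperthreading,
    one physical core corresponds to a budget of 2.0 logical CPUs.
    """
    budget = 2.0 if hyperthreading else 1.0
    total = sum(usage_by_component.values())
    # Largest consumers first, to show what to tackle next.
    ranked = sorted(usage_by_component.items(),
                    key=lambda item: item[1], reverse=True)
    return {
        "budget": budget,
        "total": total,
        "within_budget": total <= budget,
        "ranked": ranked,
    }


def find_regressions(baseline: dict[str, float],
                     current: dict[str, float],
                     tolerance: float = 0.05) -> dict[str, float]:
    """Return components whose usage grew by more than `tolerance` (relative)."""
    regressions: dict[str, float] = {}
    for name, now in current.items():
        before = baseline.get(name)
        if not before:  # new component or zero baseline: no ratio to compute
            continue
        growth = (now - before) / before
        if growth > tolerance:
            regressions[name] = growth
    return regressions


# Made-up numbers for two consecutive releases.
previous_release = {"etcd": 0.40, "kube-apiserver": 0.90}
this_release = {"etcd": 0.50, "kube-apiserver": 0.90}
report = check_1c_budget(this_release)
regressed = find_regressions(previous_release, this_release)
```

Ranking the components by usage is one way to get the "what to tackle next" insight from the same data that drives the pass/fail check.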
Scenarios
For further guidelines or existing telco-oriented measurements, please refer to the Test Plan at https://docs.google.com/document/d/1wSbxdFpmqqwDSHDjLmdCuQy1DjWGuQaMuO0Gp2SJQl0/edit?usp=sharing
Acceptance Criteria
- Environment KPIs and Guide must be created and sustainably maintained in an Edge Enablement knowledge source that is shareable with other OCP teams
- Release Technical Enablement - Documentation for the included KPIs / metrics must be present, and an upskilling session on them must have taken place in Edge Enablement
- The measurements should be based on Prometheus metric reports as much as possible, ideally accompanied by a custom process metrics recorder such as https://github.com/Avielyo10/prom/, to make later integration easier
- The metrics / KPIs should follow best practices and need to be explained, so that it is easier to justify them when they are used as action indicators for other core teams in OpenShift.
- The metrics / KPIs should be collectable in one sweep of Prometheus so that we can integrate them into scripts and CI later
Dependencies (internal and external)
- N/A
Open questions:
- What existing Perf&Scale tooling can be run as is and might solve parts of this epic?
Done Checklist
- DEV - Environment KPIs and Guide: <link to meaningful PR or GitHub Issue>
- DEV - Measurement Methodology / Test Plan: <link to meaningful PR or GitHub Issue>
- DEV - Edge Enablement Upskilling: <link to Presentation / Recording>
- is depended on by
-
OCPEDGE-795 Confirmation and Prioritization of Resource Saving Potentials in SNO vDUs
-
- Closed
-
-
OCPEDGE-1006 Create Measurement Profile for vDU Metal Control Plane Usage Tests
-
- Closed
-