Epic Name:
[Investigation] ability to tune what is collected by the in-cluster monitoring stack
Blocked:
False
Ready:
False
Docs QE Status:
NEW
Epic Status:
To Do
Flagged:
Impediment
QE Status:
NEW

Epic Goal

Investigate options to reduce the number of metrics collected by the platform monitoring stack.

Why is this important?

The monitoring stack is one of the top contributors to CPU and RAM consumption.
Scaling the stack can be challenging for users (e.g. Prometheus scales only vertically).
A significant fraction of the collected data isn't actively used (e.g. metrics that aren't leveraged by alerting/recording rules, telemetry, dashboards).
Customers that run many clusters (telco edge clusters for instance) want to collect and forward operational metrics to a central location. Given bandwidth constraints, they don't need to collect everything locally.
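
As a rough illustration of the "collected but not actively used" point above, the following Python sketch estimates what fraction of active series belongs to metrics that no rule or dashboard expression references. It is a minimal sketch under stated assumptions: the function names, the per-metric series counts and the sample expressions are all hypothetical, and the token matching is deliberately naive (it treats any identifier in an expression as a potential metric name, so it can over-report usage when a label value or label name happens to equal a metric name).

```python
import re

# Matches PromQL-style identifiers (metric names, functions, labels, keywords).
IDENTIFIER = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def used_metrics(expressions, collected):
    """Return the collected metric names that appear in any expression."""
    used = set()
    for expr in expressions:
        used.update(tok for tok in IDENTIFIER.findall(expr) if tok in collected)
    return used


def unused_series_fraction(series_per_metric, expressions):
    """Fraction of active series whose metric no expression references."""
    total = sum(series_per_metric.values())
    if total == 0:
        return 0.0
    used = used_metrics(expressions, series_per_metric.keys())
    unused = sum(n for name, n in series_per_metric.items() if name not in used)
    return unused / total


# Hypothetical inputs, for illustration only (not measured on a real cluster).
series_per_metric = {
    "up": 100,
    "container_cpu_usage_seconds_total": 400,
    "apiserver_request_duration_seconds_bucket": 500,
}
expressions = [
    "up == 0",
    "sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)",
]
fraction = unused_series_fraction(series_per_metric, expressions)
```

With real inputs, the series counts and expressions would have to come from the cluster's own Prometheus data, rules, dashboards and telemetry configuration; how to extract them is outside the scope of this sketch.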

As an OpenShift monitoring developer, I want to quantify how much CPU/RAM would be saved when the monitoring stack collects only the metrics that have an operational interest (e.g. metrics used for alerting, telemetry and dashboarding).
As an OpenShift monitoring developer, I need to

Document detailing the potential resource savings.
Design document explaining how this could be implemented in practice (probably an OpenShift enhancement proposal).
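
To make the design question more concrete, here is one conceivable mechanism sketched in Python. It is an assumption, not something this epic has settled on: it builds a Prometheus `metric_relabel_configs` entry with `action: keep` on the `__name__` label, so that only an allow-list of metric names would be ingested. The `keep_rule` helper and the sample metric names are hypothetical; the actual design may look quite different.

```python
import re

# Valid Prometheus metric names; used to reject input that would corrupt the regex.
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def keep_rule(metric_names):
    """Build a relabeling rule (as a dict) that keeps only the given metrics."""
    names = sorted(set(metric_names))
    if not names:
        # An empty keep-list would drop every sample, so refuse to build it.
        raise ValueError("refusing to build an empty keep-list")
    for name in names:
        if not METRIC_NAME.fullmatch(name):
            raise ValueError(f"invalid metric name: {name!r}")
    return {
        "source_labels": ["__name__"],
        # Prometheus anchors relabeling regexes, so a plain alternation suffices.
        "regex": "|".join(names),
        "action": "keep",
    }


# Hypothetical allow-list, for illustration only.
rule = keep_rule(["up", "container_cpu_usage_seconds_total"])
```

The resulting dict could be serialized to YAML and placed under a scrape job's `metric_relabel_configs`. Note that dropping samples at this stage happens after the scrape, so it reduces storage and memory pressure but not the cost of the scrape itself.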

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

is related to

OBSDA-211 Implement scrape profiles