- Epic
- Resolution: Done
- Major
- [Investigation] Ability to tune what is collected by the in-cluster monitoring stack
- NEW
- To Do
- Impediment
- NEW
Epic Goal
- Investigate options to reduce the number of metrics collected by the platform monitoring stack.
Why is this important?
- The monitoring stack is one of the top contributors when it comes to CPU and RAM consumption.
- It can be a challenge for users to scale the stack (e.g. Prometheus can only scale vertically).
- A significant fraction of the collected data isn't actively used (e.g. metrics that aren't leveraged by alerting/recording rules, telemetry, dashboards).
- Customers that run many clusters (telco edge clusters for instance) want to collect and forward operational metrics to a central location. Given bandwidth constraints, they don't need to collect everything locally.
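To make the point about unused data concrete, here is a minimal sketch of how the share of "unused" series could be estimated. It assumes two inputs that this epic does not define: a per-metric count of active series and the set of metric names referenced by alerting/recording rules, telemetry and dashboards. The function name, metric names and numbers are hypothetical, and series count is only a rough proxy for CPU/RAM, so real savings would still have to be measured.

```python
from typing import Dict, Iterable, Set


def unused_series_fraction(
    series_per_metric: Dict[str, int], used_metrics: Iterable[str]
) -> float:
    """Return the fraction of active series that belong to metrics nobody consumes."""
    used: Set[str] = set(used_metrics)
    total = sum(series_per_metric.values())
    if total == 0:
        # Nothing is collected, so nothing can be saved.
        return 0.0
    unused = sum(
        count for name, count in series_per_metric.items() if name not in used
    )
    return unused / total


# Hypothetical inputs, for illustration only.
series_per_metric = {
    "metric_a": 600,  # referenced by an alerting rule
    "metric_b": 300,  # not referenced anywhere
    "metric_c": 100,  # referenced by a dashboard
}
used_metrics = {"metric_a", "metric_c"}

fraction = unused_series_fraction(series_per_metric, used_metrics)
print(f"{fraction:.0%} of series could be dropped")  # prints "30% of series could be dropped"
```

Weighting by series count rather than by metric count is a deliberate choice here: a single high-cardinality metric can outweigh many small ones.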
Scenarios
- As an OpenShift monitoring developer, I want to quantify how much CPU/RAM would be saved if the monitoring stack collected only the metrics that are of operational interest (e.g. metrics used for alerting, telemetry and dashboarding).
- As an OpenShift monitoring developer, I need to
Acceptance Criteria
- Document detailing the potential resource savings.
- Design document explaining how it could be implemented in practice (probably an OpenShift enhancement proposal).
Dependencies (internal and external)
- ...
Previous Work (Optional):
- MON-1671 (investigate dropped metrics & resource savings)
- MON-1672 (investigate lower res metrics & resource savings)
Open questions:
- …
Done Checklist
- CI - CI is running, tests are automated and merged.
- Release Enablement <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
- DEV - Downstream build attached to advisory: <link to errata>
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Downstream documentation merged: <link to meaningful PR>
- is related to: OBSDA-211 Implement scrape profiles (In Progress)