Epic: Alerting of node resources
Resolution: Done
0% To Do, 0% In Progress, 100% Done
Epic Goal
The overall goal of this EPIC is to enhance node-related metrics and alerts so that customers get an earlier indication that the stability of a node is compromised.
Why is this important?
Customers have encountered node instabilities in past OpenShift releases. In those cases, they could act only when a node might already have been unable to schedule workloads. We have to enhance metrics and alerting to reduce the time between a failure and its detection.
User stories:
See issues in this EPIC.
Acceptance Criteria:
All stories in this EPIC fulfill their DoD.
Dependencies (internal and external)
The monitoring team may provide information required to complete the stories.
Previous Work (Optional):
See linked issues
Open questions:
Which kind of documentation do we have to provide to our customers?
Done Checklist:
- CI - CI is running, tests are automated and merged.
- Release Enablement <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
- DEV - Downstream build attached to advisory: <link to errata>
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Downstream documentation merged: <link to meaningful PR>
Follow-up Ideas:
- Associate processes with container IDs, possibly using eBPF.
- Create a Prometheus query to correlate memory-hungry pods with nodes going NotReady.
- RHCOS: kdump support supposedly in 4.9
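A minimal sketch of the correlation idea from the list above, not an agreed design. It assumes the two PromQL strings have already been evaluated elsewhere and only shows how the results could be joined. The metric names are the usual cAdvisor and kube-state-metrics ones, but the `node` label on the memory series, the threshold, and all Python names are assumptions for illustration.

```python
from typing import Dict, Iterable, List, NamedTuple

# Assumed PromQL, for reference only (not executed here).
# Assumption: the memory series carries a `node` label after relabeling.
NOT_READY_NODES_QUERY = (
    'kube_node_status_condition{condition="Ready",status="true"} == 0'
)
POD_MEMORY_QUERY = (
    'sum by (node, namespace, pod) '
    '(container_memory_working_set_bytes{container!=""})'
)


class PodMemory(NamedTuple):
    """One sample of per-pod working-set memory (hypothetical shape)."""
    node: str
    namespace: str
    pod: str
    working_set_bytes: float


def memory_hungry_pods_on_not_ready_nodes(
    pods: Iterable[PodMemory],
    node_ready: Dict[str, bool],
    threshold_bytes: float,
) -> List[PodMemory]:
    """Return pods at or above the threshold that run on nodes not Ready.

    Results are sorted by memory usage, largest first. Nodes missing
    from node_ready are treated as Ready and therefore ignored.
    """
    not_ready = {name for name, ready in node_ready.items() if not ready}
    hits = [
        p for p in pods
        if p.node in not_ready and p.working_set_bytes >= threshold_bytes
    ]
    return sorted(hits, key=lambda p: p.working_set_bytes, reverse=True)
```

Called with samples taken shortly before a node went NotReady, the function returns candidate pods to investigate. It shows co-occurrence only, not causation.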
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
- clones OCPNODE-520 Monitoring of node resources (Closed)