Type: Epic
Resolution: Done
Priority: Major
Fix Version/s: 2021Q2 Plan, openshift-4.9
Affects Version/s: None
Component/s: None
Labels:
- doc-ack
- observability
- px-ack
- qe-ack

Epic Name:
Monitoring of node resources
Epic Status:
Done
Activity Type:
Product / Portfolio Work
Parent Link:
OCPPLAN-7523 Node-Stabilization
Hierarchy Progress Bar:
0% To Do, 0% In Progress, 100% Done
Blocked:
False
Blocked Reason:
None
Ready:
False
Size:
None

Target Version:
None
Release Blocker:
None

Epic Goal

The overall goal of this epic is to enhance node-related metrics and alerts so that customers get an earlier indication that the stability of a node is compromised.

Why is this important?

Customers have encountered node instabilities in past OpenShift releases. As a result, they could take action only when a node might already have been unable to schedule workloads. We have to enhance metrics and alerting to reduce the time between a failure and its detection.

User stories:

See the issues in this epic.

Acceptance Criteria:

All stories of this epic fulfill their Definition of Done (DoD).

Dependencies (internal and external)

The monitoring team may provide information required to complete the stories.

Previous Work (Optional):

See linked issues

Open questions:

What kind of documentation do we have to provide to our customers?

Done Checklist:

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Follow-up Ideas:

Associate processes with container IDs, possibly using eBPF.
Create a Prometheus query to correlate memory-hungry pods with nodes going NotReady.
RHCOS: kdump support, supposedly in 4.9.
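
One possible shape for the Prometheus query idea above is sketched below. It is not the query produced by this epic. It only assembles a PromQL expression as text and assumes that the cAdvisor metric `container_memory_working_set_bytes` and the kube-state-metrics series `kube_pod_info` and `kube_node_status_condition` are being scraped. The function name and the `top_n` parameter are hypothetical. Container metrics may stop arriving once a node is NotReady, so a real query would probably need to look back over a time window.

```python
def memory_hungry_pods_on_not_ready_nodes(top_n: int = 10) -> str:
    """Return PromQL text for the top-N pods by memory on nodes that are not Ready.

    Hypothetical helper: it only builds the query string and does not contact Prometheus.
    """
    if top_n < 1:
        raise ValueError("top_n must be at least 1")

    # Working-set memory per pod, with the node name joined in from kube_pod_info.
    pod_memory = (
        "sum by (namespace, pod, node) ("
        'container_memory_working_set_bytes{container!=""}'
        " * on (namespace, pod) group_left (node) kube_pod_info"
        ")"
    )

    # Nodes whose Ready condition is currently not "true".
    not_ready = 'kube_node_status_condition{condition="Ready", status="true"} == 0'

    # "and on (node)" keeps the memory values and filters them to not-ready nodes.
    return f"topk({top_n}, {pod_memory} and on (node) ({not_ready}))"


if __name__ == "__main__":
    print(memory_hungry_pods_on_not_ready_nodes(5))
```

The `and` set operator is used instead of multiplication so that the pod memory values are preserved; multiplying by a condition series whose value is 0 would zero them out.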

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

is cloned by

OCPNODE-639 Alerting of node resources

Closed

links to

openshift/must-gather#239: Collecting the output of `oc adm top resource`

openshift/openshift-docs#36001: Node release notes for 4.9

Assignee: Sascha Grunert

Reporter: Gaurav Singh

Need Info From: None

Contributors: None

QA Contact: Sunil Choudhary

Doc Contact: None

Votes: 0

Watchers: 13

Created: 2021/03/10 2:30 PM

Updated: 2025/07/16 1:19 PM

Resolved: 2021/09/22 4:18 PM
