Type: Epic
Resolution: Done
Priority: Normal
Fix Version/s: openshift-4.10
Affects Version/s: None
Component/s: None
Labels:
- doc-ack
- observability
- px-ack
- qe-ack

Epic Name:
Alerting of node resources
Epic Status:
Done
Activity Type:
None
Hierarchy Progress Bar:

0% To Do, 0% In Progress, 100% Done
Blocked:
False
Blocked Reason:
None
Ready:
False
Size:
None

Target Version:
None
Release Blocker:
None

Epic Goal

The overall goal of this epic is to enhance node-related metrics and alerts so that customers get an earlier indication that the stability of a node is compromised.

Why is this important?

Customers have encountered node instability in past OpenShift releases. In those cases they could only take action once a node might already have been unable to schedule workloads. We have to enhance metrics and alerting to reduce the time between a failure and its detection.

User stories:

See the issues in this epic.

Acceptance Criteria:

All stories in this epic fulfill their Definition of Done (DoD).

Dependencies (internal and external)

The monitoring team may need to provide information required to complete the stories.

Previous Work (Optional):

See the linked issues.

Open questions:

What kind of documentation do we have to provide to our customers?

Done Checklist:

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Follow-up Ideas:

Associate processes with container IDs, possibly using eBPF.
Create a Prometheus query to correlate memory-hungry pods with nodes going NotReady.
RHCOS: kdump support, supposedly in 4.9.
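
To illustrate the correlation idea above (memory-hungry pods versus nodes going not ready), here is a minimal sketch in plain Python. It is not a Prometheus query and not part of this epic's deliverables. The PodSample type, the function name, the sample data, and the 4 GiB threshold are all hypothetical. It assumes per-pod memory readings and per-node readiness have already been collected by some other means.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodSample:
    """One memory reading for a pod (hypothetical shape, not a real API)."""
    node: str
    pod: str
    working_set_bytes: int


def memory_hungry_pods_on_unready_nodes(pods, node_ready, threshold_bytes):
    """Return pods at or above threshold_bytes that run on not-ready nodes.

    pods: iterable of PodSample
    node_ready: dict mapping node name -> bool (True means Ready)
    """
    suspects = [
        p for p in pods
        if p.working_set_bytes >= threshold_bytes
        # Assumption: a node with no readiness entry is treated as ready.
        and not node_ready.get(p.node, True)
    ]
    # Largest consumers first, so the most likely culprits are at the top.
    return sorted(suspects, key=lambda p: p.working_set_bytes, reverse=True)


GIB = 1024 ** 3
samples = [
    PodSample("node-a", "db-0", 6 * GIB),
    PodSample("node-a", "web-1", 1 * GIB),
    PodSample("node-b", "cache-0", 7 * GIB),
]
ready = {"node-a": False, "node-b": True}
result = memory_hungry_pods_on_unready_nodes(samples, ready, 4 * GIB)
print([p.pod for p in result])
```

A real implementation would more likely express this join directly in a Prometheus query over the cluster's pod memory and node condition metrics. The sketch only shows the shape of the correlation.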

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

clones

OCPNODE-520 Monitoring of node resources

Closed

Assignee: Sascha Grunert

Reporter: Gaurav Singh

Need Info From: None

Contributors: None

QA Contact: Pravin Mali (Inactive)

Doc Contact: None

Votes: 0

Watchers: 6

Created: 2021/07/23 2:23 PM

Updated: 2022/08/26 2:26 PM

Resolved: 2022/01/18 6:44 PM
