  1. OpenShift Node
  2. OCPNODE-639

Alerting of node resources

    • Done
    • 0% To Do, 0% In Progress, 100% Done

      Epic Goal

      The goal of this epic is to enhance node-related metrics and alerts so that customers get an earlier indication that the stability of a node is compromised.

      Why is this important?

      Customers have encountered node instability in past OpenShift releases. In those cases they could take action only when the node may already have been unable to schedule workloads. We need to enhance metrics and alerting to reduce the time between a failure and its detection.

      User stories:

      See issues in this EPIC.

      Acceptance Criteria:

      All stories in this epic fulfill their Definition of Done (DoD).

      Dependencies (internal and external)

      Completing the stories may require information from the monitoring team.

      Previous Work (Optional):

      See linked issues

      Open questions:

      Which kind of documentation do we have to provide to our customers?

      Done Checklist:

      • CI - CI is running, tests are automated and merged.
      • Release Enablement <link to Feature Enablement Presentation>
      • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
      • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
      • DEV - Downstream build attached to advisory: <link to errata>
      • QE - Test plans in Polarion: <link or reference to Polarion>
      • QE - Automated tests merged: <link or reference to automated tests>
      • DOC - Downstream documentation merged: <link to meaningful PR>

      Follow-up Ideas:

      • Associate processes with container IDs, possibly using eBPF.
      • Create a Prometheus query to correlate memory-hungry pods with nodes going NotReady.
      • RHCOS: kdump support supposedly in 4.9
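
      The correlation idea in the list above could be prototyped offline before committing to a specific Prometheus query. The following is a minimal sketch under stated assumptions, not part of this epic: it takes samples that have already been fetched and, for each NotReady event, lists the pods on that node whose memory reached a threshold shortly before the event. All names (MemorySample, NotReadyEvent, suspects_before_not_ready) and the threshold and window parameters are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemorySample:
    """One memory reading for a pod (hypothetical shape, not a Prometheus type)."""
    timestamp: float          # seconds since epoch
    node: str
    pod: str
    working_set_bytes: int


@dataclass(frozen=True)
class NotReadyEvent:
    """The moment a node was first observed as NotReady (hypothetical shape)."""
    timestamp: float
    node: str


def suspects_before_not_ready(samples, events, threshold_bytes, window_seconds):
    """For each NotReady event, return the pods on that node whose memory
    reached threshold_bytes within window_seconds before the event,
    sorted by peak usage, heaviest first."""
    result = {}
    for event in events:
        window_start = event.timestamp - window_seconds
        peak_by_pod = {}
        for sample in samples:
            same_node = sample.node == event.node
            in_window = window_start <= sample.timestamp <= event.timestamp
            if same_node and in_window and sample.working_set_bytes >= threshold_bytes:
                previous_peak = peak_by_pod.get(sample.pod, 0)
                peak_by_pod[sample.pod] = max(previous_peak, sample.working_set_bytes)
        result[(event.node, event.timestamp)] = sorted(
            peak_by_pod.items(), key=lambda item: item[1], reverse=True
        )
    return result
```

      Keeping the window and threshold as parameters makes it easy to try different values against historical data before encoding them in a query or alert rule.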

      OCP/Telco Definition of Done
      Epic Template descriptions and documentation.

              sgrunert@redhat.com Sascha Grunert
              gausingh@redhat.com Gaurav Singh
              Pravin Mali (Inactive)