  1. OpenShift Monitoring
  2. MON-3800

Alert for broken Prometheus Kube Service Discovery

    • Type: Epic
    • Resolution: Unresolved
    • prometheus broken Kube SD
    • NEW
    • To Do
    • MON-3156 Upstream improvements
    • NEW
    • 0% To Do, 17% In Progress, 83% Done
    • M

      OCP/Telco Definition of Done
      Epic Template descriptions and documentation.

      Epic Goal

      • Warn users about Prometheus <-> Kubernetes API failures (unreachable API, permission issues, etc.) that can lead to silent Service Discovery failures.
      • Add an alert based on the metric newly added in https://github.com/prometheus/prometheus/pull/13554, which keeps track of these failures.
      • Possibly add a runbook explaining the main causes of the failures and how to fix them.
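
      As a starting point for discussion, the alert could look roughly like the rule below. This is only a sketch: it assumes the counter added by the PR above is exposed as prometheus_sd_kubernetes_failures_total, and the group name, alert name, severity, and the 5m/15m windows are placeholders rather than agreed values.

```yaml
groups:
  - name: prometheus-kube-sd            # hypothetical group name
    rules:
      - alert: PrometheusKubernetesListWatchFailures   # assumed alert name
        # Fires when the failure counter grew during the last 5 minutes.
        # The metric name is an assumption based on the upstream PR.
        expr: increase(prometheus_sd_kubernetes_failures_total[5m]) > 0
        # Require the condition to hold for a while, so a short API
        # hiccup does not page anyone (placeholder duration).
        for: 15m
        labels:
          severity: warning             # placeholder severity
        annotations:
          summary: Kubernetes service discovery requests are failing.
          description: >-
            Prometheus instance {{ $labels.instance }} is seeing failed
            LIST/WATCH requests to the Kubernetes API. Service discovery
            may be silently stale or incomplete.
```

      The "for" clause is what helps keep false positives low: a single failed request that recovers on retry should not fire the alert, whereas a persistent permission or connectivity problem keeps the counter increasing and does.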

      Why is this important?

      • Today these warnings surface only in the Prometheus logs, which are easily missed and not regularly checked.
      • Even when users do see the logs, they often do not know how to fix the problem; a runbook would help.

      Scenarios

      1. I wanted Prometheus to scrape new targets, but I didn't give it the needed permissions (there are many Slack threads about this; search for "Failed to watch" in:#forum-openshift-monitoring).
      2. I misconfigured the Kubernetes SD.
      3. Prometheus cannot reach the Kubernetes API because of DNS changes or a connectivity/network issue.
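
      For scenario 1, a runbook entry could show the kind of RBAC objects that are typically missing. The sketch below is illustrative only: the namespace "my-app" and the object name "prometheus-sd" are hypothetical, and the ServiceAccount (prometheus-k8s in openshift-monitoring) is an assumption that will differ depending on which Prometheus is doing the discovery.

```yaml
# Grants read access to the objects that Kubernetes SD lists and watches
# in one namespace. All names here are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prometheus-sd          # hypothetical name
  namespace: my-app            # hypothetical target namespace
rules:
  - apiGroups: [""]
    resources: ["services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prometheus-sd
  namespace: my-app
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-sd
subjects:
  - kind: ServiceAccount
    name: prometheus-k8s               # assumed ServiceAccount
    namespace: openshift-monitoring    # assumed namespace
```

      Under the same assumptions, a quick way to confirm the permission gap before and after applying this would be something like: oc auth can-i list endpoints -n my-app --as=system:serviceaccount:openshift-monitoring:prometheus-k8s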

      Acceptance Criteria

      • CI - MUST be running successfully with tests automated
      • Release Technical Enablement - Provide necessary release enablement details and documents.
      • An alert with minimal false positives.

      Dependencies (internal and external)

      1. ...

      Previous Work (Optional):

      Open questions:

      1. ...

      Done Checklist

      • CI - CI is running, tests are automated and merged.
      • Release Enablement <link to Feature Enablement Presentation>
      • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
      • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
      • DEV - Downstream build attached to advisory: <link to errata>
      • QE - Test plans in Polarion: <link or reference to Polarion>
      • QE - Automated tests merged: <link or reference to automated tests>
      • DOC - Downstream documentation merged: <link to meaningful PR>

              rh-ee-amrini Ayoub Mrini
              rh-ee-amrini Ayoub Mrini
              Tai Gao Tai Gao