Epic
Resolution: Unresolved
prometheus broken Kube SD
NEW
To Do
MON-3156 Upstream improvements
0% To Do, 17% In Progress, 83% Done
Epic Goal
- Warn users about failures between Prometheus and the Kubernetes API (unreachable API, permission issues, etc.), which can lead to silent Service Discovery failures.
- Add an alert based on the metric newly added in https://github.com/prometheus/prometheus/pull/13554, which keeps track of these failures.
- Possibly add a runbook explaining the main causes of these failures and how to fix them.
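To make the intended alert condition concrete, here is a minimal Python sketch of the evaluation logic: fire only when the failure counter keeps increasing across several consecutive windows, so that a single transient error does not page anyone. This is an illustration under assumptions, not the final rule. The metric is assumed to be a monotonically increasing counter (believed to be exposed as `prometheus_sd_kubernetes_failures_total`, to be confirmed against the linked PR), and the helper names, window size, and window count are hypothetical.

```python
from typing import List, Sequence, Tuple

# One scraped sample: (unix timestamp in seconds, counter value).
Sample = Tuple[float, float]


def counter_increase(samples: Sequence[Sample]) -> float:
    """Sum the increases between consecutive samples.

    A drop in value is treated as a counter reset (e.g. a Prometheus
    restart), in which case the new value counts as the increase.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def failing_windows(
    samples: Sequence[Sample], window_s: float, now: float, count: int
) -> List[bool]:
    """For each of the last `count` back-to-back windows ending at `now`,
    report whether the failure counter increased inside that window."""
    results = []
    for i in range(count):
        end = now - i * window_s
        start = end - window_s
        in_window = [s for s in samples if start <= s[0] <= end]
        results.append(counter_increase(in_window) > 0)
    return results


def should_alert(
    samples: Sequence[Sample], window_s: float, now: float, count: int = 3
) -> bool:
    """Fire only if failures were seen in every one of the last `count`
    windows (hypothetical defaults, chosen to limit false positives)."""
    return all(failing_windows(samples, window_s, now, count))


# Hypothetical data: one sample per minute over 15 minutes.
sustained = [(float(t), t / 60.0) for t in range(0, 901, 60)]
transient = [(float(t), 0.0 if t == 0 else 1.0) for t in range(0, 901, 60)]
```

With these hypothetical inputs, `should_alert(sustained, 300, 900)` fires because every 5-minute window saw new failures, while `should_alert(transient, 300, 900)` does not because the counter stopped increasing after the first minute. Requiring persistence across windows is one possible way to approach the "minimal false positives" criterion; the actual thresholds would need tuning against real clusters.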
Why is this important?
- Warnings are currently only available in logs, which are easily missed and not checked regularly.
- Even with logs, users sometimes don't know what they need to do; a runbook would help.
Scenarios
- I wanted Prometheus to scrape new targets but I didn't give it the needed permissions (there are many Slack threads about this; search for "Failed to watch" in #forum-openshift-monitoring).
- I misconfigured the Kube SD.
- Prometheus cannot reach the Kube API due to DNS changes or a connectivity/network issue.
Acceptance Criteria
- CI - MUST be running successfully with tests automated
- Release Technical Enablement - Provide necessary release enablement details and documents.
- An alert with minimal false positives.
Dependencies (internal and external)
- ...
Previous Work (Optional):
- …
Open questions:
- ...
Done Checklist
- CI - CI is running, tests are automated and merged.
- Release Enablement <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
- DEV - Downstream build attached to advisory: <link to errata>
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Downstream documentation merged: <link to meaningful PR>