Story
Resolution: Unresolved
OCPSTRAT-554 - Improving error handling, propagation, collection, and disambiguation for users
Comparing our Prometheus rules against the runbook entries shows that several alerts have no runbook. Using yq to list the alerts and checking them against what's in the runbook repo, I found that the following Prometheus rules are missing runbook entries:
- ExtremelyHighIndividualControlPlaneMemory
- HighOverallControlPlaneMemory
- KubeletHealthState
- MCDPivotError
- MCDRebootError
- SystemMemoryExceedsReservation
Done When:
- Each of the Prometheus rules listed above has an entry in the runbook repository.
Notes:
- I've excluded MCDDrainError (as MCCDrainError) because that effort is being tracked here: https://issues.redhat.com/browse/MCO-88.
- I've also excluded MCDRebootError since there is additional investigative work that needs to be done. This is being tracked in https://issues.redhat.com/browse/MCO-203.
- The yq command I used to generate the above list is: $ yq eval-all '.spec[][].rules[].alert' ./install/0000_90_machine-config-operator_01_prometheus-rules.yaml | sort | uniq
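The comparison against the runbook repo could be scripted roughly as follows. This is only a sketch: it assumes each runbook is a file named <AlertName>.md in a single directory, which may not match the actual layout of the runbook repo, and the function name missing_runbooks and the paths in the usage comment are hypothetical.

```shell
# Sketch: print alert names that have no matching runbook file.
# Assumption: a runbook for alert Foo lives at <runbook_dir>/Foo.md.
#
# Usage: missing_runbooks ALERTS_FILE RUNBOOK_DIR
#   ALERTS_FILE  one alert name per line (e.g. saved yq output)
#   RUNBOOK_DIR  directory holding the <AlertName>.md files (hypothetical layout)
missing_runbooks() {
  # De-duplicate the alert names, then test for a runbook file for each one.
  sort -u "$1" | while IFS= read -r alert; do
    # Skip blank lines.
    [ -n "$alert" ] || continue
    # Print the alert name only when no runbook file exists for it.
    [ -f "$2/$alert.md" ] || printf '%s\n' "$alert"
  done
}

# Example (paths are placeholders):
#   yq eval-all '.spec[][].rules[].alert' \
#     ./install/0000_90_machine-config-operator_01_prometheus-rules.yaml > alerts.txt
#   missing_runbooks alerts.txt path/to/runbook-dir
```

Checking for a file per alert keeps the script POSIX-compatible and avoids depending on how the runbook files are ordered; if the repo uses a different naming scheme, only the file test needs to change.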