-
Feature
-
Resolution: Unresolved
Goal: Review existing alerts, create new ones as needed, and write runbooks for all of them to help admins monitor and operate the infrastructure, specifically in the bare metal case.
Example:
Alert: Node is NotReady for 5min
Meaning: The status of the node is unknown. The workloads might or might not still be running.
Impact: The node no longer provides any compute resources; no workloads can run on it.
Diagnosis: Check `kubectl …` …
Mitigation: There can be several causes for NotReady nodes. Some common operations to restore health are:
* Reboot the node
* If available: automatically reprovision the node, or do so manually
Before performing destructive operations, you may also want to capture the relevant logs and events so that a post-mortem can be performed.
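To make the alert condition in the example concrete, here is a minimal Python sketch of the check "Node is NotReady for 5min". It assumes the input is a Node object in the JSON shape the Kubernetes API uses (`status.conditions` with a `Ready` condition and a `lastTransitionTime`). The function names and the threshold constant are hypothetical and only illustrate the logic; a real alert would normally be expressed as an alerting rule in the monitoring stack rather than as a script.

```python
from datetime import datetime, timedelta, timezone

# Assumption: mirrors the "for 5min" in the example alert above.
NOT_READY_THRESHOLD = timedelta(minutes=5)


def parse_k8s_time(value: str) -> datetime:
    """Parse a Kubernetes timestamp such as 2024-01-01T12:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def not_ready_duration(node: dict, now: datetime):
    """Return how long the node has been NotReady, or None if it is Ready."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            if cond.get("status") == "True":
                return None
            # Status is "False" or "Unknown": measure time since the transition.
            return now - parse_k8s_time(cond["lastTransitionTime"])
    # No Ready condition reported at all: treat the node as not ready indefinitely.
    return timedelta.max


def should_alert(node: dict, now: datetime) -> bool:
    """True when the node has been NotReady for at least the threshold."""
    duration = not_ready_duration(node, now)
    return duration is not None and duration >= NOT_READY_THRESHOLD
```

A runbook's Diagnosis step could use the same `Ready` condition fields (status, reason, last transition time) to tell a node that just flapped from one that has been down long enough to warrant a reboot or reprovisioning.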
More examples can be found at https://github.com/kubevirt/monitoring/blob/main/runbooks/VirtAPIDown.md
- relates to
-
MON-927 Improve our alerting rules to clear confusion to what they do, the impact, and the call to action
- Closed