
    • Icon: Feature Feature
    • Resolution: Unresolved

      Goal: Review the existing alerts, create new ones where needed, and write a runbook for each of them, so that admins can monitor and operate the infrastructure, specifically in the bare metal case.

      Example:

      Alert: Node is NotReady for 5min
      Meaning: The status of the node is unknown; its workloads may or may not still be running.
      Impact: The node no longer provides compute resources, so no workloads can be scheduled on it.
      Diagnosis: Check `kubectl …` …
      Mitigation:
      There can be several causes for NotReady nodes, some common operations to restore health are:
      
       * Reboot the node
       * Reprovision the node: automatically if that is available, otherwise manually
      
      Before performing destructive operations, capture the relevant logs and events so that a post-mortem is still possible afterwards.
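
The diagnosis and evidence-capture steps of the example runbook could be sketched roughly as follows. This is a minimal illustration and not the command the runbook intends: it assumes the node list comes from `kubectl get nodes -o json`, and the helper names `not_ready_nodes`, `capture_commands`, and `main`, as well as the five-minute default threshold (taken from the alert text), are hypothetical.

```python
import json
import subprocess
from datetime import datetime, timedelta, timezone


def not_ready_nodes(node_list, now, threshold=timedelta(minutes=5)):
    """Return (name, status, age) for nodes whose Ready condition has been
    anything other than "True" for at least `threshold`."""
    results = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is None:
            # No Ready condition reported at all; flag it without an age.
            results.append((name, "Missing", None))
            continue
        if ready.get("status") == "True":
            continue
        # Kubernetes reports timestamps as RFC 3339 in UTC, e.g. 2024-01-01T00:00:00Z.
        since = datetime.strptime(
            ready["lastTransitionTime"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        age = now - since
        if age >= threshold:
            results.append((name, ready["status"], age))
    return results


def capture_commands(node_name):
    """Hypothetical helper: commands worth running before a reboot or
    reprovision, so the node description and events survive for a post-mortem."""
    selector = f"involvedObject.kind=Node,involvedObject.name={node_name}"
    return [
        ["kubectl", "describe", "node", node_name],
        ["kubectl", "get", "events", "--all-namespaces", "--field-selector", selector],
    ]


def main():
    # Assumption: kubectl is installed and configured for the target cluster.
    raw = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    for name, status, age in not_ready_nodes(json.loads(raw), datetime.now(timezone.utc)):
        print(f"{name}: Ready={status} for {age}")
        for cmd in capture_commands(name):
            print("  capture with:", " ".join(cmd))
```

Calling `main()` against a live cluster prints each affected node together with the capture commands to run before any destructive mitigation. Keeping `not_ready_nodes` free of I/O means the check can also be exercised against a saved JSON dump.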
      

      More examples can be found at https://github.com/kubevirt/monitoring/blob/main/runbooks/VirtAPIDown.md

            Unassigned Unassigned
            fdeutsch@redhat.com Fabian Deutsch