  1. OpenShift Logging
  2. LOG-1100

Create warning alerts to prevent users from reaching disk watermark thresholds

Details

    • Logging (LogExp) - Sprint 201
    • 2
    • Passed
    • NEW
      This release adds the following new `ElasticsearchNodeDiskWatermarkReached` warnings to the OpenShift Elasticsearch Operator (EO):
       - Elasticsearch Node Disk Low Watermark Reached
       - Elasticsearch Node Disk High Watermark Reached
       - Elasticsearch Node Disk Flood Watermark Reached

      The EO issues these warnings when it predicts that an Elasticsearch node will reach the `Disk Low Watermark`, `Disk High Watermark`, or `Disk Flood Stage Watermark` thresholds in the next 6 hours. This warning period gives you time to respond before the node reaches the disk watermark thresholds. The warning messages also provide links to the troubleshooting steps, which you can follow to help mitigate the issue. The EO applies the past several hours of disk space data to a linear model to generate these warnings.

    Description

      Currently, our alerts fire only after a customer has already reached a disk watermark threshold. By that point, the customer has critical steps to take.

       

      We should adjust our alerts to give users an advance warning that, based on the current trend, they will reach a threshold within a given amount of time.

       

      Notes:

      https://prometheus.io/docs/prometheus/latest/querying/functions/#predict_linear

      https://github.com/openshift/elasticsearch-operator/blob/master/files/prometheus_alerts.yml#L47
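
To make the idea behind `predict_linear` concrete, here is a minimal Python sketch of a least-squares extrapolation over recent samples. It is an illustration only, not the operator's rule or Prometheus's implementation: the function name mirrors the PromQL function, but the signature, the synthetic `history` data, and the choice to measure the horizon from the last sample (rather than from the rule evaluation time) are all assumptions. Samples are assumed to be ordered oldest to newest.

```python
from typing import Sequence, Tuple


def predict_linear(samples: Sequence[Tuple[float, float]], horizon_s: float) -> float:
    """Extrapolate a least-squares line through (timestamp_s, value) samples.

    Returns the fitted value `horizon_s` seconds after the last sample.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to fit a line")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    # Ordinary least-squares slope and intercept.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    # Simplification: treat the newest sample's timestamp as "now".
    target_t = samples[-1][0] + horizon_s
    return intercept + slope * target_t


# Hypothetical data: one hour of free-space readings (percent free), one per
# minute, starting at 30% and dropping by 0.05 percentage points per minute.
history = [(60.0 * i, 30.0 - 0.05 * i) for i in range(61)]
free_in_6h = predict_linear(history, 6 * 3600)
```

With this synthetic trend, the fitted line puts free space at roughly 9% six hours after the last sample, which is the kind of value a predictive warning would compare against a watermark threshold.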

       

      Acceptance Criteria:

      • We provide a warning that the cluster will reach the low watermark threshold within a reasonable amount of time (6 hrs?)
      • We provide a more severe alert that the cluster will reach the high watermark threshold within a reasonable amount of time (6 hrs?)
      • We provide an actionable entry within the runbook for when the low watermark threshold will be met
      • We provide an actionable entry within the runbook for when the high watermark threshold will be met
      • Ensure that the alerts that currently exist inhibit these new alerts (so that we aren't getting multiple alerts for the same issue)
      • Create an initial unit test for the linear prediction (the alerts require ~1 hr of data to fire properly): https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/
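
The inhibition criterion above can be sketched in Python as follows. This only illustrates the intended behavior (an "already reached" alert suppresses the predictive warning for the same watermark); in a real deployment that logic would live in the alert rules or in Alertmanager, not in Python. The function name and the alert name suffixes are hypothetical, and the thresholds are Elasticsearch's default watermarks, assuming the cluster has not overridden them.

```python
from typing import Dict, List

# Elasticsearch's default disk watermarks, as percent of disk used.
# Assumption: the cluster has not overridden these defaults.
WATERMARKS_USED_PCT: Dict[str, float] = {
    "low": 85.0,
    "high": 90.0,
    "flood_stage": 95.0,
}


def alerts_to_fire(used_now_pct: float, used_predicted_pct: float) -> List[str]:
    """Return alert names, suppressing a predictive warning once its
    'already reached' counterpart fires for the same watermark."""
    alerts = []
    for name, threshold in WATERMARKS_USED_PCT.items():
        if used_now_pct >= threshold:
            # The existing alert fires and inhibits the prediction.
            alerts.append(f"{name}_reached")
        elif used_predicted_pct >= threshold:
            # Not reached yet, but the trend crosses the threshold.
            alerts.append(f"{name}_predicted")
    return alerts


# Hypothetical node: 86% used now, predicted 96% used in 6 hours.
example = alerts_to_fire(86.0, 96.0)
```

For the hypothetical node above, the low watermark is already reached, so only its existing alert is returned, while the high and flood-stage watermarks yield predictive warnings. Each watermark therefore produces at most one alert.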

            People

              sasagarw@redhat.com Sashank Agarwal
              ewolinet@redhat.com Eric Wolinetz (Inactive)
              Qiaoling Tang Qiaoling Tang
              Votes: 0
              Watchers: 7