  1. OpenShift Bugs
  2. OCPBUGS-60237

Singular etcdDatabaseQuotaLowSpace critical PrometheusRule isn't sufficient


    • Bug
    • Resolution: Unresolved
    • Normal
    • 4.20.0
    • 4.16, 4.17, 4.18, 4.19, 4.20, 4.21
    • Etcd
    • None
    • Quality / Stability / Reliability
    • False
    • None
    • None
    • Low
    • No
    • Done
    • Enhancement
    • With this update, the Cluster etcd Operator introduces alert levels for the `etcdDatabaseQuotaLowSpace` alert, giving administrators timely notifications when etcd database usage approaches its quota. This proactive alerting aims to prevent API server instability and allows for effective resource management in managed OpenShift clusters. The alert levels are `info`, `warning`, and `critical`, providing a more granular approach to monitoring etcd quota usage.
    • None
    • None
    • None
    • None

      Description of problem:

      There is a single alert bundled with cluster-etcd-operator, etcdDatabaseQuotaLowSpace, that fires when a cluster is using 95% of its etcd quota. As seen on Managed OpenShift, this alert often fires too late and doesn't give administrators enough time to correct issues before the API server is impacted.
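To illustrate why a single 95% threshold leaves little time to react, the following is a minimal arithmetic sketch. It assumes the default quota from the reproduction steps is 8 GiB, and the growth rate of 50 MiB per minute is a made-up figure for illustration only; the function names are hypothetical and not part of cluster-etcd-operator.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def headroom_bytes(quota_bytes: int, threshold_percent: float) -> float:
    """Return the bytes left between an alert threshold and the quota."""
    return quota_bytes * (100.0 - threshold_percent) / 100.0


def minutes_until_full(quota_bytes: int, threshold_percent: float,
                       growth_bytes_per_min: float) -> float:
    """Estimate minutes between the alert firing and the quota being hit,
    assuming a constant growth rate (a simplification)."""
    if growth_bytes_per_min <= 0:
        raise ValueError("growth rate must be positive")
    return headroom_bytes(quota_bytes, threshold_percent) / growth_bytes_per_min


# Assumption: 8 GiB quota; the 95% threshold comes from the issue text.
quota = 8 * GIB
left = headroom_bytes(quota, 95.0)
print(f"Headroom at 95%: {left / MIB:.1f} MiB")

# Assumption: a runaway writer adding 50 MiB per minute (illustrative only).
eta = minutes_until_full(quota, 95.0, 50 * MIB)
print(f"Time until quota is exhausted: {eta:.1f} minutes")
```

Under these assumptions, the alert fires with roughly 400 MiB of space remaining, which a fast-growing workload can consume in minutes.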

      Version-Release number of selected component (if applicable):

          

      How reproducible:

      Very

      Steps to Reproduce:

          1. Create a Managed OpenShift (or OCP) cluster with the default control plane size and the default 8 GB quota.
          2. Write a loop that creates a large number of big Secrets or ConfigMaps.
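Step 2 could be sketched roughly as follows. This is one possible approach, not the reporter's actual script: it only builds ConfigMap manifests as JSON and does not talk to a cluster. The namespace `etcd-quota-test`, the `quota-filler-` name prefix, and the function names are all hypothetical. Each payload should stay below the 1 MiB size limit that Kubernetes enforces on ConfigMap data.

```python
import json


def make_configmap(index: int, payload_bytes: int,
                   namespace: str = "etcd-quota-test") -> dict:
    """Build one ConfigMap manifest carrying a filler payload.

    The namespace and name prefix are hypothetical placeholders.
    """
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": f"quota-filler-{index}", "namespace": namespace},
        "data": {"payload": "x" * payload_bytes},
    }


def make_configmap_list(count: int, payload_bytes: int) -> dict:
    """Wrap `count` ConfigMaps in a v1 List so they can be created together."""
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [make_configmap(i, payload_bytes) for i in range(count)],
    }


def render(count: int, payload_bytes: int) -> str:
    """Serialize the List to JSON for piping into a client."""
    return json.dumps(make_configmap_list(count, payload_bytes))
```

For example, printing `render(100, 512 * 1024)` and piping the output to `oc create -f -` would create 100 ConfigMaps of about 512 KiB each. Only do this against a disposable test cluster whose namespace already exists, since the goal is to exhaust the etcd quota.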

      Actual results:

      The API server becomes unstable, and the only remedy is to resize the control plane (or the pods backing etcd in HCP), defragment etcd, and then try to regain access in order to delete resources.

      Expected results:

      Cluster administrators are alerted at info, warning, and then critical levels for etcdDatabaseQuotaLowSpace.
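The expected behavior can be sketched as a mapping from quota usage to a severity level. The issue only states the existing 95% threshold; the `info` and `warning` thresholds below (75% and 85%) are hypothetical values chosen for illustration, and the function is not the operator's implementation.

```python
from typing import Optional

# Ordered from most to least severe. Only the 95% value comes from the issue;
# the 85% and 75% thresholds are hypothetical placeholders.
THRESHOLDS = (
    ("critical", 95.0),
    ("warning", 85.0),
    ("info", 75.0),
)


def quota_severity(db_size_bytes: int, quota_bytes: int) -> Optional[str]:
    """Return the highest severity whose threshold is exceeded, or None."""
    if quota_bytes <= 0:
        raise ValueError("quota must be positive")
    used_percent = db_size_bytes / quota_bytes * 100
    for severity, threshold in THRESHOLDS:
        if used_percent > threshold:
            return severity
    return None
```

With tiers like these, an administrator would see an `info` alert well before the database reaches the point where the API server is affected, and could delete resources or defragment while the cluster is still responsive.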

      Additional info:

          

              dwest@redhat.com Dean West
              jbranham.openshift Josh Branham
              None
              None
              Sandeep Kundu
              None
              Votes: 0
              Watchers: 10
