[4.17] Singular etcdDatabaseQuotaLowSpace critical PrometheusRule isn't sufficient

    • Bug
    • Resolution: Done
    • Normal
    • 4.17.z
    • 4.16, 4.17, 4.18, 4.19, 4.20, 4.21
    • Etcd
    • None
    • Quality / Stability / Reliability
    • False
    • None
    • Low
    • No
    • None
    • None
    • None
    • Done
    • Enhancement
      With this update, the `cluster-etcd-operator` Operator implements a multi-stage notification system for the `etcdDatabaseQuotaLowSpace` alert to proactively manage etcd storage quotas. This enhancement is designed to prevent API server instability by providing earlier warnings of low database space. When etcd disk space usage reaches 65%, 75%, and 85%, administrators now receive alerts with a severity level of info, warning, and critical, respectively. (link:https://issues.redhat.com/browse/OCPBUGS-61337[OCPBUGS-61337])
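
The tiering described in the release note can be illustrated with a small sketch. This is not the operator's implementation (the real alert is a PromQL expression in a PrometheusRule); it only shows how the three thresholds from the note map to severities. The function names and the use of `>=` at each boundary are assumptions made for illustration.

```python
from typing import Optional

# Thresholds taken from the release note: 65% -> info, 75% -> warning,
# 85% -> critical. Listed from most to least severe so the first match wins.
SEVERITY_THRESHOLDS = [
    (85.0, "critical"),
    (75.0, "warning"),
    (65.0, "info"),
]


def usage_percent(db_size_bytes: int, quota_bytes: int) -> float:
    """Return etcd database size as a percentage of the quota."""
    if quota_bytes <= 0:
        raise ValueError("quota_bytes must be positive")
    return 100.0 * db_size_bytes / quota_bytes


def severity_for_usage(db_size_bytes: int, quota_bytes: int) -> Optional[str]:
    """Return the highest severity whose threshold is met, or None.

    Assumption: a threshold is met when usage is greater than or equal to it;
    the actual rule may use a strict comparison.
    """
    pct = usage_percent(db_size_bytes, quota_bytes)
    for threshold, severity in SEVERITY_THRESHOLDS:
        if pct >= threshold:
            return severity
    return None
```

For example, `severity_for_usage(80, 100)` falls in the warning tier, while anything below 65% of the quota returns `None`.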

      This is a clone of issue OCPBUGS-60443. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-60237. The following is the description of the original issue:

      Description of problem:

      There is a single alert bundled with cluster-etcd-operator, etcdDatabaseQuotaLowSpace, that fires when a cluster is using 95% of its etcd quota. As Managed OpenShift has seen, this alert often fires too late and doesn't give administrators enough time to correct issues before the API server is impacted.

      Version-Release number of selected component (if applicable):

          

      How reproducible:

      Very

      Steps to Reproduce:

          1. Create a Managed OpenShift (or OCP) cluster with the default control plane size and the default 8 GB quota.
          2. Write a loop to create lots of big secrets or configmaps.
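
A minimal sketch of the loop in step 2, assuming the `oc` CLI is logged in to a disposable test cluster and that a namespace named `etcd-fill-test` (a hypothetical name) already exists. The object names and the 512 KiB payload size are also assumptions; the payload is kept below the 1 MiB ConfigMap size limit. Do not run this against a cluster you care about.

```python
import json
import subprocess

# Assumption: 512 KiB per object, safely under the 1 MiB ConfigMap limit.
PAYLOAD_BYTES = 512 * 1024


def build_configmap(index: int, namespace: str = "etcd-fill-test") -> dict:
    """Build one large ConfigMap manifest (names are hypothetical)."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": f"filler-{index:05d}", "namespace": namespace},
        "data": {"blob": "x" * PAYLOAD_BYTES},
    }


def create_configmaps(count: int, namespace: str = "etcd-fill-test") -> None:
    """Create `count` large ConfigMaps by piping each manifest to `oc create -f -`."""
    for i in range(count):
        manifest = json.dumps(build_configmap(i, namespace))
        # check=True stops the loop at the first failure, for example once
        # the API server starts rejecting writes.
        subprocess.run(
            ["oc", "create", "-f", "-"],
            input=manifest,
            text=True,
            check=True,
        )
```

Calling `create_configmaps(1000)` after `oc new-project etcd-fill-test` would write roughly 500 MiB of payload; how quickly that approaches the quota depends on etcd revision history and compaction.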

      Actual results:

      The API server is unstable, and the only solution is to resize the control plane (or the pods backing etcd, if in HCP), perform a defrag, and then try to get back in to delete resources.

      Expected results:

      Cluster administrators are alerted at info, warning, and then critical levels for etcdDatabaseQuotaLowSpace.

      Additional info:

          

              dwest@redhat.com Dean West
              jbranham.openshift Josh Branham
              None
              None
              Sandeep Kundu Sandeep Kundu
              None