OpenShift Request For Enhancement
RFE-7643

Add Alerts/Metrics to track delays or failures in etcd compaction/defrag processes in RHOCP4

      1. Proposed Title of This Feature Request
      Alerting and Visibility for etcd Compaction and Auto-Defrag Failures in OpenShift Clusters

      2. What Is the Nature and Description of the Request?

      We have encountered an issue where, due to a failure in etcd’s compaction or defragmentation processes, the etcd database grows unchecked, causing potential cluster health degradation.
      The compaction process is crucial for maintaining the health and performance of the etcd database, and failures in this process should be detectable before they lead to performance issues.
      Currently, there are no built-in alerts or metrics to notify users when compaction or defragmentation is delayed or stopped, even though etcd’s database size can grow significantly over time.

      Actual results:
      – etcd compaction halts silently with no recovery for months
      – No alert or metric is triggered to indicate missed compaction
      – Metrics like etcd_debugging_mvcc_db_compaction_last are available but do not raise alerts or explicitly surface the health degradation (a manual check against this metric is sketched after this list)
      – Grafana dashboards and Prometheus data falsely indicate all is well, even when compaction hasn't occurred in weeks
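
      As noted in the list above, etcd_debugging_mvcc_db_compaction_last is exposed today, but nothing alerts on it. A minimal sketch of the kind of manual check an administrator currently has to script themselves, assuming a reachable Prometheus route and a bearer token (both placeholders below):

      # Minimal sketch: query the in-cluster Prometheus for the last etcd compaction
      # timestamp and report how stale it is. PROM_URL and TOKEN are placeholders
      # for your environment (e.g. the prometheus-k8s route and `oc whoami -t`).
      import time
      import requests

      PROM_URL = "https://<prometheus-k8s-route>/api/v1/query"   # placeholder
      TOKEN = "<bearer-token>"                                    # placeholder

      resp = requests.get(
          PROM_URL,
          params={"query": "etcd_debugging_mvcc_db_compaction_last"},
          headers={"Authorization": f"Bearer {TOKEN}"},
          verify=False,  # demo only; point `verify` at the cluster CA bundle in practice
      )
      resp.raise_for_status()

      for result in resp.json()["data"]["result"]:
          member = result["metric"].get("instance", "unknown")
          last_compaction = float(result["value"][1])        # unix timestamp of the last compaction
          age_hours = (time.time() - last_compaction) / 3600.0
          print(f"{member}: last compaction {age_hours:.1f}h ago")
          # OpenShift compacts etcd roughly every five minutes by default, so a value
          # that has not moved for hours suggests compaction has stalled.
          if age_hours > 1:
              print(f"  WARNING: compaction appears stalled on {member}")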

      Expected results:
      There should be an alert or metric to notify administrators when any of the following occurs:
      – Compaction is delayed or stopped.
      – Compaction takes longer than expected to complete.
      – Automatic defragmentation is paused or fails to run, preventing the database from reclaiming space.
      – Any failure in the compaction process; such a failure should trigger a warning that can be correlated with potential system instability.
      In short, default cluster monitoring should be enhanced to warn administrators before database exhaustion caused by missing compaction.

      The expectation from this RFE is to introduce built-in alerts in OpenShift for:
      – Stalled/delayed etcd compaction
      – Failed/missed auto defrag
      – etcd database file size exceeding thresholds without a corresponding compaction
      Expose metrics such as (their possible shape is sketched after this list):
      – etcd_compaction_status: [OK, Delayed, Failed]
      – etcd_defrag_status: [OK, Skipped, Failed]
      – etcd_compaction_delay_seconds: time since the last compaction
      In addition:
      – Provide best practices or support for users to configure Prometheus alerts based on these metrics
      – Improve documentation around etcd observability and maintenance visibility
      – Surface compaction status in the OpenShift console's cluster operator UI or in etcd operator logs
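
      To make the proposal concrete, the sketch below illustrates the shape the proposed metrics could take. It uses the Python prometheus_client purely for brevity; an actual implementation would presumably live in the Go-based cluster-etcd-operator, and the update logic shown is a placeholder, not the operator's real compaction bookkeeping. The metric names and states are the ones proposed above.

      import time
      from prometheus_client import Enum, Gauge, start_http_server

      # Proposed metric names and states taken from the list above.
      compaction_status = Enum(
          "etcd_compaction_status",
          "Health of the periodic etcd compaction",
          states=["OK", "Delayed", "Failed"],
      )
      defrag_status = Enum(
          "etcd_defrag_status",
          "Outcome of the most recent automatic defragmentation",
          states=["OK", "Skipped", "Failed"],
      )
      compaction_delay = Gauge(
          "etcd_compaction_delay_seconds",
          "Seconds elapsed since the last observed compaction",
      )

      def observe(last_compaction_ts: float, delay_threshold: float = 3600.0) -> None:
          # Placeholder logic: a real operator would derive these values from etcd's
          # own compaction/defrag bookkeeping, not from a single timestamp.
          delay = time.time() - last_compaction_ts
          compaction_delay.set(delay)
          compaction_status.state("Delayed" if delay > delay_threshold else "OK")

      if __name__ == "__main__":
          start_http_server(8000)                  # serve /metrics for Prometheus to scrape
          observe(last_compaction_ts=time.time())
          while True:
              time.sleep(60)

      With a delay metric of this shape, administrators could express the requested alerts as ordinary Prometheus rules, for example firing a warning when etcd_compaction_delay_seconds stays above an agreed threshold for more than a few minutes.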

      3. Why Does the Customer Need This? (Business Requirements)

      Critical impact of this issue:
      – When etcd compaction or defragmentation silently stops, the etcd database grows uncontrollably until it exhausts disk space. At that point, etcd stops serving traffic, leading to a cluster-wide outage that impacts all workloads and control plane operations.
      – The failure is effectively impossible to detect proactively without deep manual inspection of etcd logs.
      – This creates a false sense of security in cluster health dashboards and delays any mitigation effort until it’s too late.
      – A silent failure of such a foundational component undermines confidence in OpenShift itself.

      4. List Any Affected Packages or Components

      • Component: etcd
      • Component: cluster-etcd-operator
      • Component: Monitoring / Prometheus / Alertmanager
      • Potentially related: ClusterOperator status reporting, oc adm inspect, must-gather, Telemetry

      Additional details:

      How reproducible -
      Occasionally reproducible under the following conditions:
      – etcd node undergoes ungraceful reboot
      – etcd peer communication becomes intermittent or degraded
      – etcd deprioritizes compaction/defragmentation due to internal system pressure or instability
      – Maintenance is silently skipped and not resumed automatically

      Steps to reproduce -
      1. Start with a healthy OCP cluster with normal etcd compaction behavior
      2. Restart an etcd node (e.g., via crash, rolling update, or ungraceful reboot)
      3. Introduce raft peer communication instability or temporary network disruption
      4. Wait and observe that etcd compaction silently stops
      5. Verify metric etcd_debugging_mvcc_db_compaction_last has not updated for days/weeks
      6. Check that no alerts are generated, and dashboards misleadingly show the cluster as healthy
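
      Steps 5 and 6 can be verified from a terminal. A minimal sketch, assuming the in-cluster Prometheus route and a bearer token (both placeholders below):

      import requests

      PROM = "https://<prometheus-k8s-route>"            # placeholder
      HEADERS = {"Authorization": "Bearer <token>"}      # placeholder

      # Step 5: seconds since the last compaction, per etcd member.
      staleness = requests.get(
          f"{PROM}/api/v1/query",
          params={"query": "time() - etcd_debugging_mvcc_db_compaction_last"},
          headers=HEADERS,
          verify=False,  # demo only
      ).json()
      for r in staleness["data"]["result"]:
          days = float(r["value"][1]) / 86400.0
          print(f"{r['metric'].get('instance', 'unknown')}: {days:.1f} days since last compaction")

      # Step 6: list any firing etcd-related alerts. Today this typically comes back
      # empty even when compaction has stalled, which is exactly the gap this RFE describes.
      alerts = requests.get(f"{PROM}/api/v1/alerts", headers=HEADERS, verify=False).json()
      firing = [
          a["labels"].get("alertname", "")
          for a in alerts["data"]["alerts"]
          if a["state"] == "firing" and "etcd" in a["labels"].get("alertname", "").lower()
      ]
      print("firing etcd alerts:", firing or "none")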
