Type: Feature Request
Resolution: Unresolved
Priority: Critical
Affects Versions: 4.16, 4.17, 4.18
Work Type: Product / Portfolio Work
1. Proposed Title of This Feature Request
Alerting and Visibility for etcd Compaction and Auto-Defrag Failures in OpenShift Clusters
2. What Is the Nature and Description of the Request?
We have encountered an issue where, due to a failure in etcd’s compaction or defragmentation processes, the etcd database grows unchecked, causing potential cluster health degradation.
The compaction process is crucial for maintaining the health and performance of the etcd database, and failures in this process should be detectable before they lead to performance issues.
Currently, there are no built-in alerts or metrics to notify administrators when compaction or defragmentation is delayed or stopped, and in the meantime the etcd database can grow significantly.
Actual results:
– etcd compaction halts silently with no recovery for months
– No alert or metric is triggered to indicate missed compaction
– Metrics like etcd_debugging_mvcc_db_compaction_last are available but do not raise alerts or surface health degradation explicitly (a manual staleness check is sketched after this list)
– Grafana dashboards and Prometheus data continue to indicate that all is well, even when compaction has not occurred for weeks
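As a point of reference, the staleness check mentioned above can be approximated by hand today. The sketch below is an illustration of the gap, not a proposed implementation: it assumes a reachable Prometheus endpoint (here a local port-forward) and uses the prometheus/client_golang query API against the etcd_debugging_mvcc_db_compaction_last metric already named in this report.

// Hedged sketch: approximate the missing "stalled compaction" check by asking
// Prometheus how long ago etcd last recorded a compaction. The Prometheus
// address and the interpretation threshold are assumptions for illustration.
package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "github.com/prometheus/client_golang/api"
    v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
    client, err := api.NewClient(api.Config{
        Address: "http://localhost:9090", // assumed: a port-forwarded in-cluster Prometheus
    })
    if err != nil {
        log.Fatalf("creating Prometheus client: %v", err)
    }

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // Seconds elapsed since each etcd instance last reported a compaction.
    const query = `time() - etcd_debugging_mvcc_db_compaction_last`
    result, warnings, err := v1.NewAPI(client).Query(ctx, query, time.Now())
    if err != nil {
        log.Fatalf("querying Prometheus: %v", err)
    }
    if len(warnings) > 0 {
        log.Printf("query warnings: %v", warnings)
    }

    // Values far above the regular compaction interval suggest that
    // compaction has silently stopped on that member.
    fmt.Println(result)
}

Today this kind of query has to be run and interpreted manually, which is exactly the gap this RFE asks to close.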
Expected results:
There should be an alert or metric to notify administrators when any of the following occurs:
– Compaction is delayed or stopped.
– Compaction takes longer than expected to complete.
– Automatic defragmentation is paused or fails to run, preventing the database from reclaiming space.
– The compaction process fails in any way; such failures can be correlated with potential system instability.
In addition, default cluster monitoring should be enhanced to warn before database space is exhausted due to missing compaction.
The expectation from this RFE is that OpenShift introduce built-in alerts for:
– Stalled/delayed etcd compaction
– Failed/missed auto defrag
– etcd database file size exceeding thresholds without corresponding compaction
It should also expose metrics such as the following (a sketch of how these could be served appears after this list):
– etcd_compaction_status: [OK, Delayed, Failed]
– etcd_defrag_status: [OK, Skipped, Failed]
– etcd_compaction_delay_seconds: time since last compaction
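To make the request concrete, below is a minimal, hedged sketch of how the three metrics named above could be registered and served with prometheus/client_golang. The metric names come from this RFE; the numeric status encoding, the one-hour delay threshold, and the watchCompaction helper are illustrative assumptions, not actual cluster-etcd-operator code.

package main

import (
    "log"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Assumed encoding: 0=OK, 1=Delayed, 2=Failed.
    compactionStatus = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "etcd_compaction_status",
        Help: "Compaction health: 0=OK, 1=Delayed, 2=Failed.",
    })
    // Assumed encoding: 0=OK, 1=Skipped, 2=Failed.
    defragStatus = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "etcd_defrag_status",
        Help: "Auto-defragmentation health: 0=OK, 1=Skipped, 2=Failed.",
    })
    compactionDelay = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "etcd_compaction_delay_seconds",
        Help: "Seconds since the last observed compaction.",
    })
)

// watchCompaction periodically refreshes the delay gauge from the time of the
// last observed compaction and flips the status to Delayed past a threshold.
func watchCompaction(lastCompaction func() time.Time) {
    const delayedAfter = time.Hour // assumed threshold
    for range time.Tick(30 * time.Second) {
        delay := time.Since(lastCompaction())
        compactionDelay.Set(delay.Seconds())
        if delay > delayedAfter {
            compactionStatus.Set(1) // Delayed
        } else {
            compactionStatus.Set(0) // OK
        }
    }
}

func main() {
    // Placeholder: a real operator would record actual compaction and defrag events.
    start := time.Now()
    go watchCompaction(func() time.Time { return start })
    defragStatus.Set(0)

    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", nil))
}

Administrators, or a shipped alerting rule, could then alert on etcd_compaction_delay_seconds or on a non-zero status value, which also ties into the best-practices ask below.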
The RFE further asks to:
– Provide best practices or support for users to configure Prometheus alerts based on these metrics
– Improve documentation around etcd observability and maintenance visibility
– Surface compaction status in the OpenShift console's cluster operator UI or in the etcd operator logs
3. Why Does the Customer Need This? (Business Requirements)
Critical impact of this issue:
– When etcd compaction or defragmentation silently stops, the etcd database grows uncontrollably until it exhausts disk space. At that point, etcd stops serving traffic, leading to a cluster-wide outage, impacting all workloads and control plane operations.
– This failure is impossible to detect proactively without deep manual inspection of etcd logs.
– This creates a false sense of security in cluster health dashboards and delays any mitigation effort until it’s too late.
– A silent failure of such a foundational component undermines confidence in OpenShift itself.
4. List Any Affected Packages or Components
- Component: etcd
- Component: cluster-etcd-operator
- Component: Monitoring / Prometheus / Alertmanager
- Potentially related: ClusterOperator status reporting, oc adm inspect, must-gather, Telemetry
Additional details:
How reproducible:
Occasionally reproducible under the following conditions:
– etcd node undergoes ungraceful reboot
– etcd peer communication becomes intermittent or degraded
– etcd deprioritizes compaction/defragmentation due to internal system pressure or instability
– Maintenance is silently skipped and not resumed automatically
Steps to reproduce:
1. Start with a healthy OCP cluster with normal etcd compaction behavior
2. Restart an etcd node (e.g., via crash, rolling update, or ungraceful reboot)
3. Introduce raft peer communication instability or temporary network disruption
4. Wait and observe that etcd compaction silently stops
5. Verify that the metric etcd_debugging_mvcc_db_compaction_last has not updated for days or weeks (a manual inspection sketch follows these steps)
6. Check that no alerts are generated and that dashboards misleadingly show the cluster as healthy
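For completeness, the kind of deep manual inspection referenced above can also be done against an etcd member directly. The sketch below assumes an etcd 3.5 Go client and omits the TLS configuration an OpenShift cluster would require; it uses the clientv3 maintenance Status call to compare the on-disk database size with the bytes actually in use.

// Sketch of a manual inspection: query an etcd member for its database size
// versus the bytes actually in use. The endpoint and the TLS-less config are
// assumptions; on OpenShift the etcd client certificates would be required.
package main

import (
    "context"
    "fmt"
    "log"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    const endpoint = "https://localhost:2379" // assumed endpoint

    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{endpoint},
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        log.Fatalf("connecting to etcd: %v", err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    status, err := cli.Status(ctx, endpoint)
    if err != nil {
        log.Fatalf("fetching member status: %v", err)
    }

    // DbSize is the on-disk backend size; DbSizeInUse excludes free pages.
    // A DbSize far above DbSizeInUse indicates fragmentation that a defrag
    // would reclaim; both growing without bound points at missed compaction.
    fmt.Printf("dbSize=%d bytes, dbSizeInUse=%d bytes, revision=%d\n",
        status.DbSize, status.DbSizeInUse, status.Header.Revision)
}

That this comparison has to be run by hand is precisely why the alerts requested above are needed.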