1. Proposed title of this feature request
Adjust thresholds for `etcdHighFsyncDurations` and `etcdHighCommitDurations` alerts
2. What is the nature and description of the request?
Bug OCPBUGSM-33506 [1] was opened in the past to report that the thresholds for the `etcdHighFsyncDurations` and `etcdHighCommitDurations` alerts are far above the recommended values [2][3].
The recommended values are:
- `etcd_disk_backend_commit_duration_seconds_bucket`: p99 < 25 ms (0.025 s)
- `etcd_disk_wal_fsync_duration_seconds_bucket`: p99 < 10 ms (0.010 s)
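To see where a cluster currently stands against these recommendations, the p99 latencies can be queried directly (these are the same `histogram_quantile` expressions the alerts evaluate, without the threshold comparison):

```
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
```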
But the current alerts in the OpenShift cluster are using much higher values.
For the `etcdHighCommitDurations` alert the recommended value is less than 0.025 s, but the alert only fires above 0.25 s, 10x the recommended value:

```
alert: etcdHighCommitDurations
expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.25
```
For the `etcdHighFsyncDurations` critical alert the recommended value is 0.010 s, but the alert only fires above 1 s, 100x the recommended value:

```
alert: etcdHighFsyncDurations (critical)
expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 1
```
For the `etcdHighFsyncDurations` warning alert the recommended value is 0.010 s, but the alert only fires above 0.05 s, 5x the recommended value (a good value would be below 0.010 s):

```
alert: etcdHighFsyncDurations (warning)
expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.05
```
These alerts and their thresholds come from upstream Kubernetes, but they are not valid for a healthy OpenShift production environment: by the time an alert fires, for example when fsync latency exceeds 100x the supported value for `etcdHighFsyncDurations`, the cluster is already in a problem big enough to knock out the monitoring stack in most cases.
In addition, customers may not be supported if their etcd metrics are not within the supported thresholds. Our alerts must report that situation to the customer early, so they do not find out only when a production-down situation happens.
The proposal, then, is to modify the current thresholds, or to create new "OpenShift" alerts, matching the values recommended for an "OpenShift" cluster.
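As an illustrative sketch only (not the shipped rules), a `PrometheusRule` fragment with thresholds aligned to the recommended values could look like the following. The object name, `for` durations, and the critical threshold (2x the recommended value, per the note below) are assumptions:

```yaml
# Sketch: etcd latency alerts at the recommended thresholds.
# Warning fires at the recommended p99 value, critical at 2x it.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-latency-recommended   # hypothetical name
  namespace: openshift-monitoring
spec:
  groups:
  - name: etcd-latency
    rules:
    - alert: etcdHighFsyncDurations
      expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.010
      for: 10m
      labels:
        severity: warning
    - alert: etcdHighFsyncDurations
      expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.020
      for: 10m
      labels:
        severity: critical
    - alert: etcdHighCommitDurations
      expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.025
      for: 10m
      labels:
        severity: warning
```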
NOTE: if a customer is willing to accept higher latency in a particular environment, OpenShift 4.14 delivers a feature for overriding the thresholds of the alerts. By default, however, an out-of-the-box OpenShift cluster should fire a "warning" alert when the recommended value is exceeded and a "critical" alert when, for example, 2x the recommended value is exceeded.
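Assuming the 4.14 mechanism referred to here is the platform alerting customization CRDs (`AlertingRule` in `monitoring.openshift.io/v1`), a customer accepting higher latency could add their own rule along these lines; the object name, alert name, and threshold are illustrative assumptions:

```yaml
# Hypothetical sketch: a customer-defined etcd latency alert with a
# relaxed threshold, layered on top of the default platform alerts.
apiVersion: monitoring.openshift.io/v1
kind: AlertingRule
metadata:
  name: etcd-custom-latency   # hypothetical name
  namespace: openshift-monitoring
spec:
  groups:
  - name: etcd-custom
    rules:
    - alert: etcdFsyncAboveAcceptedLatency   # hypothetical alert name
      expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.050
      for: 10m
      labels:
        severity: warning
```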
3. Why does the customer need this? (List the business requirements here)
The current thresholds for the `etcdHighFsyncDurations` and `etcdHighCommitDurations` alerts are not helpful for identifying a real issue in the cluster.
4. List any affected packages or components.
Monitoring, etcd
[1] https://issues.redhat.com/browse/OCPBUGSM-33506
[2] https://access.redhat.com/solutions/4770281
[3] https://etcd.io/docs/v3.4/faq/#performance