1. Proposed title of this feature request
Adjust thresholds for `etcdHighFsyncDurations` and `etcdHighCommitDurations` alerts
2. What is the nature and description of the request?
Bug OCPBUGSM-33506 [1] was opened in the past to report that the thresholds for the `etcdHighFsyncDurations` and `etcdHighCommitDurations` alerts are far above the recommended values [2][3].
The recommended values are:
- `etcd_disk_backend_commit_duration_seconds_bucket`: p99 < 25 ms (0.025 s)
- `etcd_disk_wal_fsync_duration_seconds_bucket`: p99 < 10 ms (0.010 s)
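To see where a cluster currently stands against these recommendations, the p99 latencies can be queried directly (these are the same `histogram_quantile` expressions the alerts evaluate, without the threshold comparison):

```
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
```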
But the current alerts in the OpenShift cluster are using much higher values.
For the `etcdHighCommitDurations` alert the recommended value is less than 0.025 s, but the alert only fires above 0.25 s, 10x the recommended value:

```
alert: etcdHighCommitDurations
expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.25
```
For the `etcdHighFsyncDurations` critical alert the recommended value is 0.010 s, but the alert only fires above 1 s, 100x the recommended value:

```
alert: etcdHighFsyncDurations (critical)
expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 1
```
For the `etcdHighFsyncDurations` warning alert the recommended value is 0.010 s, but the alert only fires above 0.05 s, 5x the recommended value (a good value would be below 0.010 s):

```
alert: etcdHighFsyncDurations (warning)
expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.05
```
These alerts and their thresholds come from upstream Kubernetes, but they are not valid for a healthy OpenShift production environment: by the time an alert fires, for example when fsync latency exceeds 100x the supported value for `etcdHighFsyncDurations`, the cluster is already in a problem big enough to knock out the monitoring stack in most cases.
In addition, customers may not be supported if their etcd metrics are not within the supported thresholds. Our alerts must report that situation to the customer early, so they do not find out only when a production-down situation happens.
The proposal, then, is to modify the current thresholds, or to create new "OpenShift" alerts, matching the values recommended for an "OpenShift" cluster.
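As an illustrative sketch only (not the shipped rules), a `PrometheusRule` fragment with thresholds aligned to the recommended values could look like the following. The object name, `for` durations, and the critical threshold (2x the recommended value, per the note below) are assumptions:

```yaml
# Sketch: etcd latency alerts at the recommended thresholds.
# Warning fires at the recommended p99 value, critical at 2x it.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-latency-recommended   # hypothetical name
  namespace: openshift-monitoring
spec:
  groups:
  - name: etcd-latency
    rules:
    - alert: etcdHighFsyncDurations
      expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.010
      for: 10m
      labels:
        severity: warning
    - alert: etcdHighFsyncDurations
      expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.020
      for: 10m
      labels:
        severity: critical
    - alert: etcdHighCommitDurations
      expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.025
      for: 10m
      labels:
        severity: warning
```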
NOTE: if a customer is willing to accept higher latency in a particular environment, OpenShift 4.14 delivers a feature for overriding the thresholds of the alerts. By default, however, an out-of-the-box OpenShift cluster should fire a "warning" alert when the recommended value is exceeded and a "critical" alert when, for example, 2x the recommended value is exceeded.
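Assuming the 4.14 mechanism referred to here is the platform alerting customization CRDs (`AlertingRule` in `monitoring.openshift.io/v1`), a customer accepting higher latency could add their own rule along these lines; the object name, alert name, and threshold are illustrative assumptions:

```yaml
# Hypothetical sketch: a customer-defined etcd latency alert with a
# relaxed threshold, layered on top of the default platform alerts.
apiVersion: monitoring.openshift.io/v1
kind: AlertingRule
metadata:
  name: etcd-custom-latency   # hypothetical name
  namespace: openshift-monitoring
spec:
  groups:
  - name: etcd-custom
    rules:
    - alert: etcdFsyncAboveAcceptedLatency   # hypothetical alert name
      expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.050
      for: 10m
      labels:
        severity: warning
```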
3. Why does the customer need this? (List the business requirements here)
The current thresholds for the `etcdHighFsyncDurations` and `etcdHighCommitDurations` alerts are not helpful for identifying a real issue in the cluster.
4. List any affected packages or components.
Monitoring, etcd
[1] https://issues.redhat.com/browse/OCPBUGSM-33506
[2] https://access.redhat.com/solutions/4770281
[3] https://etcd.io/docs/v3.4/faq/#performance