Feature
Resolution: Done
Feature Overview
etcd-operator API to set ETCD_ELECTION_TIMEOUT and ETCD_HEARTBEAT_INTERVAL
OCP 4 offers no supported way to tune etcd parameters such as the election timeout and the heartbeat interval. Adjusting these parameters indiscriminately may compromise the stability of the control plane. However, in scenarios where disk IOPS are not ideal (e.g. disk degradation, storage providers in cloud environments), these parameters could be adjusted to improve the stability of the control plane, with the corresponding warning notifications raised.
In the past:
- There have been one-off workarounds for cloud providers (https://github.com/openshift/machine-config-operator/pull/1507) (https://github.com/openshift/cluster-etcd-operator/pull/218) to tune these parameters.
- There have been requests from the community to make these parameters tunable (https://github.com/openshift/cluster-etcd-operator/pull/515) (https://github.com/openshift/cluster-etcd-operator/issues/499).
The current default values on a 4.10 deployment are:
```
- name: ETCD_ELECTION_TIMEOUT
  value: "1000"
- name: ETCD_ENABLE_PPROF
  value: "true"
- name: ETCD_EXPERIMENTAL_MAX_LEARNERS
  value: "3"
- name: ETCD_EXPERIMENTAL_WARNING_APPLY_DURATION
  value: 200ms
- name: ETCD_EXPERIMENTAL_WATCH_PROGRESS_NOTIFY_INTERVAL
  value: 5s
- name: ETCD_HEARTBEAT_INTERVAL
  value: "100"
```
These values are overridden for specific cloud providers (https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/etcdenvvar/etcd_env.go#L232-L254).
The guidance for latency among control plane nodes does not translate well to live on-premise scenarios: https://access.redhat.com/articles/3220991
Goals (aka. expected user outcomes)
Recognizing the need for exceptions, and to relieve the etcd-operator of maintaining special settings for different deployment modes, this Feature is for:
- Defining an etcd-operator API that gives the cluster-admin the ability to set `ETCD_ELECTION_TIMEOUT` and `ETCD_HEARTBEAT_INTERVAL` within a certain range.
Requirements (aka. Acceptance Criteria):
- The feature should protect the cluster from having these parameters set to values outside good practices (https://etcd.io/docs/v3.5/tuning/)
- The feature should enforce a consistent range correlation between the two values
- The feature should enforce parameter ranges that work on a network with a max RTT of 150ms
- The feature should log clear warnings when the parameters are modified (the warnings should be visible in must-gather)
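The acceptance criteria above could be checked along the following lines. This is only a sketch under stated assumptions, not the etcd-operator implementation: the `EtcdTimings` type, the `validate` function, and the `rtt_ms` argument are hypothetical names. The numeric rules are taken from the etcd tuning guide and etcd's own startup checks (election timeout at least 10 times the RTT, at least 5 times the heartbeat interval, and at most 50000ms), combined with the 150ms RTT cap from this ticket.

```python
from dataclasses import dataclass
from typing import List

# Assumed limits for illustration; none of these names come from the operator.
MAX_RTT_MS = 150                   # max RTT named in the requirements
MIN_ELECTION_RTT_FACTOR = 10       # tuning guide: election timeout >= 10x RTT
MIN_ELECTION_HEARTBEAT_FACTOR = 5  # etcd rejects election timeout < 5x heartbeat
MAX_ELECTION_TIMEOUT_MS = 50_000   # etcd upper limit for the election timeout


@dataclass(frozen=True)
class EtcdTimings:
    heartbeat_interval_ms: int  # ETCD_HEARTBEAT_INTERVAL
    election_timeout_ms: int    # ETCD_ELECTION_TIMEOUT


def validate(t: EtcdTimings, rtt_ms: int) -> List[str]:
    """Return human-readable violations; an empty list means accepted."""
    errors = []
    if not 0 < rtt_ms <= MAX_RTT_MS:
        errors.append(f"RTT {rtt_ms}ms is outside the supported (0, {MAX_RTT_MS}]ms range")
    if t.heartbeat_interval_ms <= 0:
        errors.append("heartbeat interval must be positive")
    if t.election_timeout_ms > MAX_ELECTION_TIMEOUT_MS:
        errors.append(f"election timeout exceeds {MAX_ELECTION_TIMEOUT_MS}ms")
    if t.election_timeout_ms < MIN_ELECTION_HEARTBEAT_FACTOR * t.heartbeat_interval_ms:
        errors.append("election timeout must be at least 5x the heartbeat interval")
    if t.election_timeout_ms < MIN_ELECTION_RTT_FACTOR * rtt_ms:
        errors.append("election timeout must be at least 10x the RTT")
    return errors


if __name__ == "__main__":
    # The 4.10 defaults (100ms / 1000ms) evaluated against a 150ms RTT network.
    for problem in validate(EtcdTimings(100, 1000), rtt_ms=150):
        print("WARNING:", problem)
```

Returning the full list of violations, rather than failing on the first one, would let the operator surface every problem in a single warning that must-gather can capture.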
Out of Scope
The documentation should provide clear guidance on considerations when modifying the parameters and warnings of potential risks.
Update June 2023:
- The work will use the concept of validated selectable etcd profiles to deliver this functionality
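As a rough illustration of what selectable profiles could look like, the sketch below maps a profile name to a pre-validated pair of environment variables instead of accepting free-form values. The profile names and the `slower` values are assumptions made for this example only; the `standard` pair mirrors the 4.10 defaults listed earlier in this ticket. `PROFILES` and `env_for_profile` are hypothetical names.

```python
from typing import Dict

# Hypothetical profile table. Only the "standard" values come from this
# ticket (the 4.10 defaults); the "slower" entry is an illustrative guess.
PROFILES: Dict[str, Dict[str, str]] = {
    "standard": {
        "ETCD_HEARTBEAT_INTERVAL": "100",
        "ETCD_ELECTION_TIMEOUT": "1000",
    },
    "slower": {
        "ETCD_HEARTBEAT_INTERVAL": "500",
        "ETCD_ELECTION_TIMEOUT": "2500",
    },
}


def env_for_profile(name: str) -> Dict[str, str]:
    """Return a copy of the env vars for a known profile, or raise ValueError."""
    try:
        return dict(PROFILES[name])
    except KeyError:
        known = ", ".join(sorted(PROFILES))
        raise ValueError(f"unknown etcd profile {name!r}; choose one of: {known}") from None


if __name__ == "__main__":
    for key, value in env_for_profile("standard").items():
        print(f"{key}={value}")
```

Because the admin picks a name rather than raw numbers, every reachable configuration is one that has already been checked for range and correlation.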