  1. OpenShift Container Platform (OCP) Strategy
  2. OCPSTRAT-342

[etcd-operator] etcd timers selectable profiles (TechPreview)

    • OCPSTRAT-16 OpenShift - Kubernetes and Core Platform
    • 0% To Do, 0% In Progress, 100% Done
    • Program Call

      Feature Overview  

      etcd-operator API to set ETCD_ELECTION_TIMEOUT and ETCD_HEARTBEAT_INTERVAL 

      OCP4 does not have a way to tune etcd parameters such as the election timeout and heartbeat interval. Adjusting these parameters indiscriminately may compromise the stability of the control plane. However, in scenarios where disk IOPS are not ideal (e.g. disk degradation, or storage providers in cloud environments), these parameters could be adjusted to improve the stability of the control plane, with the corresponding warning notifications raised.

      In the past:

      The current default values on a 4.10 deployment are:
      ```
      - name: ETCD_ELECTION_TIMEOUT
        value: "1000"
      - name: ETCD_ENABLE_PPROF
        value: "true"
      - name: ETCD_EXPERIMENTAL_MAX_LEARNERS
        value: "3"
      - name: ETCD_EXPERIMENTAL_WARNING_APPLY_DURATION
        value: 200ms
      - name: ETCD_EXPERIMENTAL_WATCH_PROGRESS_NOTIFY_INTERVAL
        value: 5s
      - name: ETCD_HEARTBEAT_INTERVAL
        value: "100"
      ```
      These defaults are overridden as exceptions for specific cloud providers (https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/etcdenvvar/etcd_env.go#L232-L254).

      The guidance for latency among control plane nodes does not translate well to live on-premises scenarios (https://access.redhat.com/articles/3220991).

      Goals (aka. expected user outcomes)

      Recognizing the need for exceptions, and to relieve the etcd-operator of maintaining special settings for different deployment modes, this Feature is for:

      • Defining an etcd-operator API that gives the cluster-admin the ability to set `ETCD_ELECTION_TIMEOUT` and `ETCD_HEARTBEAT_INTERVAL` within a certain range.

      Requirements (aka. Acceptance Criteria):

      • The feature should protect the cluster from having these parameters set to values outside good practices (https://etcd.io/docs/v3.5/tuning/)
      • Should enforce a consistent correlation between the ranges of the two values
      • Should enforce parameter ranges that work on a network with a maximum RTT of 150ms
      • Should log clear warnings when the parameters are modified (the warnings should be visible in must-gather)
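
      The requirements above can be sketched as a small validation routine. This is a minimal illustration, not the etcd-operator's implementation: the function name `validate_timers` and the exact thresholds are assumptions. The 150ms RTT ceiling comes from the requirements; the "election timeout at least 10x RTT" rule and the 50s upper limit follow the etcd tuning guide linked above; the "election timeout at least 5x heartbeat" ratio and the heartbeat floor of 0.5x RTT are assumed here to stand in for the "range correlation" requirement.

```python
from typing import List

# Assumed bounds for illustration only; not taken from the etcd-operator.
MAX_RTT_MS = 150                  # maximum RTT named in the requirements
MAX_ELECTION_TIMEOUT_MS = 50_000  # upper limit described in the etcd tuning guide
MIN_ELECTION_TO_RTT = 10          # election timeout >= 10x RTT (etcd tuning guide)
MIN_ELECTION_TO_HEARTBEAT = 5     # assumed correlation between the two timers
MIN_HEARTBEAT_TO_RTT = 0.5        # assumed lower edge of "around the RTT"


def validate_timers(heartbeat_ms: int, election_ms: int, rtt_ms: float) -> List[str]:
    """Return a list of human-readable problems; an empty list means the pair is acceptable."""
    problems: List[str] = []
    if not 0 < rtt_ms <= MAX_RTT_MS:
        problems.append(f"RTT {rtt_ms}ms is outside the supported range (0, {MAX_RTT_MS}]ms")
    if heartbeat_ms < MIN_HEARTBEAT_TO_RTT * rtt_ms:
        problems.append(f"heartbeat {heartbeat_ms}ms is too short for an RTT of {rtt_ms}ms")
    if election_ms < MIN_ELECTION_TO_RTT * rtt_ms:
        problems.append(f"election timeout {election_ms}ms is below {MIN_ELECTION_TO_RTT}x RTT")
    if election_ms < MIN_ELECTION_TO_HEARTBEAT * heartbeat_ms:
        problems.append(
            f"election timeout {election_ms}ms is below {MIN_ELECTION_TO_HEARTBEAT}x heartbeat"
        )
    if election_ms > MAX_ELECTION_TIMEOUT_MS:
        problems.append(f"election timeout {election_ms}ms exceeds {MAX_ELECTION_TIMEOUT_MS}ms")
    return problems


# The 4.10 defaults (heartbeat 100ms, election timeout 1000ms) checked against a 50ms RTT.
print(validate_timers(heartbeat_ms=100, election_ms=1000, rtt_ms=50))
```

      Under these assumed rules, the 4.10 defaults pass for a 50ms RTT but would be rejected at the 150ms ceiling, because the election timeout would fall below ten times the RTT. Returning every violation at once, rather than failing on the first, is a design choice that suits the "log clear warnings" requirement.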

      Out of Scope

      • Setting of any other etcd parameters

      Documentation Considerations

      The documentation should provide clear guidance on considerations when modifying the parameters and warnings of potential risks.

       

      Update June 2023:

      • The work will use the concept of validated selectable etcd profiles to deliver this functionality
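
      One way to picture "validated selectable profiles" is a fixed table of pre-approved timer pairs, from which the cluster-admin picks a name instead of entering raw values. The sketch below is hypothetical: the profile names, the `EtcdProfile` type, the `env_for_profile` helper, and the "high-latency" values are all illustrative assumptions. Only the "default" pair reflects the 4.10 defaults listed earlier in this issue.

```python
from typing import Dict, NamedTuple


class EtcdProfile(NamedTuple):
    """A pre-validated pair of etcd timers, in milliseconds."""
    heartbeat_interval_ms: int
    election_timeout_ms: int


# Hypothetical profile table. "default" mirrors the 4.10 defaults;
# "high-latency" uses made-up values purely for illustration.
PROFILES: Dict[str, EtcdProfile] = {
    "default": EtcdProfile(heartbeat_interval_ms=100, election_timeout_ms=1000),
    "high-latency": EtcdProfile(heartbeat_interval_ms=500, election_timeout_ms=2500),
}


def env_for_profile(name: str) -> Dict[str, str]:
    """Translate a profile name into the two etcd environment variables."""
    try:
        profile = PROFILES[name]
    except KeyError:
        raise ValueError(
            f"unknown etcd profile {name!r}; choose one of {sorted(PROFILES)}"
        ) from None
    return {
        "ETCD_HEARTBEAT_INTERVAL": str(profile.heartbeat_interval_ms),
        "ETCD_ELECTION_TIMEOUT": str(profile.election_timeout_ms),
    }


print(env_for_profile("default"))
```

      Because only named profiles are accepted, arbitrary values can never reach etcd, which addresses the "protect the cluster" requirement without exposing the raw parameters.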

            William Caban (wcabanba@redhat.com)
            Wei Sun
            Matthew Werner
            David Eads
            Eric Rich