OpenShift Top Level Product Strategy / OCPPLAN-7569

autotune etcd runtime based on prometheus observations


    • Type: Feature
    • Resolution: Unresolved
    • Priority: Critical
    • Component: Etcd

      In times of duress or increased workload, etcd's default runtime settings will result in leader elections and degraded cluster performance. Today we offer no day-2 knobs for adjusting these values. While the user is certainly capable of adjusting the etcd runtime themselves, the operator should provide a controller that performs these actions on behalf of the user. This reduces the knobs exposed to the user and uses actual metrics data to drive decisions.
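
      For concreteness, the metrics signal driving such decisions could be the p99 of etcd's WAL fsync latency. A minimal PromQL sketch, expressed as a Go constant; the 5m rate window is an illustrative assumption, not something this feature specifies:

```go
// PromQL: p99 WAL fsync latency aggregated across etcd members.
// The 5m rate window is an illustrative choice for this sketch.
const fsyncP99Query = `histogram_quantile(0.99,
  sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (le))`
```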

      design problems:

      • avoiding making the situation worse: one of the issues with this approach is that in order to change these values we must roll out a new revision of etcd. This process involves restarting etcd and generating a leader change. What we don't want is to make a cluster that is already under duress even worse. One approach would be to define a safe operating level: even if the controller concludes that the current etcd tuning is not adequate, we will not roll out a new revision until the cluster has recovered to a safe level. For example, cluster A receives a spike in workload, and the resulting load pushes the p99 of `etcd_disk_wal_fsync_duration_seconds_bucket` over 200ms for 6 hours. While this event theoretically should bump the etcd runtime values, we should consider not actually performing this action until the p99 returns to a reasonable level. The cluster could wait for this pressure to subside and then roll out the new revision (see the decision-loop sketch after this list). The issue here is the added complexity of crafting a query that allows this.
      • flapping: we do not want to tune up because of observations and then tune back down 2 hours later. One solution is to only allow tuning up: the cluster has observed that, in general, your current runtime cannot support the cluster workload. The resulting bump will mitigate the issue, but we must also provide the admin with details about the problem. For example, perhaps because of the new workload it would make sense to move to a larger instance type with more resources or a more performant disk. The admin should explore these options as the actual remedy.
      • tune levels: at most 3 levels should probably exist (see the hypothetical mapping after the tunables list below).
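
      The safe-level gating and tune-up-only behavior above could look roughly like the following sketch. The thresholds, reconcile interval, Prometheus address, and level cap are illustrative assumptions, not a committed design:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// Repeats the query from the earlier sketch so this block stands alone.
const fsyncP99Query = `histogram_quantile(0.99,
  sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (le))`

// Illustrative thresholds, not values from this feature: request a bump once
// p99 fsync latency exceeds 200ms, but only roll the new revision after the
// cluster has recovered below a "safe" 50ms level.
const (
	tuneUpThreshold = 200 * time.Millisecond
	safeThreshold   = 50 * time.Millisecond
	maxLevel        = 2 // levels 0..2, i.e. at most three levels
)

// queryP99 fetches the current p99 fsync latency from Prometheus.
func queryP99(ctx context.Context, papi promv1.API) (time.Duration, error) {
	res, _, err := papi.Query(ctx, fsyncP99Query, time.Now())
	if err != nil {
		return 0, err
	}
	vec, ok := res.(model.Vector)
	if !ok || len(vec) == 0 {
		return 0, fmt.Errorf("no samples for fsync p99")
	}
	return time.Duration(float64(vec[0].Value) * float64(time.Second)), nil
}

func main() {
	// The in-cluster Prometheus address is an assumption for this sketch.
	client, err := api.NewClient(api.Config{Address: "http://prometheus-k8s:9090"})
	if err != nil {
		panic(err)
	}
	papi := promv1.NewAPI(client)

	level, pendingBump := 0, false
	for range time.Tick(5 * time.Minute) { // reconcile interval is illustrative
		p99, err := queryP99(context.Background(), papi)
		if err != nil {
			continue // transient query failure; try again next tick
		}
		// Flapping guard: only ever tune up, never back down.
		if p99 > tuneUpThreshold && level < maxLevel {
			pendingBump = true
		}
		// Safety gate: defer the disruptive rollout until the cluster has
		// recovered, so a new revision never lands on a cluster under duress.
		if pendingBump && p99 < safeThreshold {
			level++
			pendingBump = false
			fmt.Printf("rolling new etcd revision at tune level %d\n", level)
			// A real controller would render the new ETCD_* values and
			// trigger a new etcd static-pod revision here.
		}
	}
}
```

      Deferring the rollout until the p99 drops back below `safeThreshold` is what keeps the controller from rolling a new revision into a cluster that is still under duress, while the tune-up-only level counter prevents flapping.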

      runtime tunables:

      • ETCD_ELECTION_TIMEOUT
      • ETCD_HEARTBEAT_INTERVAL

       

      bugs: 
      https://bugzilla.redhat.com/show_bug.cgi?id=1832261

              Assignee: Unassigned
              Reporter: David Hernandez Fernandez (rhn-support-dahernan)
              Votes: 11
              Watchers: 29
