OpenShift Etcd / ETCD-473

Make the etcd client retry parameters configurable for API server


    • Type: Story
    • Resolution: Done
    • Priority: Critical
    • BU Product Work
    • 5
    • OCPSTRAT-1243 - GA ETCD Tuning Profiles
    • ETCD Sprint 244, ETCD Sprint 245, ETCD Sprint 246, ETCD Sprint 247, ETCD Sprint 248

      See the following for background:
      https://issues.redhat.com/browse/OCPBUGS-18149

      The API server's etcd client needs more control over its retry configuration so that, during a period of expected unavailability (e.g. a leader election), the client can keep retrying for longer. This problem is currently more prevalent on clusters that use a slower etcd tuning profile (longer leader election and heartbeat timeouts).

      The hardcoded retry defaults in the etcd client:
      https://github.com/etcd-io/etcd/blob/main/client/v3/client.go#L243-L249
      https://github.com/etcd-io/etcd/blob/main/client/v3/options.go#L45-L53

      And how the API server configures the dial options for the etcd client:
      https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/storagebackend/factory/etcd3.go#L308-L317

      One potential approach would be for the upstream etcd client to switch to the gRPC RetryPolicy and to accept configurable options that set the policy when the client is constructed:
      https://github.com/etcd-io/etcd/blob/main/client/v3/client.go#L241-L242
      https://github.com/grpc/grpc-proto/blob/cdd9ed5c3d3f87aef62f373b93361cf7bddc620d/grpc/service_config/service_config.proto#L130


            alray@redhat.com Allen Ray
            rhn-coreos-htariq Haseeb Tariq