Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-18149

kube-apiserver etcd storage retry broken on slow platforms

XMLWordPrintable

    • No
    • Proposed
    • False
    • Hide

      None

      Show
      None

      "etcdserver: leader changed" causes clients to fail.

      This error should never bubble up to clients because the kube-apiserver can always retry this failure mode since it knows the data was not modified. When etcd adjusts timeouts for leader election and heartbeating for slow hardware like Azure, the hardcoded timeouts in the kube-apiserver/etcd fail. See

      1. kube-apiserver tries to use etcd retries: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/storagebackend/factory/etcd3.go#L308-L317
      2. etcd retries appear to be unconditionally added: https://github.com/etcd-io/etcd/blob/main/client/v3/client.go#L243-L249 and https://github.com/etcd-io/etcd/blob/release-3.5/client/v3/client.go#L286
      3. etcd retries retry a max of 2.5 seconds: https://github.com/etcd-io/etcd/blob/main/client/v3/options.go#L53 + https://github.com/etcd-io/etcd/blob/main/client/v3/options.go#L45
      4. etcd retries are further reduced by zero-second retry on quorum
      5. On azure https://github.com/openshift/cluster-etcd-operator/blob/d7d43ee21aff6b178b2104228bba374977777a84/pkg/etcdenvvar/etcd_env.go#L229 slower leader change reactions https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/hwspeedhelpers/hwhelper.go#L28 mean we are likely to exceed the number of retries for requests near the beginning of a change

      Simply saying, "oh, it's hardcoded and kube" isn't good enough. We have previously had a storage shim to retry such problems. If all else fails, bringing back the small shim to retry Unavailable etcd errors longer is appropriate to fix all available clients.

      Additionally, this etcd capability is being made more widely available and this bug prevents that from working.

            lszaszki@redhat.com Lukasz Szaszkiewicz
            deads@redhat.com David Eads
            Ke Wang Ke Wang
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

              Created:
              Updated:
              Resolved: