Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-18149

kube-apiserver etcd storage retry broken on slow platforms

    XMLWordPrintable

Details

    • No
    • Proposed
    • False
    • Hide

      None

      Show
      None

    Description

      "etcdserver: leader changed" causes clients to fail.

      This error should never bubble up to clients because the kube-apiserver can always retry this failure mode since it knows the data was not modified. When etcd adjusts timeouts for leader election and heartbeating for slow hardware like Azure, the hardcoded timeouts in the kube-apiserver/etcd fail. See

      1. kube-apiserver tries to use etcd retries: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/storagebackend/factory/etcd3.go#L308-L317
      2. etcd retries appear to be unconditionally added: https://github.com/etcd-io/etcd/blob/main/client/v3/client.go#L243-L249 and https://github.com/etcd-io/etcd/blob/release-3.5/client/v3/client.go#L286
      3. etcd retries retry a max of 2.5 seconds: https://github.com/etcd-io/etcd/blob/main/client/v3/options.go#L53 + https://github.com/etcd-io/etcd/blob/main/client/v3/options.go#L45
      4. etcd retries are further reduced by zero-second retry on quorum
      5. On azure https://github.com/openshift/cluster-etcd-operator/blob/d7d43ee21aff6b178b2104228bba374977777a84/pkg/etcdenvvar/etcd_env.go#L229 slower leader change reactions https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/hwspeedhelpers/hwhelper.go#L28 mean we are likely to exceed the number of retries for requests near the beginning of a change

      Simply saying, "oh, it's hardcoded and kube" isn't good enough. We have previously had a storage shim to retry such problems. If all else fails, bringing back the small shim to retry Unavailable etcd errors longer is appropriate to fix all available clients.

      Additionally, this etcd capability is being made more widely available and this bug prevents that from working.

      Attachments

        Activity

          People

            lszaszki@redhat.com Lukasz Szaszkiewicz
            deads@redhat.com David Eads
            Ke Wang Ke Wang
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: