"etcdserver: leader changed" causes clients to fail.
This error should never bubble up to clients: the kube-apiserver can always retry this failure mode, since it knows the data was not modified. However, when etcd lengthens its leader-election and heartbeat timeouts for slow hardware such as Azure, the hardcoded retry limits in the kube-apiserver/etcd client are exceeded. See:
- kube-apiserver tries to use etcd retries: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/storagebackend/factory/etcd3.go#L308-L317
- etcd retries appear to be unconditionally added: https://github.com/etcd-io/etcd/blob/main/client/v3/client.go#L243-L249 and https://github.com/etcd-io/etcd/blob/release-3.5/client/v3/client.go#L286
- etcd retries retry a max of 2.5 seconds: https://github.com/etcd-io/etcd/blob/main/client/v3/options.go#L53 + https://github.com/etcd-io/etcd/blob/main/client/v3/options.go#L45
- etcd retries are further reduced by the zero-second retry applied to quorum requests
- On Azure (https://github.com/openshift/cluster-etcd-operator/blob/d7d43ee21aff6b178b2104228bba374977777a84/pkg/etcdenvvar/etcd_env.go#L229), the slower leader-change reactions (https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/hwspeedhelpers/hwhelper.go#L28) mean we are likely to exceed the retry budget for requests issued near the beginning of a leader change
Simply saying "oh, it's hardcoded in kube" isn't good enough. We have previously carried a storage shim to retry exactly this class of problem. If all else fails, bringing back that small shim to retry Unavailable etcd errors for longer is an appropriate fix for all affected clients.
Additionally, this etcd capability is being made more widely available, and this bug prevents that from working.