-
Story
-
Resolution: Done
-
Critical
-
BU Product Work
-
5
-
OCPSTRAT-1243 - GA ETCD Tuning Profiles
-
ETCD Sprint 244, ETCD Sprint 245, ETCD Sprint 246, ETCD Sprint 247, ETCD Sprint 248
See the following for background:
https://issues.redhat.com/browse/OCPBUGS-18149
The API server's etcd client needs more control over its retry configuration so that, during periods of expected unavailability (e.g. leader elections), it can keep retrying for longer. The problem is currently more prevalent on clusters that use a slower etcd tuning profile (longer leader election and heartbeat timeouts).
The hardcoded retry defaults in the etcd client:
https://github.com/etcd-io/etcd/blob/main/client/v3/client.go#L243-L249
https://github.com/etcd-io/etcd/blob/main/client/v3/options.go#L45-L53
The API server configures the etcd client's dial options here:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/storagebackend/factory/etcd3.go#L308-L317
One potential approach is for the upstream etcd client to switch to the gRPC RetryPolicy and accept configurable options that set the policy when the client is constructed:
https://github.com/etcd-io/etcd/blob/main/client/v3/client.go#L241-L242
https://github.com/grpc/grpc-proto/blob/cdd9ed5c3d3f87aef62f373b93361cf7bddc620d/grpc/service_config/service_config.proto#L130
- blocks
-
ETCD-488 Move etcd tuning profiles API from feature gates to GA
- Closed