Description of problem:
For certain operations the CEO checks etcd member health by creating a client directly and waiting for its status report. When a member is unreachable for a longer period, we found the CEO getting stuck / deadlocked and unable to move certain controllers forward. In OCPBUGS-12475 we introduced a health check that dumps the stack and automatically restarts the operator via the deployment health probe. In a more recent upgrade run [1] we found the culprit to be a missing context during etcd client initialization, which blocks indefinitely:

W0229 02:55:46.820529 1 aliveness_checker.go:33] Controller [EtcdEndpointsController] didn't sync for a long time, declaring unhealthy and dumping stack

goroutine 1426 [select]:
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth({0x3272768?, 0xc002090310}, {0xc0000a6880, 0x3, 0xc001c98360?})
        github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:64 +0x330
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.(*etcdClientGetter).MemberHealth(0xc000c24540, {0x3272688, 0x4c20080})
        github.com/openshift/cluster-etcd-operator/pkg/etcdcli/etcdcli.go:412 +0x18c
github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers.CheckSafeToScaleCluster({0x324ccd0?, 0xc000b6d5f0?}, {0x3284250?, 0xc0008dda10?}, {0x324e6c0, 0xc000ed4fb0}, {0x3250560, 0xc000ed4fd0}, {0x32908d0, 0xc000c24540})
        github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers/bootstrap.go:149 +0x28e
github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers.(*QuorumCheck).IsSafeToUpdateRevision(0x2893020?)
        github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers/qourum_check.go:37 +0x46
github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller.(*EtcdEndpointsController).syncConfigMap(0xc0002e28c0, {0x32726f8, 0xc0008e60a0}, {0x32801b0, 0xc001198540})
        github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go:146 +0x5d8
github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller.(*EtcdEndpointsController).sync(0xc0002e28c0, {0x32726f8, 0xc0008e60a0}, {0x325d240, 0xc003569e90})
        github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go:66 +0x71
github.com/openshift/cluster-etcd-operator/pkg/operator/health.(*CheckingSyncWrapper).Sync(0xc000f21bc0, {0x32726f8?, 0xc0008e60a0?}, {0x325d240?, 0xc003569e90?})
        github.com/openshift/cluster-etcd-operator/pkg/operator/health/checking_sync_wrapper.go:22 +0x43
github.com/openshift/library-go/pkg/controller/factory.(*baseController).reconcile(0xc00113cd80, {0x32726f8, 0xc0008e60a0}, {0x325d240?, 0xc003569e90?})
        github.com/openshift/library-go@v0.0.0-20240124134907-4dfbf6bc7b11/pkg/controller/factory/base_controller.go:201 +0x43

goroutine 11640 [select]:
google.golang.org/grpc.(*ClientConn).WaitForStateChange(0xc003707000, {0x3272768, 0xc002091260}, 0x3)
        google.golang.org/grpc@v1.58.3/clientconn.go:724 +0xb1
google.golang.org/grpc.DialContext({0x3272768, 0xc002091260}, {0xc003753740, 0x3c}, {0xc00355a880, 0x7, 0xc0023aa360?})
        google.golang.org/grpc@v1.58.3/clientconn.go:295 +0x128e
go.etcd.io/etcd/client/v3.(*Client).dial(0xc000895180, {0x32754a0?, 0xc001785670?}, {0xc0017856b0?, 0x28f6a80?, 0x28?})
        go.etcd.io/etcd/client/v3@v3.5.10/client.go:303 +0x407
go.etcd.io/etcd/client/v3.(*Client).dialWithBalancer(0xc000895180, {0x0, 0x0, 0x0})
        go.etcd.io/etcd/client/v3@v3.5.10/client.go:281 +0x1a9
go.etcd.io/etcd/client/v3.newClient(0xc002484e70?)
        go.etcd.io/etcd/client/v3@v3.5.10/client.go:414 +0x91c
go.etcd.io/etcd/client/v3.New(...)
        go.etcd.io/etcd/client/v3@v3.5.10/client.go:81
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.newEtcdClientWithClientOpts({0xc0017853d0, 0x1, 0x1}, 0x0, {0x0, 0x0, 0x0?})
        github.com/openshift/cluster-etcd-operator/pkg/etcdcli/etcdcli.go:127 +0x77d
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.checkSingleMemberHealth({0x32726f8, 0xc00318ac30}, 0xc002090460)
        github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:103 +0xc5
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth.func1()
        github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:58 +0x6c
created by github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth in goroutine 1426
        github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:54 +0x2a5

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1762965898773139456/
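For illustration, a minimal standalone sketch (not the operator's code) of the pattern visible in goroutine 11640: a blocking gRPC dial with no deadline never returns while the member stays unreachable. The endpoint address is a placeholder.

package main

import (
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
	"google.golang.org/grpc"
)

func main() {
	// Placeholder endpoint standing in for a member whose etcd ports are blocked.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://192.0.2.10:2379"},
		DialOptions: []grpc.DialOption{grpc.WithBlock()}, // blocking dial, as in the trace
		// No DialTimeout and no Context: New blocks inside grpc.DialContext /
		// WaitForStateChange until the endpoint becomes reachable.
	})
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer cli.Close()
	log.Println("connected") // never reached while the member stays down
}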
Version-Release number of selected component (if applicable):
any currently supported OCP version
How reproducible:
Always
Steps to Reproduce:
1. Create a healthy cluster.
2. Make sure one etcd member never responds while its node is still present (e.g. shut down the kubelet or block the etcd ports with a firewall).
3. Wait for the CEO to restart its pod on the failing health probe and dump its stack (similar to the one above).
Actual results:
CEO controllers get deadlocked; the operator eventually restarts after some time because its health probes fail.
Expected results:
The CEO should mark the member as unhealthy and continue operating without deadlocking, and it should not restart its pod by failing the health probe.
Additional info:
clientv3.New doesn't take any timeout context, but tries to establish a connection forever: https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/etcdcli/etcdcli.go#L127-L130. There is a way to pass a "default context" via the client config, which is slightly misleading (it governs the whole client, not just the dial); see the sketch below.
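A hedged sketch of how the construction can be bounded using the existing clientv3.Config fields (DialTimeout and Context). The helper name, endpoint parameter, and timeout values are illustrative, not the fix that actually shipped. Note that Config.Context is the client's default context rather than a dial-only context, which is the misleading part: cancelling it tears down the whole client, not just the dial.

package etcdhealth // hypothetical package name, for illustration only

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"google.golang.org/grpc"
)

// newBoundedEtcdClient is a hypothetical helper: it keeps the blocking dial
// but bounds it so an unreachable member fails fast instead of hanging forever.
func newBoundedEtcdClient(parent context.Context, endpoint string) (*clientv3.Client, error) {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialOptions: []grpc.DialOption{grpc.WithBlock()}, // keep the blocking dial...
		DialTimeout: 15 * time.Second,                    // ...but bound it against a downed member
		Context:     parent,                              // default client context; cancelling it cancels the client
	})
	if err != nil {
		// The caller can now report the member as unhealthy instead of wedging the controller.
		return nil, fmt.Errorf("etcd member %s unreachable: %w", endpoint, err)
	}
	return cli, nil
}

With DialTimeout set, clientv3.New returns an error once the blocking dial exceeds the deadline, so a health check like checkSingleMemberHealth can mark the member unhealthy and the controller sync can proceed.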
blocks:
- OCPBUGS-30300 CEO deadlocks on health checking a downed member (Closed)

is cloned by:
- OCPBUGS-30300 CEO deadlocks on health checking a downed member (Closed)
- OCPBUGS-30873 CEO aliveness check should only detect deadlocks (Closed)

links to:
- RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update