OpenShift Bugs / OCPBUGS-30169

CEO deadlocks on health checking a downed member


    • Type: Bug
    • Resolution: Won't Do
    • Priority: Major
    • Target Version: 4.16.0
    • Affects Versions: 4.13.z, 4.12.z, 4.14.z, 4.15.z, 4.16.0
    • Component: Etcd
    • Severity: Important

      Description of problem:

      For certain operations the CEO (cluster-etcd-operator) checks the etcd member health by creating a client directly and waiting for its status report.
      
      When any member is unreachable for a longer period, we found that the CEO constantly gets stuck/deadlocked and cannot move certain controllers forward.
      
      In OCPBUGS-12475 we introduced a health check that dumps the goroutine stacks and automatically restarts the operator via the deployment's health probe when a controller has not synced for too long.
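
      The mechanism behind that health check is, roughly, that every controller sync records a timestamp, and the operator's liveness probe fails, after dumping all goroutine stacks, once any controller has been stale for too long. A minimal illustrative sketch of the idea (the type and method names here are made up, not the CEO's actual code):

      package health

      import (
          "runtime"
          "sync"
          "time"

          "k8s.io/klog/v2"
      )

      // AlivenessChecker is an illustrative sketch: controllers report each
      // successful sync, and the liveness probe declares the operator
      // unhealthy once any controller has been stale for too long.
      type AlivenessChecker struct {
          mu       sync.Mutex
          lastSync map[string]time.Time
      }

      // SyncFinished is called by a controller wrapper after every successful sync.
      func (a *AlivenessChecker) SyncFinished(controller string) {
          a.mu.Lock()
          defer a.mu.Unlock()
          if a.lastSync == nil {
              a.lastSync = map[string]time.Time{}
          }
          a.lastSync[controller] = time.Now()
      }

      // Alive backs the deployment's health probe.
      func (a *AlivenessChecker) Alive(maxStaleness time.Duration) bool {
          a.mu.Lock()
          defer a.mu.Unlock()
          for name, last := range a.lastSync {
              if time.Since(last) > maxStaleness {
                  // dump all goroutine stacks, then fail the probe so the pod restarts
                  buf := make([]byte, 1<<20)
                  n := runtime.Stack(buf, true)
                  klog.Warningf("Controller [%s] didn't sync for a long time, declaring unhealthy and dumping stack\n%s", name, buf[:n])
                  return false
              }
          }
          return true
      }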
      
      In a more recent upgrade run [1] we found the culprit to be a missing context during etcd client initialization, which makes it block indefinitely:
      
      
      W0229 02:55:46.820529       1 aliveness_checker.go:33] Controller [EtcdEndpointsController] didn't sync for a long time, declaring unhealthy and dumping stack
      
      goroutine 1426 [select]:
      github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth({0x3272768?, 0xc002090310}, {0xc0000a6880, 0x3, 0xc001c98360?})
      	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:64 +0x330
      github.com/openshift/cluster-etcd-operator/pkg/etcdcli.(*etcdClientGetter).MemberHealth(0xc000c24540, {0x3272688, 0x4c20080})
      	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/etcdcli.go:412 +0x18c
      github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers.CheckSafeToScaleCluster({0x324ccd0?, 0xc000b6d5f0?}, {0x3284250?, 0xc0008dda10?}, {0x324e6c0, 0xc000ed4fb0}, {0x3250560, 0xc000ed4fd0}, {0x32908d0, 0xc000c24540})
      	github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers/bootstrap.go:149 +0x28e
      github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers.(*QuorumCheck).IsSafeToUpdateRevision(0x2893020?)
      	github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers/qourum_check.go:37 +0x46
      github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller.(*EtcdEndpointsController).syncConfigMap(0xc0002e28c0, {0x32726f8, 0xc0008e60a0}, {0x32801b0, 0xc001198540})
      	github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go:146 +0x5d8
      github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller.(*EtcdEndpointsController).sync(0xc0002e28c0, {0x32726f8, 0xc0008e60a0}, {0x325d240, 0xc003569e90})
      	github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go:66 +0x71
      github.com/openshift/cluster-etcd-operator/pkg/operator/health.(*CheckingSyncWrapper).Sync(0xc000f21bc0, {0x32726f8?, 0xc0008e60a0?}, {0x325d240?, 0xc003569e90?})
      	github.com/openshift/cluster-etcd-operator/pkg/operator/health/checking_sync_wrapper.go:22 +0x43
      github.com/openshift/library-go/pkg/controller/factory.(*baseController).reconcile(0xc00113cd80, {0x32726f8, 0xc0008e60a0}, {0x325d240?, 0xc003569e90?})
      	github.com/openshift/library-go@v0.0.0-20240124134907-4dfbf6bc7b11/pkg/controller/factory/base_controller.go:201 +0x43
      
      
      
      goroutine 11640 [select]:
      google.golang.org/grpc.(*ClientConn).WaitForStateChange(0xc003707000, {0x3272768, 0xc002091260}, 0x3)
      	google.golang.org/grpc@v1.58.3/clientconn.go:724 +0xb1
      google.golang.org/grpc.DialContext({0x3272768, 0xc002091260}, {0xc003753740, 0x3c}, {0xc00355a880, 0x7, 0xc0023aa360?})
      	google.golang.org/grpc@v1.58.3/clientconn.go:295 +0x128e
      go.etcd.io/etcd/client/v3.(*Client).dial(0xc000895180, {0x32754a0?, 0xc001785670?}, {0xc0017856b0?, 0x28f6a80?, 0x28?})
      	go.etcd.io/etcd/client/v3@v3.5.10/client.go:303 +0x407
      go.etcd.io/etcd/client/v3.(*Client).dialWithBalancer(0xc000895180, {0x0, 0x0, 0x0})
      	go.etcd.io/etcd/client/v3@v3.5.10/client.go:281 +0x1a9
      go.etcd.io/etcd/client/v3.newClient(0xc002484e70?)
      	go.etcd.io/etcd/client/v3@v3.5.10/client.go:414 +0x91c
      go.etcd.io/etcd/client/v3.New(...)
      	go.etcd.io/etcd/client/v3@v3.5.10/client.go:81
      github.com/openshift/cluster-etcd-operator/pkg/etcdcli.newEtcdClientWithClientOpts({0xc0017853d0, 0x1, 0x1}, 0x0, {0x0, 0x0, 0x0?})
      	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/etcdcli.go:127 +0x77d
      github.com/openshift/cluster-etcd-operator/pkg/etcdcli.checkSingleMemberHealth({0x32726f8, 0xc00318ac30}, 0xc002090460)
      	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:103 +0xc5
      github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth.func1()
      	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:58 +0x6c
      created by github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth in goroutine 1426
      	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:54 +0x2a5
      
        
      
      [1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1762965898773139456/
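
      The two goroutines above show the shape of the hang: getMemberHealth fans out one goroutine per member and waits in a select for the results, while the worker goroutine sits forever in the unbounded client dial, so the result it is waiting for never arrives. As a minimal sketch, assuming per-member checks run in goroutines as the trace shows (memberHealth and checkMember are illustrative names, not the CEO's API), bounding the wait looks roughly like this:

      package etcdcli

      import (
          "context"
          "fmt"
          "time"
      )

      // checkMember stands in for a single-member probe (checkSingleMemberHealth
      // in the CEO), which dials the member and asks for its status.
      type checkMember func(ctx context.Context, endpoint string) error

      // memberHealth bounds the probe with a deadline so an unreachable member is
      // reported as unhealthy instead of blocking the calling controller forever.
      func memberHealth(parent context.Context, endpoint string, check checkMember) error {
          ctx, cancel := context.WithTimeout(parent, 30*time.Second)
          defer cancel()

          done := make(chan error, 1)
          go func() { done <- check(ctx, endpoint) }()

          select {
          case err := <-done:
              return err
          case <-ctx.Done():
              // note: if check ignores ctx (e.g. an unbounded dial as in the trace
              // above), its goroutine still leaks - the dial itself must honour a
              // timeout too, which is what "Additional info" below is about.
              return fmt.Errorf("member %s did not answer in time: %w", endpoint, ctx.Err())
          }
      }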

      Version-Release number of selected component (if applicable):

      Any currently supported OCP version

      How reproducible:

      Always    

      Steps to Reproduce:

          1. Create a healthy cluster.
          2. Make sure one etcd member never responds while its node is still present (e.g. shut down the kubelet, or block the etcd ports with a firewall).
          3. Wait for the CEO to dump its stack and restart its pod on a failing health probe (similar to the trace above).
          

      Actual results:

      CEO controllers get deadlocked; the operator eventually restarts after some time because its health probes fail.

      Expected results:

      The CEO should mark the member as unhealthy and continue its service without getting deadlocked, and it should not restart its pod by failing the health probe.

      Additional info:

      clientv3.New does not take a timeout context as an argument, so it can keep trying to establish a connection forever:
      
      https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/etcdcli/etcdcli.go#L127-L130
      
      There is a way to pass a "default context" via the client config (clientv3.Config.Context), although the name is slightly misleading.
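
      A minimal sketch of a bounded one-shot status check, assuming the go.etcd.io/etcd/client/v3 API at v3.5 (the memberStatus helper, endpoint and timeout values are illustrative, not the actual CEO fix): Config.Context is that "default context" and also caps the initial dial, while DialTimeout limits a blocking connection attempt.

      package etcdcli

      import (
          "context"
          "fmt"
          "time"

          clientv3 "go.etcd.io/etcd/client/v3"
      )

      // memberStatus performs a one-shot status check against a single member,
      // with everything, including the initial dial, bounded by a timeout.
      func memberStatus(endpoint string) error {
          ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
          defer cancel()

          cli, err := clientv3.New(clientv3.Config{
              Endpoints:   []string{endpoint},
              Context:     ctx,              // the "default context"; it also bounds the dial
              DialTimeout: 15 * time.Second, // cap a blocking connection attempt
          })
          if err != nil {
              return fmt.Errorf("member %s unreachable: %w", endpoint, err)
          }
          defer cli.Close()

          // the status request is bounded by ctx as well, so an unreachable
          // member surfaces as an error instead of a hang
          _, err = cli.Status(ctx, endpoint)
          return err
      }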
      
      

            Assignee: Thomas Jungblut (tjungblu@redhat.com)
            Reporter: Thomas Jungblut (tjungblu@redhat.com)
            QA Contact: ge liu
            Votes: 0
            Watchers: 7