  1. OpenShift Bugs
  2. OCPBUGS-62698

[4.18] etcdmemberscontroller health check declares all members unhealthy


    • Bug
    • Resolution: Unresolved
    • Undefined
    • 4.16, 4.17, 4.18, 4.19, 4.20
    • Etcd
    • Quality / Stability / Reliability
    • False
    • In Progress
    • Bug Fix
    • An etcd member that fails to respond within 30s is declared unhealthy by the cluster-etcd-operator. Before this fix, all other healthy etcd members were also declared unhealthy because the health checks shared a single timeout context.

      This is a clone of issue OCPBUGS-61019. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-60941. The following is the description of the original issue:

      Description of problem:

      When the health check times out while reaching one member, the deadline of the entire context is exceeded and all members are declared unhealthy, even though the other members are potentially reachable:
      
      [etcd-operator-5d5946d6c-gjxp9] E0827 10:17:04.841588       1 health.go:115] health check for member (tjungblu15-dq6nb-master-1) failed: err(context deadline exceeded)
      [etcd-operator-5d5946d6c-gjxp9] E0827 10:17:04.842259       1 etcdmemberscontroller.go:81] Unhealthy etcd member found: tjungblu15-dq6nb-master-2, took=29.996749436s, err=health check failed: context deadline exceeded
      [etcd-operator-5d5946d6c-gjxp9] E0827 10:17:04.842496       1 etcdmemberscontroller.go:81] Unhealthy etcd member found: tjungblu15-dq6nb-master-0, took=21.365µs, err=health check failed: context deadline exceeded
      [etcd-operator-5d5946d6c-gjxp9] E0827 10:17:04.842605       1 etcdmemberscontroller.go:81] Unhealthy etcd member found: tjungblu15-dq6nb-master-1, took=33.136µs, err=health check failed: context deadline exceeded
      
          

      Version-Release number of selected component (if applicable):

      any supported version    

      How reproducible:

      always    

      Steps to Reproduce:

          1. Make an etcd member unresponsive (e.g. by running a defrag on a large database).
          2. Wait for the cluster-etcd-operator (CEO) health check to time out against that etcd member.
          3. Observe that the operator status flags all three members as unhealthy.
          

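      Actual results:

          All three etcd members are declared unhealthy once the health check against a single member times out.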

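      Expected results:

          Only the member whose health check timed out is declared unhealthy; the reachable members remain healthy.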


              dwest@redhat.com Dean West
              tjungblu@redhat.com Thomas Jungblut
              Sandeep Kundu