  OpenShift Bugs / OCPBUGS-30873

CEO aliveness check should only detect deadlocks


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Major
    • 4.16.0
    • 4.14.z, 4.15.z, 4.16.0
    • Etcd
    • None
    • Important
    • No
    • Rejected
    • False
    • Previously, the etcd Cluster Operator wrongly identified non-running controllers as deadlocked, which caused an unnecessary pod restart. With this release, the Operator marks a non-running controller as an unhealthy etcd member without restarting a pod. (link:https://issues.redhat.com/browse/OCPBUGS-30873[*OCPBUGS-30873*])
    • Bug Fix
    • Done

      Description of problem:

      From a test run in [1] we can't be sure whether the call to etcd was really deadlocked or just waiting for a result.
      
      Currently, the CheckingSyncWrapper only defines "alive" as a sync func whose last run did not return an error. That definition is wrong when a member is down and perpetually unreachable: the sync loop keeps returning errors even though it is not stuck.
      Instead, we want to detect genuine deadlocks, i.e. a sync loop that has stopped returning for a prolonged period of time (a sketch of this follows below).
      
      [1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1762965898773139456/
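
      A minimal sketch (Go) of the intended behaviour, assuming a wrapper similar to the CheckingSyncWrapper: "alive" means "no single sync call has been stuck for longer than a deadlock timeout", independent of whether syncs return errors. The names (SyncFunc, deadlockTimeout, Alive) and the in-flight bookkeeping are illustrative, not the actual cluster-etcd-operator code, and it assumes sync calls are serialized per controller.

      package health

      import (
          "context"
          "sync/atomic"
          "time"
      )

      // SyncFunc is a simplified controller sync signature for this sketch.
      type SyncFunc func(ctx context.Context) error

      // CheckingSyncWrapper tracks how long the current sync call has been
      // in flight. A sync that keeps returning errors (e.g. because one etcd
      // member is unreachable) is still "alive"; only a sync that does not
      // return for longer than deadlockTimeout is treated as deadlocked.
      type CheckingSyncWrapper struct {
          syncFn          SyncFunc
          deadlockTimeout time.Duration
          inFlightSince   int64 // unix nanos of the current sync start; 0 = no sync running
      }

      func NewCheckingSyncWrapper(syncFn SyncFunc, deadlockTimeout time.Duration) *CheckingSyncWrapper {
          return &CheckingSyncWrapper{syncFn: syncFn, deadlockTimeout: deadlockTimeout}
      }

      // Sync stamps the start time, delegates to the wrapped sync func and
      // clears the in-flight marker when the call returns, error or not.
      func (c *CheckingSyncWrapper) Sync(ctx context.Context) error {
          atomic.StoreInt64(&c.inFlightSince, time.Now().UnixNano())
          defer atomic.StoreInt64(&c.inFlightSince, 0)
          return c.syncFn(ctx)
      }

      // Alive reports false only while a sync call has been stuck for longer
      // than deadlockTimeout; idle or erroring controllers stay alive.
      func (c *CheckingSyncWrapper) Alive() bool {
          started := atomic.LoadInt64(&c.inFlightSince)
          if started == 0 {
              return true
          }
          return time.Since(time.Unix(0, started)) <= c.deadlockTimeout
      }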

      Version-Release number of selected component (if applicable):

      >4.14

      How reproducible:

      Always

      Steps to Reproduce:

          1. create a healthy cluster
          2. make sure one etcd member never responds while its node stays up (e.g. shut down the kubelet or block the etcd ports with a firewall)
          3. wait for the CEO to restart its pod on a failing health probe and dump its stack
          

      Actual results:

      CEO controllers keep returning errors but are not necessarily deadlocked, yet this currently results in a pod restart.

      Expected results:

      The CEO should mark the member as unhealthy and continue serving; it should not fail its health probe and restart its pod unless a controller is actually deadlocked.
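
      For illustration, a hedged sketch (same assumptions as above) of how the pod's liveness endpoint could aggregate the per-controller Alive() checks so that sync errors alone never fail the probe; the MultiAlivenessChecker name and the HTTP wiring are assumptions, not the actual operator code.

      package health

      import (
          "net/http"
          "sync"
      )

      // AliveChecker is satisfied by the CheckingSyncWrapper sketched above.
      type AliveChecker interface {
          Alive() bool
      }

      // MultiAlivenessChecker aggregates every wrapped controller for the
      // liveness probe (illustrative name and wiring).
      type MultiAlivenessChecker struct {
          mu       sync.Mutex
          checkers map[string]AliveChecker
      }

      func (m *MultiAlivenessChecker) Add(name string, c AliveChecker) {
          m.mu.Lock()
          defer m.mu.Unlock()
          if m.checkers == nil {
              m.checkers = map[string]AliveChecker{}
          }
          m.checkers[name] = c
      }

      // ServeHTTP fails the probe (HTTP 500) only when at least one controller
      // appears deadlocked; controllers that merely report an unhealthy etcd
      // member keep returning 200, so the kubelet does not restart the pod.
      func (m *MultiAlivenessChecker) ServeHTTP(w http.ResponseWriter, r *http.Request) {
          m.mu.Lock()
          defer m.mu.Unlock()
          for name, c := range m.checkers {
              if !c.Alive() {
                  http.Error(w, "controller "+name+" appears deadlocked", http.StatusInternalServerError)
                  return
              }
          }
          w.WriteHeader(http.StatusOK)
      }

      Wired into the pod spec as an httpGet liveness probe, this keeps a member outage visible as unhealthy-member status rather than triggering an operator pod restart.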

      Additional info:

      Highly related to OCPBUGS-30169.

              dwest@redhat.com Dean West
              tjungblu@redhat.com Thomas Jungblut
              Ge Liu