  1. OpenShift Bugs
  2. OCPBUGS-30915

[4.15] CEO aliveness check should only detect deadlocks


    • Bug
    • Resolution: Done-Errata
    • Major
    • 4.15.z
    • 4.14.z, 4.15.z, 4.16.0
    • Etcd
    • None
    • Important
    • No
    • Proposed
    • False



      This is a clone of issue OCPBUGS-30873. The following is the description of the original issue:

      Description of problem:

      From the test run in [1], we can't tell whether the call to etcd was really deadlocked or just waiting for a result.
      Currently, the CheckingSyncWrapper treats a controller as "alive" only if its sync func has not returned an error. This is wrong in scenarios where a member is down and perpetually unreachable: the sync keeps returning errors even though the loop itself is not stuck.
      Instead, we want to detect deadlock situations, where the sync loop is stuck for a prolonged period of time.
      [1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1762965898773139456/

      Version-Release number of selected component (if applicable):


      How reproducible:


      Steps to Reproduce:

          1. create a healthy cluster
          2. make sure one etcd member never responds while its node is still present (e.g. shut down the kubelet, or block the etcd ports on a firewall)
          3. wait for the CEO to restart its pod on the failing health probe and dump its stack

      Actual results:

      CEO controllers return errors but are not necessarily deadlocked; the current check still fails the health probe, which results in a restart

      Expected results:

      CEO should mark the member as unhealthy and continue its service without getting deadlocked. It should not restart its pod by failing the health probe.

      Additional info:

      Highly related to OCPBUGS-30169

            dwest@redhat.com Dean West
            openshift-crt-jira-prow OpenShift Prow Bot
            ge liu ge liu