OpenShift Bugs / OCPBUGS-31428

[4.14] CEO aliveness check should only detect deadlocks


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Major
    • 4.14.z
    • 4.14.z, 4.15.z, 4.16.0
    • Etcd

      This is a clone of issue OCPBUGS-30915. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-30873. The following is the description of the original issue:

      Description of problem:

      From the test run in [1], we cannot tell whether the call to etcd was really deadlocked or just waiting for a result.

      Currently the CheckingSyncWrapper defines "alive" only as a sync func that has not returned an error. This is wrong in scenarios where a member is down and perpetually unreachable: the sync func keeps returning errors even though nothing is stuck.
      Instead, we want to detect deadlock situations, where the sync loop is stuck for a prolonged period of time.
      
      [1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1762965898773139456/

      Version-Release number of selected component (if applicable):

      >4.14

      How reproducible:

      Always

      Steps to Reproduce:

          1. Create a healthy cluster.
          2. Make sure one etcd member never responds while its node is still present (e.g. shut down the kubelet, or block the etcd ports on a firewall).
          3. Wait for the CEO pod to restart on a failing health probe and dump its stack.

      Actual results:

      CEO controllers return errors but are not necessarily deadlocked. The current check still treats this as not alive, which results in a restart.

      Expected results:

      The CEO should mark the member as unhealthy and continue its service without getting deadlocked. It should not fail the health probe and have its pod restarted.

      Additional info:

      Closely related to OCPBUGS-30169.

            dwest@redhat.com Dean West
            openshift-crt-jira-prow OpenShift Prow Bot
            Ge Liu