  1. OpenShift Bugs
  2. OCPBUGS-30915

[4.15] CEO aliveness check should only detect deadlocks


Details

    • Bug
    • Resolution: Done-Errata
    • Major
    • 4.15.z
    • 4.14.z, 4.15.z, 4.16.0
    • Etcd
    • None
    • Important
    • No
    • Proposed
    • False

    Description

      This is a clone of issue OCPBUGS-30873. The following is the description of the original issue:

      Description of problem:

      From the test run in [1], we can't be sure whether the call to etcd was really deadlocked or just waiting for a result.

      Currently, the CheckingSyncWrapper defines "alive" only as a sync func that has not returned an error. This is wrong in scenarios where a member is down and perpetually unreachable: the sync func keeps returning errors even though the sync loop itself is not stuck.
      Instead, we want to detect deadlock situations where the sync loop is stuck for a prolonged period of time.
      
      [1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1762965898773139456/

      Version-Release number of selected component (if applicable):

      >4.14

      How reproducible:

      Always

      Steps to Reproduce:

          1. Create a healthy cluster.
          2. Make sure one etcd member never responds while its node is still present (e.g. by shutting down the kubelet or blocking the etcd ports on a firewall).
          3. Wait for the CEO pod to be restarted because of the failing health probe, and dump its stack.


      Actual results:

      CEO controllers return errors but are not necessarily deadlocked; this currently still results in a restart.

      Expected results:

      CEO should mark the member as unhealthy and continue its service without getting deadlocked. It should not fail the health probe, so its pod should not be restarted.

      Additional info:

      Highly related to OCPBUGS-30169.


            People

              dwest@redhat.com Dean West
              openshift-crt-jira-prow OpenShift Prow Bot
              ge liu ge liu