Type: Bug
Status: Done
Resolution: Done-Errata
Priority: Major
Severity: Important
Affects Versions: 4.14.z, 4.15.z, 4.16.0
Release Note Type: Bug Fix
Description of problem:
From the test run in [1] we cannot be sure whether the call to etcd was actually deadlocked or merely waiting for a result. Currently the CheckingSyncWrapper defines "alive" only as a sync func that has not returned an error. This can be wrong when a member is down and perpetually unreachable: the sync funcs keep returning errors even though the sync loop is still making progress. Instead, we want to detect genuine deadlock situations, where the sync loop is stuck for a prolonged period of time.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1762965898773139456/
Version-Release number of selected component (if applicable):
>4.14
How reproducible:
Always
Steps to Reproduce:
1. Create a healthy cluster.
2. Make sure one etcd member never responds while its node stays up (e.g. shut down the kubelet or block the etcd ports with a firewall).
3. Wait for the CEO to restart its pod on the failing health probe and dump its stack.
Actual results:
CEO controllers return errors but are not necessarily deadlocked; the failing aliveness check currently causes a pod restart anyway.
Expected results:
The CEO should mark the member as unhealthy and continue its service without deadlocking, and it should not restart its pod by failing the health probe.
Additional info:
highly related to OCPBUGS-30169
blocks:
  OCPBUGS-30915 [4.15] CEO aliveness check should only detect deadlocks - Closed
clones:
  OCPBUGS-30169 CEO deadlocks on health checking a downed member - Closed
is cloned by:
  OCPBUGS-30915 [4.15] CEO aliveness check should only detect deadlocks - Closed
links to:
  RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update