-
Bug
-
Resolution: Done-Errata
-
Major
-
4.11
Description of problem:
The fix for https://issues.redhat.com/browse/OCPBUGS-15947 seems to have introduced a problem in our keepalived-monitor logic. What I'm seeing is that at some point all of the apiservers became unavailable. Because haproxy-monitor could no longer reach the API, it dropped the redirect firewall rule. That part is intended: in that case we normally want to fall back to direct, un-loadbalanced API connectivity.
However, due to the fix linked above, we now short-circuit the keepalived-monitor update loop if we're unable to retrieve the node list. That is exactly what happens when the node holding the VIP has neither a local apiserver nor the HAProxy firewall rule. As a result, we also skip updating the status of the firewall rule, so the keepalived priority for the node is never dropped and the VIP stays on that node.
Version-Release number of selected component (if applicable):
We backported the fix linked above to 4.11, so I expect this goes back at least that far.
How reproducible:
Unsure. It clearly doesn't happen every time, but I have a local dev cluster in this state, so it can happen.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
I think the solution here is just to move the firewall rule check earlier in the update loop so that it runs before we try to retrieve nodes. There's no dependency on the ordering of those two steps, so I don't foresee any major issues. To work around this, I believe we can just bounce keepalived on the affected node until the VIP ends up on a node with a local apiserver.
- blocks
-
OCPBUGS-18582 API VIP stuck on node with inaccessible API
- Closed
- is cloned by
-
OCPBUGS-18582 API VIP stuck on node with inaccessible API
- Closed
- links to
-
RHSA-2023:5006 OpenShift Container Platform 4.14.z security update