Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.12.0
Affects Version/s: 4.11
Component/s: Networking / runtime-cfg
Labels: None

Activity Type: Quality / Stability / Reliability
Blocked: False
Blocked Reason: None
Story Points: None
Severity: None
Regression: No

Target Backport Versions: None
Target Version: 4.11.z
Release Blocker: Rejected
Sprint: None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status: None
Release Note Type: None
Release Note Text: None

Escape Reason: None
Escape Impact: None
Corrective Measures: None
SDLC stage when should've been found: None

This is a clone of issue ~~OCPBUGS-18582~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-18257~~. The following is the description of the original issue:
—
Description of problem:

The fix for https://issues.redhat.com/browse/OCPBUGS-15947 seems to have introduced a problem in our keepalived-monitor logic. What I'm seeing is that at some point all of the apiservers became unavailable. Because haproxy-monitor could no longer reach the API, it dropped the redirect firewall rule; in that case we normally want to fall back to direct, non-load-balanced API connectivity.

However, due to the fix linked above, we now short-circuit the keepalived-monitor update loop if we're unable to retrieve the node list, which is what happens when the node holding the VIP has neither a local apiserver nor the HAProxy firewall rule. Because of this, we also skip updating the status of the firewall rule, so the keepalived priority for the node is not dropped as it should be.

Version-Release number of selected component (if applicable):

We backported the fix linked above to 4.11 so I expect this goes back at least that far.

How reproducible:

Unsure. It's clearly not happening every time, but I have a local dev cluster in this state so it can happen.

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

I think the solution here is just to move the firewall rule check earlier in the update loop so it will have run before we try to retrieve nodes. There's no dependency on the ordering of those two steps so I don't foresee any major issues.
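The reordering can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the actual baremetal-runtimecfg code: `MonitorState`, `update_nodes_first`, `update_rule_first`, `rule_dropped`, and `api_down` are all hypothetical names, and the real loop does more than these two steps. The only point shown is that an early return on a failed node-list lookup leaves the firewall rule status stale unless the rule check runs first.

```python
from dataclasses import dataclass
from typing import Callable, List


class APIUnreachable(Exception):
    """Raised by the hypothetical node-list probe when no apiserver answers."""


@dataclass
class MonitorState:
    # Last observed status of the HAProxy redirect firewall rule.
    # Starts as True to model a node that had the rule before the outage.
    firewall_rule_present: bool = True


def update_nodes_first(state: MonitorState,
                       check_rule: Callable[[], bool],
                       list_nodes: Callable[[], List[str]]) -> bool:
    """Ordering described in the report: node list first, early return on failure."""
    try:
        list_nodes()
    except APIUnreachable:
        return False  # short-circuit: the rule status below is never refreshed
    state.firewall_rule_present = check_rule()
    return True


def update_rule_first(state: MonitorState,
                      check_rule: Callable[[], bool],
                      list_nodes: Callable[[], List[str]]) -> bool:
    """Proposed ordering: refresh the rule status before any step that can bail out."""
    state.firewall_rule_present = check_rule()
    try:
        list_nodes()
    except APIUnreachable:
        return False
    return True


def rule_dropped() -> bool:
    """Hypothetical probe: haproxy-monitor has removed the redirect rule."""
    return False


def api_down() -> List[str]:
    """Hypothetical probe: the node list cannot be retrieved."""
    raise APIUnreachable("no apiserver reachable")


stale = MonitorState()
update_nodes_first(stale, rule_dropped, api_down)
fresh = MonitorState()
update_rule_first(fresh, rule_dropped, api_down)
print(stale.firewall_rule_present, fresh.firewall_rule_present)  # True False
```

With the node-list lookup first, the recorded rule status stays `True` even though the rule is gone, so nothing would prompt a priority change. With the rule check first, the status is updated before the loop bails out.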

To work around this, I believe we can just bounce keepalived on the affected node until the VIP ends up on a node with a local apiserver.

clones: OCPBUGS-18606 API VIP stuck on node with inaccessible API (Closed)

is blocked by: OCPBUGS-18606 API VIP stuck on node with inaccessible API (Closed)

links to:
openshift/baremetal-runtimecfg#273: [release-4.11] OCPBUGS-18815: Move haproxy firewall rule check earlier in loop
RHBA-2023:5350 OpenShift Container Platform 4.11.z bug fix update

Assignee: Benjamin Nemec
Reporter: OpenShift Prow Bot
QA Contact: Zhanqi Zhao
Need Info From: None
Votes: 0
Watchers: 5
Created: 2023/09/11 7:42 PM
Updated: 2025/07/25 5:30 PM
Resolved: 2023/10/04 2:18 AM
