Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-18606

API VIP stuck on node with inaccessible API

XMLWordPrintable

    • No
    • Rejected
    • False
    • Hide

      None

      Show
      None

      This is a clone of issue OCPBUGS-18582. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-18257. The following is the description of the original issue:

      Description of problem:

      The fix for https://issues.redhat.com/browse/OCPBUGS-15947 seems to have introduced a problem in our keepalived-monitor logic. What I'm seeing is that at some point all of the apiservers became unavailable, which caused haproxy-monitor to drop the redirect firewall rule since it wasn't able to reach the API and we normally want to fall back to direct, un-loadbalanced API connectivity in that case.

      However, due to the fix linked above we now short-circuit the keepalived-monitor update loop if we're unable to retrieve the node list, which is what will happen if the node holding the VIP has neither a local apiserver nor the HAProxy firewall rule. Because of this we will also skip updating the status of the firewall rule and thus the keepalived priority for the node won't be dropped appropriately.

      Version-Release number of selected component (if applicable):

      We backported the fix linked above to 4.11 so I expect this goes back at least that far.

      How reproducible:

      Unsure. It's clearly not happening every time, but I have a local dev cluster in this state so it can happen.

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

       

      Expected results:

       

      Additional info:

      I think the solution here is just to move the firewall rule check earlier in the update loop so it will have run before we try to retrieve nodes. There's no dependency on the ordering of those two steps so I don't foresee any major issues.
      
      To workaround this I believe we can just bounce keepalived on the affected node until the VIP ends up on the node with a local apiserver.

            bnemec@redhat.com Benjamin Nemec
            openshift-crt-jira-prow OpenShift Prow Bot
            Zhanqi Zhao Zhanqi Zhao
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

              Created:
              Updated:
              Resolved: