  1. OpenShift Bugs
  2. OCPBUGS-18942

Etcd restore procedure should ask to stop the non-recovery control plane nodes keepalived


    • Important
    • No
    • 5
    • OSDOCS Sprint 248
    • 1
    • False
    • N/A
    • Release Note Not Required

      Description of problem:

      During etcd restore, we had problems because the keepalived VIP was assigned to a non-recovery control plane node. In order to be safe, we should ensure that this never happens.
      
      The proposed fix is:
      - In step 4 (steps to be run on the non-recovery control plane nodes), add the following sub-steps after substep 4f:
        - If the "/etc/kubernetes/manifests/keepalived.yaml" file exists, move it away with this command: "mv /etc/kubernetes/manifests/keepalived.yaml /root"
        - Wait until no keepalived container is running, as shown by the output of "crictl ps --name keepalived"
        - Check that the control plane node has no VIP assigned by running "ip -o address | grep -E '<apiVIP>|<ingressVIP>'". For each VIP that this command reports, run "ip address del <reportedWrongVIP> dev <deviceOfTheWrongReportedVIP>"
      
      - After step 5, add a step to double-check that the recovery control plane node owns the VIP by running "ip -o address | grep -E '<apiVIP>'"
      
      Note that we don't have to start the keepalived static pods again because the non-recovery control plane nodes will be replaced and the replacement nodes will have the static pods in place.
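
The proposed sub-steps could be scripted roughly as follows. This is a minimal sketch, not the documented procedure: the function names and the API_VIP and INGRESS_VIP variables are hypothetical and must be supplied by the operator, and the VIP match uses an exact address comparison in awk instead of the "grep -E" from the description, so that a VIP such as 192.168.1.10 does not also match 192.168.1.100. Nothing runs until a function is called explicitly.

```shell
# Sketch only. API_VIP and INGRESS_VIP are placeholders set by the operator.
KEEPALIVED_MANIFEST="/etc/kubernetes/manifests/keepalived.yaml"

# Read `ip -o address` output on stdin and print "<address/prefix> <device>"
# for every address that exactly equals one of the VIPs passed as arguments.
list_assigned_vips() {
    awk -v vips="$*" '
        BEGIN { n = split(vips, want, " ") }
        {
            # In `ip -o address` output, $2 is the device, $4 is address/prefix.
            split($4, parts, "/")
            for (i = 1; i <= n; i++)
                if (parts[1] == want[i]) print $4, $2
        }'
}

# Intended for a NON-recovery control plane node, after substep 4f.
stop_keepalived_and_release_vips() {
    # Move the static pod manifest away if it exists.
    if [ -f "$KEEPALIVED_MANIFEST" ]; then
        mv "$KEEPALIVED_MANIFEST" /root
    fi

    # Wait until crictl no longer lists a running keepalived container.
    while [ -n "$(crictl ps --name keepalived -q)" ]; do
        sleep 5
    done

    # Delete every VIP that is still assigned to this node.
    ip -o address | list_assigned_vips "$API_VIP" "$INGRESS_VIP" |
    while read -r addr dev; do
        ip address del "$addr" dev "$dev"
    done
}

# Intended for the RECOVERY control plane node, after step 5.
# Succeeds only if the API VIP is assigned to this node.
recovery_node_owns_api_vip() {
    [ -n "$(ip -o address | list_assigned_vips "$API_VIP")" ]
}
```

For example, with API_VIP and INGRESS_VIP exported, an operator would call stop_keepalived_and_release_vips as root on each non-recovery node, and later check the exit status of recovery_node_owns_api_vip on the recovery node.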
      

      Version-Release number of selected component (if applicable):

      All currently supported versions
      

      How reproducible:

      Sometimes
      

      Steps to Reproduce:

      1. Read docs
      

      Actual results:

      Wrong docs
      

      Expected results:

      Right docs
      

      Additional info:

      The exact problem we found was reported at https://issues.redhat.com/browse/OCPBUGS-18940 . It is still worthwhile to fix keepalived so that it takes machine-config-server health into consideration, but we should also not allow keepalived to keep running on nodes that will no longer be part of the cluster.
      

            kowen@redhat.com Kevin Owen
            rhn-support-palonsor Pablo Alonso Rodriguez
            ge liu ge liu