Issue Type: Bug
Resolution: Done
Priority: Critical
Affects Versions: 4.10, 4.11, 4.12, 4.13, 4.14
Severity: Important
Sprint: OSDOCS Sprint 248
Release Note Type: Release Note Not Required
Description of problem:
During etcd restore, we ran into problems because the keepalived VIP was assigned to a non-recovery control plane node. To be safe, we should ensure that this never happens. The way to fix it would be:
- In step 4 (the steps run on the non-recovery control plane nodes), add the following sub-steps after substep 4f:
  - If the file "/etc/kubernetes/manifests/keepalived.yaml" exists, move it out of the way with "mv /etc/kubernetes/manifests/keepalived.yaml /root".
  - Wait until there is no running keepalived container, as reported by "crictl ps --name keepalived".
  - Check that the control plane node has no VIP assigned, as reported by "ip -o address | grep -E '<apiVIP>|<ingressVIP>'". For each VIP the command reports, run "ip address del <reportedWrongVIP> dev <deviceOfTheWrongReportedVIP>".
- After step 5, add a step to double-check that the recovery control plane node owns the VIP by running "ip -o address | grep -E '<apiVIP>'".
A consolidated sketch of the cleanup sub-steps is shown below. Note that we do not have to start the keepalived static pods again, because the non-recovery control plane nodes will be replaced and the replacement nodes will have the static pods in place.
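For illustration, here is a minimal sketch of the cleanup to run on each non-recovery control plane node, assuming the two VIPs are passed in as arguments (the API_VIP and INGRESS_VIP variable names are placeholders, not from the docs):

#!/bin/bash
# Sketch only: remove keepalived and any stray VIP from a non-recovery
# control plane node during etcd restore. $1 and $2 are the cluster's
# API and ingress VIP addresses (placeholders, adjust to your cluster).
set -eu

API_VIP="$1"
INGRESS_VIP="$2"

# Stop the keepalived static pod by moving its manifest out of the way.
if [ -f /etc/kubernetes/manifests/keepalived.yaml ]; then
  mv /etc/kubernetes/manifests/keepalived.yaml /root
fi

# Wait until kubelet has actually torn down the keepalived container.
while [ -n "$(crictl ps --name keepalived -q)" ]; do
  sleep 5
done

# "ip -o address" prints one line per assigned address: the device name
# is the second field and the address (with prefix) is the fourth.
# Delete any VIP that is still assigned to this node.
ip -o address | grep -E "${API_VIP}|${INGRESS_VIP}" | while read -r _ dev _ addr _; do
  ip address del "${addr}" dev "${dev}"
done

On the recovery node, after step 5, "ip -o address | grep -E '<apiVIP>'" should then show the API VIP assigned to one of its interfaces.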
Version-Release number of selected component (if applicable):
All currently supported versions
How reproducible:
Sometimes
Steps to Reproduce:
1. Read the etcd restore docs.
Actual results:
The restore docs leave keepalived running on the non-recovery control plane nodes, so a VIP can stay assigned to a node that is about to leave the cluster.
Expected results:
The restore docs stop keepalived on the non-recovery control plane nodes and verify that the recovery node owns the VIPs.
Additional info:
The exact problem we found was reported at https://issues.redhat.com/browse/OCPBUGS-18940 . It is still worthwhile for them to fix keepalived to take machine-config-server health into consideration, but we should also not allow keepalived to keep running on nodes that will no longer be part of the cluster.
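For context on the keepalived side of the fix, the health check that OCPBUGS-18940 asks for would be expressed as a keepalived track script. A minimal sketch of what that could look like in keepalived.conf, assuming the machine-config-server answers locally on its standard port 22623 (the /healthz path, the chk_mcs name, and the instance name are assumptions here, not the shipped configuration):

vrrp_script chk_mcs {
    # Consider this node unhealthy if the local machine-config-server
    # does not answer. Path and timeout values are illustrative.
    script "/usr/bin/timeout 4 /usr/bin/curl -o /dev/null -kLsf https://localhost:22623/healthz"
    interval 2
    fall 2
    rise 2
}

vrrp_instance API {
    # ... existing instance settings omitted ...
    track_script {
        chk_mcs
    }
}

With track_script and no weight, a failing check puts the instance into FAULT state, so the VIP moves to a healthy peer instead of staying on a node whose machine-config-server is down.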