Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: 4.14.z
Affects Version/s: 4.13, 4.12, 4.11, 4.10, 4.14
Component/s: Documentation / etcd
Labels:
- pre-merge-tested

Severity:
Important
Regression:
No
Story Points:
5
Sprint:
OSDOCS Sprint 248
sprint_count:
1
Blocked:
False
Blocked Reason:

None

Release Note Text:
N/A
Release Note Type:
Release Note Not Required


Description of problem:

During an etcd restore, we ran into problems because the keepalived VIP was assigned to a non-recovery control plane node. To be safe, the documented procedure should ensure that this cannot happen.

The fix would be:
- In step 4 (the steps to run on non-recovery control plane nodes), add the following substeps after substep 4f:
  - If the file "/etc/kubernetes/manifests/keepalived.yaml" exists, move it away with this command: "mv /etc/kubernetes/manifests/keepalived.yaml /root"
  - Wait until no keepalived container is running, as shown by the output of "crictl ps --name keepalived"
  - Check that the control plane node has no VIP assigned, using "ip -o address | grep -E '<apiVIP>|<ingressVIP>'". For each VIP that this command reports, run "ip address del <reportedWrongVIP> dev <deviceOfTheWrongReportedVIP>"

- After step 5, add a step to double-check that the recovery control plane node owns the VIP by running "ip -o address | grep -E '<apiVIP>'"

Note that we don't have to start the keepalived static pods again because the non-recovery control plane nodes will be replaced and the replacement nodes will have the static pods in place.
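
For illustration only, the proposed substeps could be sketched as the shell functions below. This is not the documented procedure: the function names (find_vips, stop_keepalived, remove_stray_vips, owns_vip) are hypothetical, the VIPs are passed in as arguments, and the sketch assumes it runs as root and that "ip -o address" prints the device in the second field and the address with its prefix in the fourth. It matches addresses exactly rather than with grep -E, so that one VIP cannot match as a prefix of a longer address.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not part of any product docs.

MANIFEST=/etc/kubernetes/manifests/keepalived.yaml

# Read `ip -o address` output on stdin and print "<device> <address/prefix>"
# for every line whose address equals one of the VIPs given as arguments.
find_vips() {
  awk -v vips="$*" '
    BEGIN { n = split(vips, want, " ") }
    {
      split($4, addr, "/")            # $4 looks like 192.0.2.10/32
      for (i = 1; i <= n; i++)
        if (addr[1] == want[i]) print $2, $4
    }'
}

# Non-recovery node: move the static pod manifest away, then wait until
# crictl no longer lists a running keepalived container.
stop_keepalived() {
  if [ -f "$MANIFEST" ]; then
    mv "$MANIFEST" /root
  fi
  while [ -n "$(crictl ps --name keepalived -q)" ]; do
    sleep 5
  done
}

# Non-recovery node: delete every VIP that is still assigned to this node.
remove_stray_vips() {
  ip -o address | find_vips "$@" | while read -r dev addr; do
    ip address del "$addr" dev "$dev"
  done
}

# Recovery node: succeed only if this node currently holds the given VIP.
owns_vip() {
  [ -n "$(ip -o address | find_vips "$1")" ]
}
```

On a non-recovery control plane node this would be used as "stop_keepalived && remove_stray_vips <apiVIP> <ingressVIP>", and on the recovery control plane node as "owns_vip <apiVIP>". The wait loop has no timeout, so a real procedure would probably want one.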

Version-Release number of selected component (if applicable):

All currently supported versions

How reproducible:

Sometimes

Steps to Reproduce:

1. Read docs

Actual results:

Wrong docs

Expected results:

Right docs

Additional info:

The exact problem we found was reported at https://issues.redhat.com/browse/OCPBUGS-18940. It is still worthwhile for keepalived to be fixed so that it takes machine-config-server health into consideration, but we also should not leave keepalived running on nodes that will no longer be part of the cluster.

Assignee: Kevin Owen

Reporter: Pablo Alonso Rodriguez

QA Contact: ge liu

Votes: 0

Watchers: 5

Created: 2023/09/13 11:11 AM

Updated: 2024/02/16 9:40 AM

Resolved: 2024/02/01 10:55 PM
