
OCPBUGS-18940: On-prem keepalived check scripts do not take machine-config-server into consideration


      Description of problem:

      The check scripts for the on-prem keepalived static pods only check haproxy, which in turn only forwards traffic to the kube-apiserver pods. They do not take into consideration whether the control plane node has a healthy machine-config-server.
      
      This can be a problem because, in a failure scenario, it may be necessary to rebuild nodes, and the machine-config-server is required for that (it is what serves the Ignition configs).
      
      One example is the etcd restore procedure (https://docs.openshift.com/container-platform/4.12/backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html). In our case, the following happened (I'd suggest reading the recovery procedure before this sequence of events):
      - The machine-config-server was healthy on the recovery control plane node but not on the other hosts.
      - At this point, we can only guarantee the health of the recovery control plane node, because the non-recovery ones are to be replaced and must first be removed from the cluster (their node objects deleted) so that the OVN-Kubernetes control plane can work properly.
      - The keepalived check scripts were succeeding on the non-recovery control plane nodes because their haproxy pods were up and running. That is actually fine from the kube-apiserver's point of view, but it does not take the machine-config-server into consideration.
      - As the machine-config-server was not reachable, provisioning the new masters required by the procedure was impossible.
      
      In parallel to this bug, I'll be raising another bug to improve the restore procedure, basically asking for the keepalived static pods to be stopped on the non-recovery control plane nodes. That would prevent the exact situation above.
      
      However, there are other situations where machine-config-server pods may be unhealthy, and manually stopping keepalived should not be the only remedy. In such cases, keepalived itself should take the machine-config-server into consideration.
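      As a rough illustration of what "taking the machine-config-server into consideration" could mean, below is a minimal sketch of an additional check, written in Python purely for illustration. Port 22623 is the standard machine-config-server port, but the /healthz path and everything else here are assumptions, not the scripts actually shipped in the on-prem static pods:

      #!/usr/bin/env python3
      # Illustrative keepalived-style check: exit 0 only if the local
      # machine-config-server answers on its service port. This is a sketch,
      # not the real OpenShift check script; the /healthz path is an assumption.
      import ssl
      import sys
      import urllib.request

      MCS_URL = "https://localhost:22623/healthz"  # assumed endpoint

      def mcs_healthy(url: str, timeout: float = 3.0) -> bool:
          # The MCS serves a cluster-internal certificate; skip verification here
          # because the check only cares that the process answers at all.
          ctx = ssl.create_default_context()
          ctx.check_hostname = False
          ctx.verify_mode = ssl.CERT_NONE
          try:
              with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
                  return 200 <= resp.status < 300
          except Exception:
              return False

      if __name__ == "__main__":
          # keepalived treats exit code 0 as "check passed", non-zero as "failed".
          sys.exit(0 if mcs_healthy(MCS_URL) else 1)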
      

      Version-Release number of selected component (if applicable):

      4.12
      

      How reproducible:

      Under certain failure scenarios where the machine-config-server is not healthy on one of the control plane nodes.
      

      Steps to Reproduce:

      1. Try to provision a new machine for recovery.
      

      Actual results:

      The machine-config-server is not reachable through the VIP because keepalived assigned the VIP to a node that does not have a working machine-config-server pod.
      

      Expected results:

      Keepalived takes machine-config-server health into consideration when failing over the VIP.
      

      Additional info:

      Possible ideas to fix:
      - Create a dedicated check script for the machine-config-server. It could carry less weight than the kube-apiserver checks (see the illustrative sketch after this list).
      - Include the machine-config-server endpoint in the haproxy instance that fronts the kube-apiservers.
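      To illustrate the weighting idea with made-up numbers (the base priority and penalties below are hypothetical, not the values the on-prem templates actually use), a failing machine-config-server check would lower a node's effective VRRP priority enough to move the VIP to a peer whose machine-config-server is healthy, while a failing kube-apiserver check would still dominate:

      # Illustrative model of keepalived track_script weights (all values hypothetical).
      # keepalived lowers a node's effective priority when a tracked script with a
      # negative weight fails; the node with the highest effective priority holds the VIP.

      BASE_PRIORITY = 100           # hypothetical VRRP priority configured on each node
      APISERVER_CHECK_PENALTY = 50  # hypothetical: failing haproxy/kube-apiserver check
      MCS_CHECK_PENALTY = 10        # hypothetical: failing machine-config-server check

      def effective_priority(apiserver_ok: bool, mcs_ok: bool) -> int:
          prio = BASE_PRIORITY
          if not apiserver_ok:
              prio -= APISERVER_CHECK_PENALTY
          if not mcs_ok:
              prio -= MCS_CHECK_PENALTY
          return prio

      # Recovery node: both checks pass. Non-recovery node: the apiserver check passes
      # (haproxy is up) but the MCS check fails, as in the scenario described above.
      recovery = effective_priority(apiserver_ok=True, mcs_ok=True)       # 100
      non_recovery = effective_priority(apiserver_ok=True, mcs_ok=False)  # 90

      # With an MCS check in place the VIP prefers the recovery node (100 > 90), yet a
      # node whose kube-apiserver path is broken (priority 50 or 40) still loses to any
      # node whose apiserver check passes.
      print(recovery, non_recovery)

      Whatever the exact mechanism, the point is that a node with a broken machine-config-server should not keep the VIP while a peer with a healthy one is available.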
      
