This is a clone of issue OCPBUGS-52872. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-51864. The following is the description of the original issue:
—
Description of problem:
It appears that during creation of an IPI-deployed cluster on IBM Cloud, when MAPI starts up and begins to manage Machines, it can detect an unhealthy CP node and attempt a one-time replacement of that node, effectively destroying the cluster.
Version-Release number of selected component (if applicable):
4.19
How reproducible:
< 10%
Steps to Reproduce:
A potential way to reproduce, although the timing window is relatively small:
1. Create a new IPI cluster on IBM Cloud.
2. Stop a CP node once MAPI starts deploying, so that MAPI believes a CP node needs replacement. (This is an extremely tight window.)
Manual reproduction may not be possible except by luck.
Actual results:
One or more CP nodes get replaced during cluster creation, destroying etcd and other critical CP workloads and effectively breaking the cluster.
Expected results:
Successful cluster deployment.
Additional info:
Back when OCP was using RHEL 8 as the RHCOS base, a known bug in NetworkManager caused a new IBM Cloud instance (VSI) to lose its assigned IP, so the new instance was never able to start up with dracut and Ignition and work with the MCO. Because of this NetworkManager bug, a fix was created that forces a one-time replacement of that VSI by MAPI, to allow the VSI to complete bring-up and report into the cluster: https://issues.redhat.com/browse/OCPBUGS-1327
At that time, this appeared to affect only worker nodes, but in a few cases it now appears to affect CP nodes as well, which was not the intention. I will add some logs and details with what I think is proof that MAPI is performing this same replacement on CP nodes.
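To illustrate the kind of guard that would avoid this behavior, here is a minimal Go sketch, not the actual MAPI provider code. The `Machine` struct, its `Healthy` and `AlreadyReplaced` fields, and the function names are hypothetical stand-ins. The role label key and the value `master` are assumed to match what OpenShift sets on control plane Machines and should be verified against the real objects.

```go
package main

import "fmt"

// roleLabel is assumed to be the label that marks a Machine's role;
// verify the key and its values against the Machines in a real cluster.
const roleLabel = "machine.openshift.io/cluster-api-machine-role"

// Machine is a hypothetical, trimmed-down stand-in for the real Machine type.
type Machine struct {
	Name            string
	Labels          map[string]string
	Healthy         bool // hypothetical: result of the provider's health check
	AlreadyReplaced bool // hypothetical: one-time replacement already used
}

// isControlPlane reports whether the machine carries the control plane role
// (assumed here to be the label value "master").
func isControlPlane(m Machine) bool {
	return m.Labels[roleLabel] == "master"
}

// shouldReplaceOnce decides whether the one-time replacement may be applied.
// Control plane machines are always excluded, because replacing one during
// installation destroys etcd membership.
func shouldReplaceOnce(m Machine) bool {
	if isControlPlane(m) {
		return false
	}
	return !m.Healthy && !m.AlreadyReplaced
}

func main() {
	machines := []Machine{
		{Name: "cp-0", Labels: map[string]string{roleLabel: "master"}},
		{Name: "worker-0", Labels: map[string]string{roleLabel: "worker"}},
	}
	for _, m := range machines {
		fmt.Printf("%s replace=%t\n", m.Name, shouldReplaceOnce(m))
	}
}
```

With a check of this shape, an unhealthy worker would still get its single replacement attempt, while an unhealthy CP machine would be left alone for the installer or an administrator to handle.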
- clones: OCPBUGS-52872 [IBMCloud] MAPI replacing unhealthy CP nodes (ON_QA)
- is blocked by: OCPBUGS-52872 [IBMCloud] MAPI replacing unhealthy CP nodes (ON_QA)