Bug
Resolution: Unresolved
Normal
None
4.12.z
No
False
Description of problem:
When a new node appears, it is immediately destroyed.
Version-Release number of selected component (if applicable):
OCP 4.12.36
How reproducible:
We are seeing this on clusters with 300+ nodes on OpenStack. The stack is known to have some latency (the network is not great, and etcd reports many slow applies).
Steps to Reproduce:
1. 2. 3.
Actual results:
A node spawns and is immediately destroyed.
Expected results:
Remediation should occur according to the specified timers.
Additional info:
As reported on https://github.com/openshift/machine-api-operator/pull/1237, which proposed a wrong fix for this issue:
The remediation happens without any log for its cause, and the only path in the code where that is the case is when the node.UID is empty. We do see a node name in the target string() that is logged, so we know that node.Name is present at that time. [..]
I think we have a kind of race condition here: when the nodeRef on the Machine is set, a reconcile is triggered. However, the node might not be in the MHC controller's cache yet, probably because of "large deployment on OpenStack, with high latency". We would need to handle that case better, maybe with a "retry to get the node once" approach...