Type: Bug
Resolution: Unresolved
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.12.z
Component/s: Cloud Compute / MachineHealthCheck
Labels:
- dragonfly
- machine-healthchecking

Regression: No
Blocked: False
Blocked Reason: None

Description of problem:

When a new node appears, it is instantly destroyed (remediated) instead of being given the configured time to become healthy.

Version-Release number of selected component (if applicable):

OCP 4.12.36

How reproducible:

We see this on clusters with 300+ nodes running on OpenStack. The stack is known to have some latency (poor network, frequent slow etcd applies).

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

A node spawns and is immediately destroyed.

Expected results:

Remediation should occur only according to the configured timers.

Additional info:

As reported in https://github.com/openshift/machine-api-operator/pull/1237,
which proposed an incorrect fix for this issue:

The remediation happens without any log for its cause, and the only path in the code where that is the case is when the node.UID is empty. We do see a node Name in the target string() that is logged, so we know that a node.Name is present at that time.
[..]
I think we have a kind of race condition here: when the nodeRef on the Machine is set, a reconcile is triggered. However, the node might not be in the MHC controller's cache yet, probably because of the "large deployment on OpenStack, with high latency". We would need to deal with that case better, maybe with a "retry to get the node once" approach...

links to

handle nodes without UID #1237

MHC Testing in OCP of 330 worker node on AWS

Share in google drive with MG + Logs

Assignee: Marc Sluiter

Reporter: Francois Rigault

QA Contact: Huali Liu

Votes: 0

Watchers: 9

Created: 2024/05/13 9:18 AM

Updated: 2024/11/14 2:43 PM
