  1. OpenShift Bugs
  2. OCPBUGS-33584

remediation triggered immediately when a node appears

      Description of problem:

      When a new node appears, it is immediately destroyed.
          

      Version-Release number of selected component (if applicable):

      OCP 4.12.36
          

      How reproducible:

      We are seeing this on clusters with 300+ nodes on OpenStack. The stack is known to have some latency (the network is not great, and etcd shows a high number of slow applies).
          

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

      A node spawns and is immediately destroyed.
          

      Expected results:

      Remediation should occur only according to the specified timers.
          

      Additional info:

      As reported on https://github.com/openshift/machine-api-operator/pull/1237,
      which proposed an incorrect fix for this issue:
      
      The remediation happens without any log for its cause, and the only path in the code where that is the case is when the node.UID is empty. We do see a node Name in the target string() that is logged, so we know that a node.Name is present at that time.
      [..]
      I think we have a kind of race condition here: when the nodeRef on the Machine is set, a reconcile is triggered. However, the node might not be in the MHC controller's cache yet, probably because of the "large deployment on OpenStack, with high latency". We would need to deal with that case better, maybe with a "retry to get the node once" approach...
      
      
          

              slintes Marc Sluiter
              frigault Francois Rigault
              Huali Liu Huali Liu