  1. OpenShift Bugs
  2. OCPBUGS-5150

Cluster not deleted correctly when BMH finalizing takes some time


    • Icon: Bug Bug
    • Resolution: Not a Bug
    • Icon: Undefined Undefined
    • None
    • 4.10
    • GitOps ZTP
    • None
    • False

      Description of problem:

      The objects we have in ArgoCD are managed by sync-waves, so, in theory, the NS should be deleted last. And I think it is: it is deleted last, but without waiting for the other objects to be deleted.
      Actually, if you delete the cluster, you will immediately see the NS marked for deletion (Terminating). Maybe it was the last, but it didn't wait. Sync-waves are used for ordering, but there is no logic for waiting.
      
      I have done an experiment that demonstrates that, if the NS is not deleted (or it is deleted after the other resources have been deleted), everything works OK. To force that, I synced everything but the NS:
      
      https://github.com/jgato/jgato/blob/main/random_docs/Demonstrating%20the%20waves%20are%20not%20working.md
      
      So, ZTP/ArgoCD needs a new mechanism to ensure that the NS is deleted last, and only after waiting for the other resources to be deleted.
      The issue only shows up if some of the resources take some time to finalize. For example, a BMH being deprovisioned after something went wrong: the BMO needs more time to finish, and the NS has already been marked as Terminating.
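
A minimal sketch of the "order plus wait" behaviour asked for above, not ArgoCD's or ZTP's actual implementation. It assumes the NS sits in the lowest sync-wave, so deleting from the highest wave down leaves it for last. The `delete` and `exists` callables, the `(wave, name)` tuples and the timeout values are hypothetical placeholders for whatever really talks to the hub cluster.

```python
import time
from collections import defaultdict
from typing import Callable, Iterable, Tuple

Resource = Tuple[int, str]  # (sync-wave, object name); hypothetical representation


def delete_in_waves(
    resources: Iterable[Resource],
    delete: Callable[[str], None],
    exists: Callable[[str], bool],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Delete from the highest wave down, waiting until each wave is really gone."""
    waves = defaultdict(list)
    for wave, name in resources:
        waves[wave].append(name)

    for wave in sorted(waves, reverse=True):
        for name in waves[wave]:
            delete(name)  # only requests deletion; finalizers may still be running

        # The missing piece described in this report: do not touch the next
        # wave (ultimately the NS) while any object of this wave still exists.
        waited = 0.0
        while any(exists(name) for name in waves[wave]):
            if waited >= timeout_seconds:
                raise TimeoutError(f"wave {wave} not deleted after {timeout_seconds}s")
            sleep(poll_seconds)
            waited += poll_seconds
```

With this ordering, a BMH that needs several polling rounds to finish deprovisioning simply delays the NS deletion instead of racing with it.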
      
      

       

       

      Version-Release number of selected component (if applicable):

      ztp4.10

      How reproducible:

      Create a cluster.
      Make something go wrong, so the BMH installation fails and takes longer to deprovision. For example, a non-existent rootDeviceHint.
      Delete the cluster.

       

       

      Steps to Reproduce:

      1. Create a cluster with your siteconfig.
      2. Configure at least one host with a wrong rootDeviceHint.
      3. Sync.
      4. The cluster is created on the hub, but it will never finish the installation.
      5. Delete the cluster from your siteconfig.
      6. ArgoCD will try to sync and will delete all the objects.
      7. There is no wait between sync-waves, so the NS is also marked for deletion.
      8. Because of that, the BMH never finishes its deletion and the cluster gets stuck.
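
To confirm the stuck state after step 8, one option is to look for objects that are marked for deletion but still hold finalizers. The sketch below is only an illustration: it assumes you already have the `items` list from a JSON dump of the cluster namespace (for example from `oc get ... -o json`), and `stuck_objects` is a hypothetical helper name. It relies only on the standard `metadata.deletionTimestamp` and `metadata.finalizers` fields.

```python
from typing import Any, Dict, List


def stuck_objects(items: List[Dict[str, Any]]) -> List[str]:
    """Return 'Kind/name' for objects marked for deletion that still hold finalizers."""
    stuck = []
    for item in items:
        meta = item.get("metadata", {})
        # deletionTimestamp set + non-empty finalizers = deletion requested but not finished
        if meta.get("deletionTimestamp") and meta.get("finalizers"):
            stuck.append(f"{item.get('kind', '?')}/{meta.get('name', '?')}")
    return stuck
```

If a BareMetalHost shows up in the result while the NS is already Terminating, you are most likely hitting the situation described in this report.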
      
      

      Actual results:

      The siteconfig deletion gets stuck. You cannot recreate the cluster.

      Expected results:

      The siteconfig is deleted

      Additional info:

      This is related to this other bug:
      https://issues.redhat.com/browse/OCPBUGS-3029
      It contains a detailed explanation of why this only happens when something went wrong with the cluster.

            rhn-support-imiller Ian Miller
            jgato@redhat.com Jose Gato Luis
            Periyamaruthu Mohanraj Periyamaruthu Mohanraj
