-
Bug
-
Resolution: Not a Bug
-
Undefined
-
None
-
4.10
-
None
-
False
-
-
Description of problem:
The objects we have in ArgoCD are managed by sync-waves, so, in theory, the NS should be deleted last. I think that is what happens: it is deleted last, but without waiting for the other objects to be deleted. If you delete the cluster, you immediately see the NS marked for deletion (Terminating). It may have been last in the order, but it did not wait. Sync-waves define ordering, but there is no logic for waiting.
I ran an experiment demonstrating that if the NS is not deleted (or is deleted only after the other resources are gone), everything works fine. To force that, I synced everything except the NS: https://github.com/jgato/jgato/blob/main/random_docs/Demonstrating%20the%20waves%20are%20not%20working.md
So ZTP/ArgoCD needs a new mechanism to ensure that the NS is deleted last, and only after the other resources have finished deleting. The issue only appears when some resource takes a while to finalize, for example a BMH whose deprovisioning is slow because something went wrong. The BMO needs more time to finish, but the NS has already been marked as Terminating.
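To illustrate the difference between ordering and waiting, here is a minimal Python sketch. It is a toy in-memory model, not ArgoCD or Kubernetes code: `Resource`, `FakeCluster`, `delete_in_waves` and `ticks_to_finalize` are hypothetical names, and it assumes deletion proceeds from the highest wave to the lowest, with the NS in the lowest wave.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    wave: int
    # Hypothetical knob: how many polling ticks the finalizer needs.
    ticks_to_finalize: int = 0


class FakeCluster:
    """Toy stand-in for a cluster; it only tracks deletion state."""

    def __init__(self, resources):
        self.live = {r.name: r for r in resources}
        self.terminating = {}  # name -> remaining ticks
        self.log = []

    def request_delete(self, name):
        # Marks the object as terminating; it stays live until finalized.
        self.terminating[name] = self.live[name].ticks_to_finalize
        self.log.append(f"delete-requested:{name}")

    def tick(self):
        # One polling interval: finalizers make progress or complete.
        for name in list(self.terminating):
            if self.terminating[name] <= 0:
                del self.terminating[name]
                del self.live[name]
                self.log.append(f"gone:{name}")
            else:
                self.terminating[name] -= 1

    def exists(self, name):
        return name in self.live


def delete_in_waves(cluster, resources, wait=True, max_ticks=100):
    """Delete from the highest wave to the lowest.

    With wait=False only the order of the delete requests is enforced.
    With wait=True each wave must be fully gone before the next starts.
    """
    for wave in sorted({r.wave for r in resources}, reverse=True):
        members = [r for r in resources if r.wave == wave]
        for r in members:
            cluster.request_delete(r.name)
        if not wait:
            continue
        ticks = 0
        while any(cluster.exists(r.name) for r in members):
            if ticks >= max_ticks:
                raise TimeoutError(f"wave {wave} did not finish deleting")
            cluster.tick()
            ticks += 1
    return cluster.log


def run(wait):
    resources = [
        Resource("namespace", wave=0),
        Resource("bmh", wave=1, ticks_to_finalize=3),  # slow deprovisioning
    ]
    return delete_in_waves(FakeCluster(resources), resources, wait=wait)
```

Calling `run(wait=False)` requests deletion of the NS while the BMH is still terminating, which mirrors the behaviour described above. Calling `run(wait=True)` only requests deletion of the NS once the BMH is gone.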
Version-Release number of selected component (if applicable):
ztp4.10
How reproducible:
Create a cluster. Make something go wrong so that the BMH installation fails and deprovisioning takes longer, for example a non-existent rootDeviceHint. Then delete the cluster.
Steps to Reproduce:
1. Create a cluster with your siteconfig.
2. Configure at least one host with a wrong rootDeviceHint.
3. Sync.
4. The cluster is created on the hub, but the installation never finishes.
5. Delete the cluster from your siteconfig.
6. ArgoCD tries to sync and deletes all the objects.
7. There is no wait between sync-waves, so the NS is also marked for deletion.
8. Because of that, the BMH never finishes its deletion and the cluster gets stuck.
Actual results:
The siteconfig deletion gets stuck. You cannot recreate the cluster.
Expected results:
The siteconfig is deleted
Additional info:
This is related to another bug: https://issues.redhat.com/browse/OCPBUGS-3029, which contains a detailed explanation of why this only happens when something went wrong with the cluster.
- is related to
-
OCPBUGS-42945 BareMetalHost CR fails to delete on cluster cleanup
- Verified