Type: Bug
Resolution: Not a Bug
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.10
Component/s: GitOps ZTP
Labels:
- telco-priority-3

Regression: None
Blocked: False
Blocked Reason: None
Description of problem:

The objects we have in ArgoCD are managed by sync-waves, so, in theory, the namespace (NS) should be deleted last. I think it is indeed deleted last, but it does not wait for the other objects to be deleted first.
In fact, if you delete the cluster, you immediately see the NS marked for deletion (Terminating). It may have been the last object to be deleted, but it did not wait. Sync-waves enforce order, but there is no logic for waiting.

I ran an experiment demonstrating that if the NS is not deleted (or is deleted only after the other resources are gone), everything works fine. To force that, I synced everything except the NS:

https://github.com/jgato/jgato/blob/main/random_docs/Demonstrating%20the%20waves%20are%20not%20working.md

So ZTP/ArgoCD needs a new mechanism to ensure that the NS is deleted last, and only after the other resources have finished deleting.
The issue only appears when some of the resources take a while to finalize. For example, a BMH being deprovisioned after something went wrong: the BMO needs more time to finish, and in the meantime the NS has already been marked as Terminating.
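
To illustrate the difference between ordering and waiting, here is a minimal sketch using a toy in-memory model. It is not ArgoCD or ZTP code: the names `Resource`, `tick`, `delete_by_waves` and `make_cluster` are hypothetical. It assumes that deletion runs in reverse wave order with the NS in the lowest wave, and it models the behaviour reported in this issue, namely that a finalizer which still needs time cannot finish once the NS is Terminating.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    wave: int                  # sync-wave; assumed: higher waves are deleted first
    finalize_ticks: int = 0    # ticks the finalizer still needs once deletion is requested
    is_namespace: bool = False
    deleting: bool = False
    gone: bool = False


def tick(resources):
    """Advance the toy cluster by one step."""
    ns_terminating = any(r.is_namespace and r.deleting for r in resources)
    for r in resources:
        if r.is_namespace or not r.deleting or r.gone:
            continue
        if r.finalize_ticks == 0:
            r.gone = True
        elif not ns_terminating:
            r.finalize_ticks -= 1
        # else: the finalizer makes no progress (behaviour reported in this issue)
    # The namespace only disappears once everything inside it is gone.
    others_gone = all(r.gone for r in resources if not r.is_namespace)
    for r in resources:
        if r.is_namespace and r.deleting and others_gone:
            r.gone = True


def delete_by_waves(resources, wait, max_ticks=20):
    """Request deletion wave by wave; return the names of resources left behind."""
    for wave in sorted({r.wave for r in resources}, reverse=True):
        members = [r for r in resources if r.wave == wave]
        for r in members:
            r.deleting = True
        if wait:
            # Block until this wave is really gone before touching the next one.
            for _ in range(max_ticks):
                if all(r.gone for r in members):
                    break
                tick(resources)
    for _ in range(max_ticks):
        tick(resources)
    return [r.name for r in resources if not r.gone]


def make_cluster():
    # A BMH whose deprovisioning is slow, e.g. after a failed installation.
    return [
        Resource("Namespace", wave=0, is_namespace=True),
        Resource("BareMetalHost", wave=1, finalize_ticks=3),
    ]


if __name__ == "__main__":
    print("order only:    ", delete_by_waves(make_cluster(), wait=False))
    print("order and wait:", delete_by_waves(make_cluster(), wait=True))
```

With ordering only, both objects are left behind in this model. With ordering plus waiting, nothing is left, which matches the experiment where the NS was kept out of the sync.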

Version-Release number of selected component (if applicable):

ztp4.10

How reproducible:

Create a cluster.
Break something so that the BMH installation fails and deprovisioning takes longer, for example a non-existent rootDeviceHint.
Delete the cluster.

Steps to Reproduce:

1. Create a cluster with your siteconfig.
2. Configure at least one host with a wrong rootDeviceHint.
3. Sync.
4. The cluster is created on the hub, but it never finishes the installation.
5. Delete the cluster from your siteconfig.
6. ArgoCD will try to sync and will delete all the objects.
7. There is no wait between sync-waves, so the NS is also marked for deletion.
8. Because of that, the BMH never finishes its deletion and the cluster gets stuck.

Actual results:

The siteconfig deletion gets stuck. You cannot recreate the cluster.

Expected results:

The siteconfig is deleted

Additional info:

This is related to this other bug:
https://issues.redhat.com/browse/OCPBUGS-3029
It contains a detailed explanation of why this only happens when something went wrong with the cluster.

Is related to: OCPBUGS-42945 BareMetalHost CR fails to delete on cluster cleanup (Verified)

Assignee: Ian Miller
Reporter: Jose Gato Luis
QA Contact: Periyamaruthu Mohanraj
Votes: 0
Watchers: 7
Created: 2022/12/22 3:34 PM
Updated: 2024/11/06 4:56 PM
Resolved: 2023/07/13 9:08 PM
