Epic
Resolution: Done
Major
Add the possibility to retry posthooks when using the ClusterCurator resource
We use the ClusterCurator resource with Ansible prehooks and posthooks to trigger upgrades of OpenShift clusters in ArgoCD. When the posthook fails in the curatorjob pod, the only way to retry it is to run the Ansible posthook manually in AWX.
Solution Proposal:
Per team discussion, here is the new proposal. Following the ArgoCD practice, we will add an operation field at the same level as spec. The end user can set operation.retryPosthook to retry the install or upgrade posthook one time.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: ClusterCurator
metadata:
  name: xjcluster1
  namespace: xjcluster1
  labels:
    open-cluster-management: curator
operation:
  retryPosthook: installPosthook/upgradePosthook
spec:
  desiredCuration: install/upgrade
  install:
    towerAuthSecret: toweraccess
    prehook:
      - name: Demo Job Template
        extra_vars:
          sn_severity: 1
          sn_priority: 1
          appName: prehook job
          target_clusters:
            - my-cluster
    posthook:
      - name: Demo Job Template 2
        extra_vars:
          sn_severity: 2
          sn_priority: 2
          appName: posthook job
          target_clusters:
            - my-cluster
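To make the shape of the proposed field concrete, here is a minimal Go sketch of how a top-level operation field could be decoded and checked. The type names, JSON tags, and the validRetry helper are assumptions made for illustration; they are not taken from the cluster-curator-controller API types. The only values treated as valid are the two named in the proposal, installPosthook and upgradePosthook.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Operation is a hypothetical Go shape for the proposed top-level field.
type Operation struct {
	RetryPosthook string `json:"retryPosthook,omitempty"`
}

// ClusterCurator is a trimmed stand-in for illustration, not the real API type.
type ClusterCurator struct {
	Kind      string     `json:"kind"`
	Operation *Operation `json:"operation,omitempty"`
	Spec      struct {
		DesiredCuration string `json:"desiredCuration,omitempty"`
	} `json:"spec"`
}

// validRetry reports whether the operation names one of the two posthooks
// listed in the proposal. A missing operation is not a retry request.
func validRetry(op *Operation) bool {
	if op == nil {
		return false
	}
	return op.RetryPosthook == "installPosthook" || op.RetryPosthook == "upgradePosthook"
}

// parse decodes a JSON rendering of the resource into the stand-in type.
func parse(raw string) (ClusterCurator, error) {
	var cc ClusterCurator
	err := json.Unmarshal([]byte(raw), &cc)
	return cc, err
}

func main() {
	cc, err := parse(`{"kind":"ClusterCurator","operation":{"retryPosthook":"upgradePosthook"},"spec":{}}`)
	if err != nil {
		panic(err)
	}
	fmt.Println("retry requested:", validRetry(cc.Operation))
}
```

Modelling operation as a pointer keeps "field absent" distinct from "field present but empty", which matters later when the controller removes the field after a retry.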
Case 1: install
1. The end user creates a ClusterCurator CR and sets spec.desiredCuration = install.
2. The cluster curator controller fills in spec.curatorJob when the curation starts.
3. If any curator job fails, the controller updates the ClusterCurator status conditions, removes spec.desiredCuration (this is the current implementation), and removes the operation field as well.
4. Once the cause of the posthook failure has been resolved, the end user can set operation.retryPosthook in the same ClusterCurator. The ClusterCurator is then reconciled to run only the specified posthook, once. The controller updates the ClusterCurator status condition and removes the operation field.
5. Make sure the retry runs only once: the removal of the operation field must not trigger another reconcile.
To do this, add a check to the cluster curator reconcile predicate function:
https://github.com/stolostron/cluster-curator-controller/blob/4eaa3d1d9db2f908b5507a86ccb4b2d5a811bb08/controllers/clustercurator_controller.go#[…]1
if newClusterCurator.Operation != oldClusterCurator.Operation && newClusterCurator.Operation == nil {
    return false
}
6. If the posthook fails again, go back to step 4.
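The predicate check in step 5 can be sketched in isolation as follows. This is only an illustration under stated assumptions: the trimmed ClusterCurator type and the shouldReconcileUpdate function are hypothetical, the real predicate in the controller contains other conditions that are not reproduced here, and the sketch simply returns true for everything except the "operation was just removed" transition.

```go
package main

import "fmt"

// Operation and ClusterCurator are trimmed, hypothetical stand-ins.
type Operation struct {
	RetryPosthook string
}

type ClusterCurator struct {
	Operation *Operation
}

// shouldReconcileUpdate mirrors the proposed check: when an update only
// removes the operation field, it is cleanup after a retry (or after the
// failure handling in step 3), so no new reconcile should be triggered.
func shouldReconcileUpdate(oldCC, newCC *ClusterCurator) bool {
	if newCC.Operation != oldCC.Operation && newCC.Operation == nil {
		return false
	}
	// Assumption: every other update is allowed through in this sketch.
	return true
}

func main() {
	retry := &Operation{RetryPosthook: "installPosthook"}
	withOp := &ClusterCurator{Operation: retry}
	withoutOp := &ClusterCurator{}

	fmt.Println("operation added:  ", shouldReconcileUpdate(withoutOp, withOp))
	fmt.Println("operation removed:", shouldReconcileUpdate(withOp, withoutOp))
}
```

Because Operation is a pointer, the inequality in the condition compares pointers, so together with the nil check it is true exactly when the old object had an operation and the new one does not.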
Case 2: upgrade
This is essentially the same as the install case, with two differences: spec.desiredCuration = "upgrade" in step 1, and in step 3 spec.desiredCuration is retained rather than removed.
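The difference between the two cases in step 3 can be sketched as a single failure handler. The names below (onCuratorJobFailure, the flattened DesiredCuration field) are hypothetical and only model the field changes described above; updating the status conditions is left out of the sketch.

```go
package main

import "fmt"

// Operation and ClusterCurator are trimmed, hypothetical stand-ins.
type Operation struct {
	RetryPosthook string
}

type ClusterCurator struct {
	DesiredCuration string // stands in for spec.desiredCuration
	Operation       *Operation
}

// onCuratorJobFailure applies the step 3 field changes: the operation field
// is always removed, while desiredCuration is cleared for install and kept
// for upgrade.
func onCuratorJobFailure(cc *ClusterCurator) {
	if cc.DesiredCuration == "install" {
		cc.DesiredCuration = ""
	}
	cc.Operation = nil
}

func main() {
	install := &ClusterCurator{DesiredCuration: "install"}
	upgrade := &ClusterCurator{DesiredCuration: "upgrade"}

	onCuratorJobFailure(install)
	onCuratorJobFailure(upgrade)

	fmt.Printf("install: %q, upgrade: %q\n", install.DesiredCuration, upgrade.DesiredCuration)
}
```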
Note: The ClusterCurator CR could be maintained by ArgoCD with auto sync on. After the user manually adds the retry operation, the ArgoCD application controller could be triggered to remove it, because the ClusterCurator in the git repo is the source of truth. Our curator controller also removes it at the end of the retry. The operation cleanup could therefore happen twice, and we need to make sure the second cleanup triggers no additional action.
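One way to keep the double cleanup harmless is to make the cleanup idempotent, so that a second removal is a no-op that the caller can detect. The sketch below assumes a hypothetical clearOperation helper on a trimmed ClusterCurator type; in a real controller the returned flag could decide whether an update call is issued at all.

```go
package main

import "fmt"

// Operation and ClusterCurator are trimmed, hypothetical stand-ins.
type Operation struct {
	RetryPosthook string
}

type ClusterCurator struct {
	Operation *Operation
}

// clearOperation removes the operation field and reports whether anything
// actually changed. If the field is already gone (for example because ArgoCD
// auto sync removed it first), it does nothing and returns false.
func clearOperation(cc *ClusterCurator) bool {
	if cc.Operation == nil {
		return false
	}
	cc.Operation = nil
	return true
}

func main() {
	cc := &ClusterCurator{Operation: &Operation{RetryPosthook: "upgradePosthook"}}

	fmt.Println("first cleanup changed something: ", clearOperation(cc))
	fmt.Println("second cleanup changed something:", clearOperation(cc))
}
```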
ACM Epic Done Checklist
See presentation and details.
Update with "Y" if Epic meets the requirement, "N" if it does not, or "N/A" if not applicable.
- N/A FIPS Readiness
- Y Works in Disconnected
- N/A Global Proxy Support
- N/A Installable to Infrastructure Nodes
- Y No impacts to Performance and Scalability
- N/A Backup and Restorable