Red Hat Advanced Cluster Management / ACM-2028

RFE Add the possibility to retry posthooks when using the ClusterCurator resource



      We use the ClusterCurator resource with Ansible prehooks and posthooks to trigger upgrades of OpenShift clusters in ArgoCD. When a posthook fails in the curatorjob pod, the only way to retry it is to run the Ansible posthook job manually in AWX.

       

      Solution Proposal:

       

      Per team discussion, here is the new proposal. Following the ArgoCD practice, we will add an operation field at the same level as spec. The end user can set operation.retryPosthook to retry the install or upgrade posthook one time.

       

      apiVersion: cluster.open-cluster-management.io/v1beta1
      kind: ClusterCurator
      metadata:
        name: xjcluster1
        namespace: xjcluster1
        labels:
          open-cluster-management: curator
      operation:
        retryPosthook: installPosthook   # or upgradePosthook
      spec:
        desiredCuration: install   # or upgrade
        install:
          towerAuthSecret: toweraccess
          prehook:
          - name: Demo Job Template
            extra_vars:
              sn_severity: 1
              sn_priority: 1
              appName: prehook job
              target_clusters:
                - my-cluster
          posthook:
          - name: Demo Job Template 2
            extra_vars:
              sn_severity: 2
              sn_priority: 2
              appName: posthook job
              target_clusters:
                - my-cluster
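
      To illustrate how the retryPosthook value could map onto the hooks defined in spec, here is a minimal Go sketch. The types Hook, Hooks, and Spec and the function posthooksToRetry are hypothetical stand-ins, not the controller's actual API, and the sketch assumes the upgrade section carries a posthook list shaped like the install one.

```go
package main

import "fmt"

// Hook, Hooks and Spec are simplified, hypothetical stand-ins for the
// ClusterCurator spec; only the fields needed for the retry are modeled.
type Hook struct{ Name string }

type Hooks struct {
	Prehook  []Hook
	Posthook []Hook
}

type Spec struct {
	Install Hooks
	Upgrade Hooks
}

// posthooksToRetry picks the posthook list named by operation.retryPosthook.
// Any other value is rejected so that a typo never runs the wrong jobs.
func posthooksToRetry(retryPosthook string, spec Spec) ([]Hook, error) {
	switch retryPosthook {
	case "installPosthook":
		return spec.Install.Posthook, nil
	case "upgradePosthook":
		return spec.Upgrade.Posthook, nil
	default:
		return nil, fmt.Errorf("unsupported retryPosthook value %q", retryPosthook)
	}
}

func main() {
	spec := Spec{Install: Hooks{
		Prehook:  []Hook{{Name: "Demo Job Template"}},
		Posthook: []Hook{{Name: "Demo Job Template 2"}},
	}}

	hooks, err := posthooksToRetry("installPosthook", spec)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, h := range hooks {
		fmt.Println("retrying posthook:", h.Name)
	}
}
```

      Only the posthook list is returned, so a retry in this sketch never re-runs the prehook or the install itself.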
       
      

      Case 1: install

      1. The end user creates a ClusterCurator CR and specifies spec.desiredCuration = install.

      2. The cluster curator controller fills in spec.curatorJob when the curation starts.

      3. If any curator job fails, the controller updates the ClusterCurator status conditions and removes spec.desiredCuration (this is the current implementation). It also removes the operation field.

      4. Once the cause of the posthook failure is understood, the end user can set operation.retryPosthook in the same ClusterCurator. The ClusterCurator is then reconciled to run only the specified posthook, once. Afterwards the controller updates the ClusterCurator status condition and removes the operation field.

      5. Make sure the retry runs only once: the removal of the operation field must not trigger reconciliation again and again.
      To do this, we need to add a check in the cluster curator reconcile predicate function:
      https://github.com/stolostron/cluster-curator-controller/blob/4eaa3d1d9db2f908b5507a86ccb4b2d5a811bb08/controllers/clustercurator_controller.go#[…]1

       

      // Skip reconciliation when the operation field has just been removed.
      if oldClusterCurator.Operation != nil && newClusterCurator.Operation == nil {
          return false
      }
       

       

      6. If the posthook fails again, go back to step 4.
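
      The predicate check from step 5 can be sketched in isolation as follows. This is a minimal sketch with hypothetical stand-in types (Operation, ClusterCurator) and a hypothetical function name (shouldReconcile); the real controller would do this inside its controller-runtime update predicate.

```go
package main

import "fmt"

// Operation and ClusterCurator are simplified, hypothetical stand-ins for
// the real API types; only the field discussed in this proposal is modeled.
type Operation struct {
	RetryPosthook string // "installPosthook" or "upgradePosthook"
}

type ClusterCurator struct {
	Operation *Operation
}

// shouldReconcile mimics the update predicate: it returns false when the
// operation field has just been removed, so the cleanup done at the end of
// a retry does not start another reconcile.
func shouldReconcile(oldCC, newCC *ClusterCurator) bool {
	if oldCC.Operation != nil && newCC.Operation == nil {
		return false
	}
	return true
}

func main() {
	withRetry := &ClusterCurator{Operation: &Operation{RetryPosthook: "installPosthook"}}
	cleared := &ClusterCurator{}

	// The user adds the retry request: reconcile.
	fmt.Println(shouldReconcile(cleared, withRetry)) // true
	// The operation field is removed after the retry: skip.
	fmt.Println(shouldReconcile(withRetry, cleared)) // false
}
```

      Because a repeated failure in step 6 requires the user to set operation.retryPosthook again, each retry corresponds to exactly one update event that passes the predicate.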
       
      Case 2: upgrade
      This is basically the same as the install case, except that spec.desiredCuration = "upgrade" in step 1, and in step 3 spec.desiredCuration is kept rather than removed.
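
      The difference between the two cases in step 3 can be sketched as below. The types and the onJobFailure function are hypothetical illustrations of the described behavior, not the controller's actual code, and the status condition update is left out.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the ClusterCurator API types.
type Operation struct{ RetryPosthook string }

type Spec struct{ DesiredCuration string }

type ClusterCurator struct {
	Operation *Operation
	Spec      Spec
}

// onJobFailure applies the step 3 cleanup: the operation field is always
// removed, while spec.desiredCuration is cleared for install only and is
// kept for upgrade.
func onJobFailure(cc *ClusterCurator) {
	cc.Operation = nil
	if cc.Spec.DesiredCuration == "install" {
		cc.Spec.DesiredCuration = ""
	}
}

func main() {
	install := &ClusterCurator{Spec: Spec{DesiredCuration: "install"}}
	upgrade := &ClusterCurator{Spec: Spec{DesiredCuration: "upgrade"}}

	onJobFailure(install)
	onJobFailure(upgrade)

	fmt.Printf("install: %q\n", install.Spec.DesiredCuration) // install: ""
	fmt.Printf("upgrade: %q\n", upgrade.Spec.DesiredCuration) // upgrade: "upgrade"
}
```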

       

      Note: The ClusterCurator CR could be maintained by ArgoCD with auto sync on. After the user manually adds the retry operation, the ArgoCD application controller could be triggered to clean it up, because the ClusterCurator in the git repo is the source of truth. Our curator controller will also clean it up at the end of the retry, so the operation cleanup could happen twice. In this case, we need to make sure the second cleanup causes no additional action.
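
      One way to keep a double cleanup harmless is to make the cleanup idempotent, as in the sketch below. The types and the clearOperation helper are hypothetical; the idea is only that a second removal reports "nothing changed", so the caller can skip any write to the API server.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the ClusterCurator API types.
type Operation struct{ RetryPosthook string }

type ClusterCurator struct {
	Operation *Operation
}

// clearOperation removes the operation field and reports whether anything
// changed. If ArgoCD (or an earlier call) already removed the field, it
// returns false and the caller can skip the update.
func clearOperation(cc *ClusterCurator) bool {
	if cc.Operation == nil {
		return false
	}
	cc.Operation = nil
	return true
}

func main() {
	cc := &ClusterCurator{Operation: &Operation{RetryPosthook: "upgradePosthook"}}

	fmt.Println(clearOperation(cc)) // true: first cleanup removes the field
	fmt.Println(clearOperation(cc)) // false: second cleanup is a no-op
}
```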

       

      ACM Epic Done Checklist

      See presentation and details.

      Update with "Y" if Epic meets the requirement, "N" if it does not,  or "N/A" if not applicable.

      • N/A FIPS Readiness
      • Y Works in Disconnected
      • N/A Global Proxy Support
      • N/A Installable to Infrastructure Nodes
      • Y No impacts to Performance and Scalability
      • N/A Backup and Restorable

              fxiang@redhat.com Feng Xiang
              rh-ee-yajbar Younes Ajbar
              Atif Shafi Atif Shafi