OpenShift GitOps / GITOPS-4677

If the first sync attempt has failed, selfHeal has no effect


      Description of problem:

      Via ACM (an unreleased version) we deploy the gitops subscription (with
      ARGOCD_CLUSTER_CONFIG_NAMESPACES='*') and an Argo application inside the
      cluster-wide Argo CD instance in the openshift-gitops namespace. This Argo
      application has syncPolicy.{automated.selfHeal: true, retry.limit: 20}.
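
      For reference, the relevant syncPolicy stanza looks roughly like this in the
      Application manifest (a minimal sketch for illustration; project, source, and
      destination values are placeholders, not the actual values from this deployment):

      apiVersion: argoproj.io/v1alpha1
      kind: Application
      metadata:
        name: multicloud-gitops-group-one
        namespace: openshift-gitops
      spec:
        project: default              # placeholder
        source:
          repoURL: https://example.com/repo.git   # placeholder
          path: .
          targetRevision: HEAD
        destination:
          server: https://kubernetes.default.svc
        syncPolicy:
          automated:
            selfHeal: true
          retry:
            limit: 20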

      The application is stuck forever with the following message:

      one or more objects failed to apply, reason: serviceaccounts is forbidden: User "system:serviceaccount:openshift-gitops:openshift-gitops-argocd-application-controller" cannot create resource "serviceaccounts" in API group "" in the namespace "imperative". Retrying attempt #1 at 8:16AM.
      STARTED AT 11 hours ago (Fri May 10 2024 10:16:00 GMT+0200)
      

      No retries are done whatsoever; in fact it was stuck on retry number 1 for 11
      hours. Note that doing a manual sync by clicking the sync button fixes it 100%
      of the time. The permission problem seems to be only a temporary one; we are
      not entirely sure why that is yet.

      The current suspicion is that the newer ACM version triggers this race
      condition much more easily, but that the bug itself is ACM-independent.

      Analysis:

      First we have the two failure messages related to the permission issue (see openshift-gitops-application-controller-0-argocd-application-controller.log):

      time="2024-05-11T08:35:48Z" level=info msg="Apply failed" application=openshift-gitops/multicloud-gitops-group-one dryRun=false message="serviceaccounts is forbidden: User \"system:serviceaccount:openshift-gitops:openshift-gitops-argocd-application-controller\" cannot create resource \"serviceaccounts\" in API group \"\" in the namespace \"imperative\"" syncId=00001-GAokB task="Sync/0 resource /ServiceAccount:imperative/imperative-sa nil->obj (,,)"
      time="2024-05-11T08:35:48Z" level=info msg="Adding resource result, status: 'SyncFailed', phase: 'Failed', message: 'serviceaccounts is forbidden: User \"system:serviceaccount:openshift-gitops:openshift-gitops-argocd-application-controller\" cannot create resource \"serviceaccounts\" in API group \"\" in the namespace \"imperative\"'" application=openshift-gitops/multicloud-gitops-group-one kind=ServiceAccount name=imperative-sa namespace=imperative phase=Sync syncId=00001-GAokB
      time="2024-05-11T08:35:48Z" level=info msg="Apply failed" application=openshift-gitops/multicloud-gitops-group-one dryRun=false message="serviceaccounts is forbidden: User \"system:serviceaccount:openshift-gitops:openshift-gitops-argocd-application-controller\" cannot create resource \"serviceaccounts\" in API group \"\" in the namespace \"imperative\"" syncId=00001-GAokB task="Sync/0 resource /ServiceAccount:imperative/imperative-admin-sa nil->obj (,,)"
      time="2024-05-11T08:35:48Z" level=info msg="Adding resource result, status: 'SyncFailed', phase: 'Failed', message: 'serviceaccounts is forbidden: User \"system:serviceaccount:openshift-gitops:openshift-gitops-argocd-application-controller\" cannot create resource \"serviceaccounts\" in API group \"\" in the namespace \"imperative\"'" application=openshift-gitops/multicloud-gitops-group-one kind=ServiceAccount name=imperative-admin-sa namespace=imperative phase=Sync syncId=00001-GAokB
      

      Very shortly afterwards the operation state is set to failed (this is relevant in the code below):

      time="2024-05-11T08:35:49Z" level=info msg="Updating operation state. phase: Running -> Failed, message: 'one or more tasks are running' -> 'one or more objects failed to apply, reason: serviceaccounts is forbidden: User \"system:serviceaccount:openshift-gitops:openshift-gitops-argocd-application-controller\" cannot create resource \"serviceaccounts\" in API group \"\" in the namespace \"imperative\"'" application=openshift-gitops/multicloud-gitops-group-one syncId=00001-GAokB
      

      After that we have a bunch of the following:

      time="2024-05-11T08:35:49Z" level=info msg="Skipping retrying in-progress operation. Attempting again at: 2024-05-11T08:35:59Z" application=openshift-gitops/multicloud-gitops-group-one
      time="2024-05-11T08:35:49Z" level=info msg="Skipping retrying in-progress operation. Attempting again at: 2024-05-11T08:35:59Z" application=openshift-gitops/multicloud-gitops-group-one
      time="2024-05-11T08:35:49Z" level=info msg="Skipping retrying in-progress operation. Attempting again at: 2024-05-11T08:35:59Z" application=openshift-gitops/multicloud-gitops-group-one
      

      A few seconds later we keep getting the following basically forever on:

      time="2024-05-11T08:35:55Z" level=warning msg="Skipping auto-sync: failed previous sync attempt to 2914a7e93fbb1ff969871b416a0db3e08da6125f" application=openshift-gitops/multicloud-gitops-group-one
      

      Looking at argo code in controller/appcontroller.go:

      // It is possible for manifests to remain OutOfSync even after a sync/kubectl apply (e.g.
      // auto-sync with pruning disabled). We need to ensure that we do not keep Syncing an
      // application in an infinite loop. To detect this, we only attempt the Sync if the revision
      // and parameter overrides are different from our most recent sync operation.
      if alreadyAttempted && (!selfHeal || !attemptPhase.Successful()) {
          if !attemptPhase.Successful() {
              logCtx.Warnf("Skipping auto-sync: failed previous sync attempt to %s", desiredCommitSHA)
              message := fmt.Sprintf("Failed sync attempt to %s: %s", desiredCommitSHA, app.Status.OperationState.Message)
              return &appv1.ApplicationCondition{Type: appv1.ApplicationConditionSyncError, Message: message}, 0
          }
          logCtx.Infof("Skipping auto-sync: most recent sync already to %s", desiredCommitSHA)
          return nil, 0
      } else if alreadyAttempted && selfHeal {
          if shouldSelfHeal, retryAfter := ctrl.shouldSelfHeal(app); shouldSelfHeal {
              for _, resource := range resources {
                  if resource.Status != appv1.SyncStatusCodeSynced {
                      op.Sync.Resources = append(op.Sync.Resources, appv1.SyncOperationResource{
                          Kind:  resource.Kind,
                          Group: resource.Group,
                          Name:  resource.Name,
                      })
                  }
              }
          } else {
              logCtx.Infof("Skipping auto-sync: already attempted sync to %s with timeout %v (retrying in %v)", desiredCommitSHA, ctrl.selfHealTimeout, retryAfter)
              ctrl.requestAppRefresh(app.QualifiedName(), CompareWithLatest.Pointer(), &retryAfter)
              return nil, 0
          }
      }


      Since selfHeal is set to true and the recurring log message is `Skipping
      auto-sync: failed previous sync attempt...`, it must be the case that
      `attemptPhase.Successful()` is returning false. We also know that because
      the operation state was set to Failed after the first try.

      So fundamentally we're stuck forever: the initial `attemptPhase.Successful()`
      returned false, so the first branch is taken even though selfHeal is set to
      true, and the self-heal branch is never reached. It seems that the additional
      `|| !attemptPhase.Successful()` condition is misplaced?
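
      The permanent skip can be reproduced with a small standalone sketch of the
      condition's boolean skeleton (an illustrative reduction, not the actual Argo
      CD source; the function and variable names here are made up):

      package main

      import "fmt"

      // autoSyncDecision reduces the quoted controller condition to its boolean
      // skeleton. When the first attempt failed (phaseSuccessful == false), the
      // first branch wins even with selfHeal == true, so the self-heal/retry
      // branch below it is never reached and no new sync is ever scheduled.
      func autoSyncDecision(alreadyAttempted, selfHeal, phaseSuccessful bool) string {
          if alreadyAttempted && (!selfHeal || !phaseSuccessful) {
              if !phaseSuccessful {
                  return "skip: failed previous sync attempt" // terminal: no retry
              }
              return "skip: already synced"
          } else if alreadyAttempted && selfHeal {
              return "self-heal"
          }
          return "sync"
      }

      func main() {
          // The reported scenario: first sync failed, selfHeal enabled.
          fmt.Println(autoSyncDecision(true, true, false))
          // prints "skip: failed previous sync attempt", never "self-heal"
      }

      Note that the self-heal branch is only reachable when the previous attempt
      succeeded, which is exactly the situation described in this report.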

      Prerequisites (if any, like setup, operators/versions):

      This is on OCP 4.15.12 and GitOps 1.12.2.

      Steps to Reproduce

      Non-trivial. Install the multicloud-gitops pattern on OCP via an unreleased ACM version.

      I can produce a reproducer environment if needed.

      Actual results:

      The application is stuck forever on the failed first sync; no retries or
      self-heal attempts happen until a manual sync is triggered.

      Expected results:

      With selfHeal: true and retry.limit: 20, the failed sync should be retried
      automatically.

      Reproducibility (Always/Intermittent/Only Once):

      100% so far

      Acceptance criteria: 

       

      Definition of Done:

      Build Details:

      Additional info (Such as Logs, Screenshots, etc):

       Full logs can be found here https://file.rdu.redhat.com/~mbaldess/acm-iib-gitops/

            Assignee: Unassigned
            Reporter: Michele Baldessari (rhn-support-mbaldess)