  1. OpenShift GitOps
  2. GITOPS-2880

Argo CD application controller stops reconciling under certain circumstances

Details

    • 8
    • False
    • None
    • False
    • Before this update, users reported that argocd-application-controller would suddenly stop working when re-syncing. This update fixes the issue by adding logic to prevent a cluster cache deadlock. Now, users should no longer face this deadlock situation and apps should resync successfully.
    • GITOPS Sprint 238, GITOPS Sprint 241, GITOPS Sprint 242
    • Important
    • Customer Escalated, Customer Facing

    Description

      Description of problem:

      The application controller is prone to a deadlock situation where all available status processors will be occupied but never released.

      I believe this happens when an Application that has a resource finalizer set (i.e. Argo CD should also tear down the managed resources) is being deleted, and the Application's target cluster is removed while the pruning is still in progress.

      The only way to recover the application controller is to restart the pod.

      The issue is difficult to reproduce.

      Prerequisites (if any, like setup, operators/versions):

      We've seen this happening in customer setups under the following conditions:

      1. Customer is using ACM and ApplicationSet with ClusterDecisionResource (CDR) Generator
      2. The CDR (via ApplicationSet) deletes the managed applications targeting this cluster BEFORE the cluster has been removed from Argo CD's configuration. Thus, the resource finalizer, which ApplicationSet sets by default, is not removed.
      3. ACM removes the GitOpsCluster resource, triggering the removal of Argo CD's cluster configuration

      However, I believe it does not happen every time, so timing must be a factor.

      While this situation is not optimal, it should not lead to a complete halt of the application controller but should lead to a recoverable error situation.

      Steps to Reproduce

      Unknown at this point.

      Actual results:

      Controller deadlocks without possibility of recovery

      Expected results:

      Controller does not deadlock; application's resource deletion fails gracefully

      Reproducibility (Always/Intermittent/Only Once):

      Random

      Acceptance criteria: 

      • Issue is reproduced in local dev environment 
      • When using GitOps through ACM, addition/deletion of clusters should not result in app-controller getting stuck and/or crashing
      • e2e test that simulates ACM behavior of adding/removing clusters to verify that core issue is addressed 

      Definition of Done:

      - Acceptance criteria are met

      Build Details:

      Additional info (Such as Logs, Screenshots, etc):

       

            People

              jrao@redhat.com Jaideep Rao
              jfischer@redhat.com Jann Fischer
              Votes: 0
              Watchers: 15

                Time Tracking

                  Estimated: 8m
                  Remaining: 8m
                  Logged: Not Specified