Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: 1.9.1
Affects Version/s: 1.8.2
Component/s: ArgoCD
Labels:

Story Points: 8
Blocked: False
Blocked Reason: None
Ready: False
Release Note Text:

Before this update, users reported that argocd-application-controller would suddenly stop working when re-syncing. This update fixes the issue by adding logic to prevent a cluster cache deadlock. Now, users should no longer face this deadlock situation, and apps should resync successfully.

Intelligence Requested:
Market:
Target Version: 1.9.1, 1.8.4, 1.7.5
Sprint: GITOPS Sprint 238, GITOPS Sprint 241, GITOPS Sprint 242
Severity: Important
Customer Impact: Customer Escalated, Customer Facing

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

The application controller is prone to a deadlock situation where all available status processors will be occupied but never released.

I believe this happens when an Application that has a resource finalizer set (i.e. Argo CD should also tear down the managed resources) is being deleted, and the Application's target cluster is removed while the pruning is in progress.

The only way to recover the application controller is to restart the pod.

The issue is difficult to reproduce.

Prerequisites (if any, like setup, operators/versions):

We've seen this happening in customer setups under the following conditions:

- Customer is using ACM and ApplicationSet with the ClusterDecisionResource (CDR) Generator
- The CDR (via ApplicationSet) deletes the managed applications targeting this cluster BEFORE the cluster has been removed from Argo CD's configuration. Thus, the resource finalizer, which is set by ApplicationSet by default, is not removed.
- ACM removes the GitOpsCluster resource, triggering the removal of Argo CD's cluster configuration

However, I believe it does not happen every time, so there must be a timing issue.

While this situation is not optimal, it should not lead to a complete halt of the application controller but should lead to a recoverable error situation.

Steps to Reproduce

Unknown at this point.

Actual results:

Controller deadlocks without possibility of recovery

Expected results:

Controller does not deadlock; application's resource deletion fails gracefully

Reproducibility (Always/Intermittent/Only Once):

Random

Acceptance criteria:

- Issue is reproduced in local dev environment
- When using GitOps through ACM, addition/deletion of clusters should not result in app-controller getting stuck and/or crashing
- e2e test that simulates ACM behavior of adding/removing clusters to verify that core issue is addressed

Definition of Done:

- Acceptance criteria are met

Build Details:

Additional info (Such as Logs, Screenshots, etc):


developer-gitops-application-controller-0-argocd-application-controller.log (2.40 MB, 2023/05/12 12:42 PM)

Issue Links:

- is cloned by: GITOPS-3052 Argo CD application controller stops reconciling under certain circumstances (Closed)
- is duplicated by: GITOPS-2673 Argo CD Application controller is stuck Syncing applications (Closed)
- is related to: GITOPS-2782 OpenShift GitOps Performance Issue (v1.9.1) (Closed)
- is related to: GITOPS-3192 OpenShift GitOps Performance Issue (v1.8.4) (Closed)
- links to: OpenShift GitOps application-controller stops working respectively syncing application on OpenShift Container Platform 4
- links to: openshift/openshift-docs#62126: ML-gitops-RN-1-9-1: Documented the GitOps 1.9.1 RNs.


Assignee: Jaideep Rao

Reporter: Jann Fischer

Votes: 0

Watchers: 16

Created: 2023/04/27 10:06 PM

Updated: 2023/07/27 3:36 PM

Resolved: 2023/07/10 1:14 PM

Estimated:

Remaining:

Logged:

Not Specified
