Bug
Resolution: Unresolved
Major
Description of problem:
Via ACM (an unreleased version) we deploy the GitOps subscription (with
ARGOCD_CLUSTER_CONFIG_NAMESPACES='*') and an Argo application inside the
cluster-wide Argo CD instance in the openshift-gitops namespace. This Argo
application has syncPolicy.{automated.selfHeal: true, retry.limit: 20}.
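For reference, the relevant part of the Application spec looks roughly like this (a sketch reconstructed from the values quoted above; field names per the Argo CD Application API, all other fields omitted):

```yaml
# Sketch of the relevant Application syncPolicy (values from this report)
syncPolicy:
  automated:
    selfHeal: true   # re-sync automatically when live state drifts
  retry:
    limit: 20        # up to 20 retry attempts are expected
```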
The application is stuck forever with the following message:
one or more objects failed to apply, reason: serviceaccounts is forbidden: User "system:serviceaccount:openshift-gitops:openshift-gitops-argocd-application-controller" cannot create resource "serviceaccounts" in API group "" in the namespace "imperative". Retrying attempt #1 at 8:16AM. STARTED AT 11 hours ago (Fri May 10 2024 10:16:00 GMT+0200)
No retries are done whatsoever; in fact it was stuck on retry attempt #1 for 11
hours. Note that doing a manual sync by clicking the sync button fixes it 100%
of the time. The permission problem seems to be only a temporary one; we are
not entirely sure why that is yet.
The current suspicion is that the newer ACM version somehow triggers this race
condition much more easily, but that the bug itself is ACM-independent.
Analysis:
First we have the two failure messages related to the permission issue (see openshift-gitops-application-controller-0-argocd-application-controller.log):
time="2024-05-11T08:35:48Z" level=info msg="Apply failed" application=openshift-gitops/multicloud-gitops-group-one dryRun=false message="serviceaccounts is forbidden: User \"system:serviceaccount:openshift-gitops:openshift-gitops-argocd-application-controller\" cannot create resource \"serviceaccounts\" in API group \"\" in the namespace \"imperative\"" syncId=00001-GAokB task="Sync/0 resource /ServiceAccount:imperative/imperative-sa nil->obj (,,)"
time="2024-05-11T08:35:48Z" level=info msg="Adding resource result, status: 'SyncFailed', phase: 'Failed', message: 'serviceaccounts is forbidden: User \"system:serviceaccount:openshift-gitops:openshift-gitops-argocd-application-controller\" cannot create resource \"serviceaccounts\" in API group \"\" in the namespace \"imperative\"'" application=openshift-gitops/multicloud-gitops-group-one kind=ServiceAccount name=imperative-sa namespace=imperative phase=Sync syncId=00001-GAokB
time="2024-05-11T08:35:48Z" level=info msg="Apply failed" application=openshift-gitops/multicloud-gitops-group-one dryRun=false message="serviceaccounts is forbidden: User \"system:serviceaccount:openshift-gitops:openshift-gitops-argocd-application-controller\" cannot create resource \"serviceaccounts\" in API group \"\" in the namespace \"imperative\"" syncId=00001-GAokB task="Sync/0 resource /ServiceAccount:imperative/imperative-admin-sa nil->obj (,,)"
time="2024-05-11T08:35:48Z" level=info msg="Adding resource result, status: 'SyncFailed', phase: 'Failed', message: 'serviceaccounts is forbidden: User \"system:serviceaccount:openshift-gitops:openshift-gitops-argocd-application-controller\" cannot create resource \"serviceaccounts\" in API group \"\" in the namespace \"imperative\"'" application=openshift-gitops/multicloud-gitops-group-one kind=ServiceAccount name=imperative-admin-sa namespace=imperative phase=Sync syncId=00001-GAokB
Very shortly afterwards the operation state is set to failed (this is relevant in the code below):
time="2024-05-11T08:35:49Z" level=info msg="Updating operation state. phase: Running -> Failed, message: 'one or more tasks are running' -> 'one or more objects failed to apply, reason: serviceaccounts is forbidden: User \"system:serviceaccount:openshift-gitops:openshift-gitops-argocd-application-controller\" cannot create resource \"serviceaccounts\" in API group \"\" in the namespace \"imperative\"'" application=openshift-gitops/multicloud-gitops-group-one syncId=00001-GAokB
After that we have a bunch of the following:
time="2024-05-11T08:35:49Z" level=info msg="Skipping retrying in-progress operation. Attempting again at: 2024-05-11T08:35:59Z" application=openshift-gitops/multicloud-gitops-group-one
time="2024-05-11T08:35:49Z" level=info msg="Skipping retrying in-progress operation. Attempting again at: 2024-05-11T08:35:59Z" application=openshift-gitops/multicloud-gitops-group-one
time="2024-05-11T08:35:49Z" level=info msg="Skipping retrying in-progress operation. Attempting again at: 2024-05-11T08:35:59Z" application=openshift-gitops/multicloud-gitops-group-one
A few seconds later we keep getting the following, basically forever:
time="2024-05-11T08:35:55Z" level=warning msg="Skipping auto-sync: failed previous sync attempt to 2914a7e93fbb1ff969871b416a0db3e08da6125f" application=openshift-gitops/multicloud-gitops-group-one
Looking at the Argo CD code in controller/appcontroller.go:
// It is possible for manifests to remain OutOfSync even after a sync/kubectl apply (e.g.
// auto-sync with pruning disabled). We need to ensure that we do not keep Syncing an
// application in an infinite loop. To detect this, we only attempt the Sync if the revision
// and parameter overrides are different from our most recent sync operation.
if alreadyAttempted && (!selfHeal || !attemptPhase.Successful()) {
	if !attemptPhase.Successful() {
		logCtx.Warnf("Skipping auto-sync: failed previous sync attempt to %s", desiredCommitSHA)
		message := fmt.Sprintf("Failed sync attempt to %s: %s", desiredCommitSHA, app.Status.OperationState.Message)
		return &appv1.ApplicationCondition{Type: appv1.ApplicationConditionSyncError, Message: message}, 0
	}
	logCtx.Infof("Skipping auto-sync: most recent sync already to %s", desiredCommitSHA)
	return nil, 0
} else if alreadyAttempted && selfHeal {
	if shouldSelfHeal, retryAfter := ctrl.shouldSelfHeal(app); shouldSelfHeal {
		for _, resource := range resources {
			if resource.Status != appv1.SyncStatusCodeSynced {
				op.Sync.Resources = append(op.Sync.Resources, appv1.SyncOperationResource{
					Kind:  resource.Kind,
					Group: resource.Group,
					Name:  resource.Name,
				})
			}
		}
	} else {
		logCtx.Infof("Skipping auto-sync: already attempted sync to %s with timeout %v (retrying in %v)", desiredCommitSHA, ctrl.selfHealTimeout, retryAfter)
		ctrl.requestAppRefresh(app.QualifiedName(), CompareWithLatest.Pointer(), &retryAfter)
		return nil, 0
	}
}
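The branch order above can be condensed into a small standalone sketch (hypothetical function and names, booleans standing in for the real controller state) that shows why a failed previous attempt always takes the first branch, even with selfHeal enabled:

```go
package main

import "fmt"

// autoSyncDecision is a hypothetical condensation of the branch order quoted
// above (not the real controller code). With alreadyAttempted=true and a failed
// previous attempt, the first branch returns early, so the selfHeal branch is
// unreachable regardless of the selfHeal flag.
func autoSyncDecision(alreadyAttempted, selfHeal, lastAttemptSuccessful bool) string {
	if alreadyAttempted && (!selfHeal || !lastAttemptSuccessful) {
		if !lastAttemptSuccessful {
			return "skip: failed previous sync attempt" // stuck here forever
		}
		return "skip: already synced"
	} else if alreadyAttempted && selfHeal {
		return "self-heal retry"
	}
	return "sync"
}

func main() {
	// Failed first attempt with selfHeal=true: the selfHeal branch never runs.
	fmt.Println(autoSyncDecision(true, true, false)) // prints "skip: failed previous sync attempt"
}
```

Note that the selfHeal branch is only reachable when the previous attempt succeeded, which matches the observed behavior: a failed sync is never retried automatically.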
selfHeal is set to true and the recurring log message is `Skipping
auto-sync: failed previous sync attempt...`, so it must be the case that
`attemptPhase.Successful()` is returning false. We also know that because
the operation state was set to Failed after the first try.
So fundamentally we are stuck forever due to this initial
`attemptPhase.Successful()` that returned false, even though selfHeal is set to
true. It seems that the additional `|| !attemptPhase.Successful()` in the first
condition is misplaced?
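If that reading is right, one possible reordering (an untested sketch, not a proposed upstream patch; same hypothetical condensation as above) would consult the selfHeal branch before treating a failed previous attempt as terminal:

```go
package main

import "fmt"

// reorderedDecision is a hypothetical reordering of the condensed branch logic:
// when selfHeal is enabled, a failed previous attempt triggers a self-heal
// retry instead of being skipped forever.
func reorderedDecision(alreadyAttempted, selfHeal, lastAttemptSuccessful bool) string {
	if alreadyAttempted && selfHeal && !lastAttemptSuccessful {
		return "self-heal retry" // failed attempt is now retried, not skipped
	}
	if alreadyAttempted && (!selfHeal || !lastAttemptSuccessful) {
		if !lastAttemptSuccessful {
			return "skip: failed previous sync attempt"
		}
		return "skip: already synced"
	} else if alreadyAttempted && selfHeal {
		return "self-heal retry"
	}
	return "sync"
}

func main() {
	// Same inputs that get stuck today now reach the selfHeal branch.
	fmt.Println(reorderedDecision(true, true, false)) // prints "self-heal retry"
}
```

This is only meant to illustrate the suspected misplacement; the real fix would also need to respect retry.limit and the selfHeal timeout handled by ctrl.shouldSelfHeal.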
Prerequisites (if any, like setup, operators/versions):
This is on OCP 4.15.12 and gitops 1.12.2
Steps to Reproduce:
Non-trivial. Install the multicloud-gitops pattern on OCP via an unreleased ACM version.
I can produce a reproducer environment if needed.
Actual results:
The application stays stuck on the first failed sync forever; no automatic
retries or self-heal re-syncs happen after the first failure.
Expected results:
With syncPolicy.automated.selfHeal: true and retry.limit: 20, Argo CD should
retry the failed sync automatically and the application should eventually sync
(a manual sync succeeds every time).
Reproducibility (Always/Intermittent/Only Once):
100% so far
Acceptance criteria:
Definition of Done:
Build Details:
Additional info (Such as Logs, Screenshots, etc):
Full logs can be found here https://file.rdu.redhat.com/~mbaldess/acm-iib-gitops/
- Must gather (gitops image + base image + vp image) here:
https://file.rdu.redhat.com/~mbaldess/acm-iib-gitops/must-gather.local.1589738716304099710/
- Just the openshift-gitops NS pod logs + the openshift-operators gitops logs here:
https://file.rdu.redhat.com/~mbaldess/acm-iib-gitops/openshift-gitops+openshift-operators-logs/
- Tar gz of all files here:
https://file.rdu.redhat.com/~mbaldess/acm-iib-gitops/must-gather-logs1.tgz
- Subscription:
https://file.rdu.redhat.com/~mbaldess/acm-iib-gitops/gitops-subscription.yaml
- Application definition:
https://file.rdu.redhat.com/~mbaldess/acm-iib-gitops/outofsync-openshift-gitops-application.yaml