-
Bug
-
Resolution: Obsolete
-
Normal
Description of problem:
For the past 5-6 months, our CI for Validated Patterns has been seeing a failure rate of roughly 15-20% in its jobs. Further investigation showed that every failing job stops at the following error:
Error persisting normalized application spec: applications.argoproj.io \"industrial-edge-datacenter\" is forbidden: User \"system:serviceaccount:openshift-gitops:openshift-gitops-argocd-application-controller\" cannot patch resource \"applications\" in API group \"argoproj.io\" in the namespace \"openshift-gitops\"" application=industrial-edge-datacenter
In the failing jobs (the problem is not always reproducible), the permissions for openshift-gitops-argocd-application-controller appear to be incomplete: at a minimum, all the permissions on Argo CD applications are missing.
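To check a dumped Role for the missing permission, something along the following lines could be used. This is a minimal sketch, assuming the Role's `rules` field has already been loaded into Python (for example from the JSON output of `oc get role -o json`). The function name `rules_allow` and the sample rule sets are hypothetical, and the matching is simplified: it treats "*" as a wildcard and ignores `resourceNames` and subresources.

```python
def rules_allow(rules, verb, resource, api_group):
    """Return True if any RBAC rule grants `verb` on `resource` in `api_group`.

    Simplified matching: "*" is treated as a wildcard, while resourceNames
    and subresources are ignored.
    """
    def matches(values, wanted):
        return "*" in values or wanted in values

    for rule in rules or []:
        if (matches(rule.get("apiGroups", []), api_group)
                and matches(rule.get("resources", []), resource)
                and matches(rule.get("verbs", []), verb)):
            return True
    return False


# Hypothetical rule sets, shaped like the `rules` field of a Role.
broken_rules = [
    {"apiGroups": [""], "resources": ["secrets", "configmaps"],
     "verbs": ["get", "list", "watch"]},
]
working_rules = broken_rules + [
    {"apiGroups": ["argoproj.io"], "resources": ["applications"], "verbs": ["*"]},
]

# The check that corresponds to the "cannot patch resource" error above.
print(rules_allow(broken_rules, "patch", "applications", "argoproj.io"))
print(rules_allow(working_rules, "patch", "applications", "argoproj.io"))
```

Running such a check at the start of a CI job would at least let the job fail early with a clear message instead of stalling later.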
At http://file.rdu.redhat.com/~mbaldess/mlabonte-gitops-permission-timeout/diff-broken-working.txt there is a diff between a broken environment, where the permissions are incomplete, and an environment where I installed GitOps 1.5.10 by hand.
Additional information:
- So far this seems cloud-independent: we have observed it on both AWS and Azure. We do not currently test bare metal.
- We install GitOps from the “stable” channel, which is where 1.5.10 comes from
- This seems to be independent of the OCP version itself. We have seen this with at least 4.10.x and 4.11.x
- http://file.rdu.redhat.com/~mbaldess/mlabonte-gitops-permission-timeout/ie-hub-mark-azure-1001.css-qe.com/ has argo pod logs and olm logs as well from a broken environment
- I saw no differences in the ClusterRoles between a broken and a working environment
- The Roles of a broken environment are dumped here: http://file.rdu.redhat.com/~mbaldess/mlabonte-gitops-permission-timeout/roles/ . The Roles of a working environment are here: http://file.rdu.redhat.com/~mbaldess/mlabonte-gitops-permission-timeout/working-env/roles/
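The comparison between the two sets of Role dumps can be sketched as a set difference. This is only an illustration, assuming each dump has been loaded into a list of RBAC rule dicts. The names `expand_rules` and `missing_permissions` and the sample data are hypothetical. Wildcards ("*") are compared literally rather than expanded, so the result is a rough diff, not an exact evaluation of effective permissions.

```python
from itertools import product


def expand_rules(rules):
    """Flatten RBAC rules into a set of (apiGroup, resource, verb) tuples."""
    granted = set()
    for rule in rules or []:
        granted.update(product(
            rule.get("apiGroups", []),
            rule.get("resources", []),
            rule.get("verbs", []),
        ))
    return granted


def missing_permissions(working_rules, broken_rules):
    """Return tuples granted in the working dump but absent from the broken one."""
    return sorted(expand_rules(working_rules) - expand_rules(broken_rules))


# Hypothetical sample data standing in for the real dumps.
working = [
    {"apiGroups": [""], "resources": ["configmaps"], "verbs": ["get", "list"]},
    {"apiGroups": ["argoproj.io"], "resources": ["applications"],
     "verbs": ["get", "patch"]},
]
broken = [
    {"apiGroups": [""], "resources": ["configmaps"], "verbs": ["get", "list"]},
]

for api_group, resource, verb in missing_permissions(working, broken):
    print(f"missing: {verb} {resource}.{api_group}")
```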
Are there any other logs or information that I should provide to help understand this issue? My current gut feeling is that the RBAC rules that depend on the Argo CD CRDs (the ones defining applications and related resources) are not being applied, perhaps because the CRDs have not been fully registered with the API server at the time the rules are created.
We appear to have been seeing this on and off since last September (see https://issues.redhat.com/browse/MBP-353)
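One way to test this hypothesis from the CI side would be to record whether the CRD reports the `Established` condition before the job proceeds. The following is a minimal sketch under stated assumptions, not a description of what the operator does: `fetch_crd` is a hypothetical caller-supplied function that returns the CRD object (for example `applications.argoproj.io`) as a dict, and `wait_for_crd` is a hypothetical helper name. Only the `Established` condition in `status.conditions` is standard Kubernetes behaviour.

```python
import time


def is_crd_established(crd):
    """Return True if the CRD dict carries an Established=True condition."""
    conditions = (crd.get("status") or {}).get("conditions") or []
    return any(
        c.get("type") == "Established" and c.get("status") == "True"
        for c in conditions
    )


def wait_for_crd(fetch_crd, timeout=60.0, interval=2.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_crd() until the CRD is established or the timeout expires.

    fetch_crd is a caller-supplied (hypothetical) callable returning the CRD
    as a dict; sleep and clock are injectable to make the helper testable.
    """
    deadline = clock() + timeout
    while True:
        if is_crd_established(fetch_crd()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Logging the result of such a check in both passing and failing jobs would show whether the failures correlate with a late CRD registration.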
Reproducibility (Always/Intermittent/Only Once):
Intermittent
Build Details:
1.5.10 and 1.7.x
OCP 4.10.x and 4.11.x so far