-
Bug
-
Resolution: Unresolved
-
Critical
-
None
-
ACM 2.16.0
-
Quality / Stability / Reliability
-
False
-
-
False
-
-
-
Critical
-
None
Description of problem:
I'm trying to follow https://docs.google.com/document/d/1joCg1kN4yd6HZ_GzJTRYXjNPvJGz_p4jhw5r26zYCI4/edit?tab=t.0 to enable MCOA from a hub with 2672 managed SNOs. After about 3 hours, only 18 managed clusters have MCOA installed. Those appear to have been installed at the beginning, and everything else seems to be stuck.
The attached multicluster-observability-addon-manager-58bdb9b87d-7ngd8-manager.log shows one "CSR approved" line every couple of seconds. There are 2674 such lines in total, which roughly matches the number of managed clusters (2672). However, the line does not include the managed cluster name, so I can't tell when each managed cluster was approved. I've opened a usability bug, ACM-29306, for that.
As shown below, the first CSR was approved at 14:14:44 and the last one at 15:22:49, which is more than 1 hour later.
# oc logs -n open-cluster-management-observability multicluster-observability-addon-manager-58bdb9b87d-7ngd8 |grep "CSR approved" |wc
2674 16044 168462
# oc logs -n open-cluster-management-observability multicluster-observability-addon-manager-58bdb9b87d-7ngd8 |grep "CSR approved" |head -1
I0130 14:14:44.686282 1 csr_helpers.go:180] CSR approved
# oc logs -n open-cluster-management-observability multicluster-observability-addon-manager-58bdb9b87d-7ngd8 |grep "CSR approved" |tail -1
I0130 15:22:49.725100 1 csr_helpers.go:180] CSR approved
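To put a number on "every couple of seconds", the approval rate can be computed from the klog timestamps. The following is a minimal Python sketch, not part of any ACM tooling: the function names `parse_klog_time` and `approval_stats` are made up for illustration, and the year is an arbitrary placeholder because klog headers only carry month and day.

```python
import re
from datetime import datetime

# klog header: severity letter, MMDD, then HH:MM:SS.microseconds
KLOG_RE = re.compile(r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})")

def parse_klog_time(line, year=2000):
    """Return the timestamp of a klog line, or None if it has no klog header.

    The year is a placeholder (klog does not log one); it only matters
    if the log crosses a year boundary.
    """
    m = KLOG_RE.match(line.lstrip())
    if m is None:
        return None
    return datetime.strptime(f"{year}{m.group(1)} {m.group(2)}", "%Y%m%d %H:%M:%S.%f")

def approval_stats(lines, needle="CSR approved"):
    """Return (count, span_seconds, average_interval_seconds) for matching lines."""
    times = [parse_klog_time(l) for l in lines if needle in l]
    times = [t for t in times if t is not None]
    if len(times) < 2:
        return len(times), 0.0, 0.0
    span = (max(times) - min(times)).total_seconds()
    return len(times), span, span / (len(times) - 1)

if __name__ == "__main__":
    sample = [
        "I0130 14:14:44.686282 1 csr_helpers.go:180] CSR approved",
        "I0130 15:22:49.725100 1 csr_helpers.go:180] CSR approved",
    ]
    count, span, avg = approval_stats(sample)
    print(count, span, avg)
```

Applied to the figures above (2674 matching lines between 14:14:44 and 15:22:49, a span of about 4085 seconds), the average interval works out to roughly 1.5 seconds per approval.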
When I check one of the managed clusters that has the MCOA install issue, the klusterlet-agent log shows a gap of about 50 minutes between the line showing that the old addon was removed and the line showing that installation of the new agent started.
I0130 14:13:56.047172 1 helpers.go:201] "Resource is removed successfully" gvr="/v1, Resource=namespaces" resourceNamespace="" resourceName="open-cluster-management-addon-observability"
I0130 15:04:26.470242 1 base_controller.go:83] "Starting worker of controller ..." logger="ClientCertController@addon:multicluster-observability-addon:signer:kubernetes.io/kube-apiserver-client" worker-ID=1
Also, the multicluster-observability-addon-manager-58bdb9b87d-7ngd8-manager.log shows that there is only one worker each for the CSRApprovingController and the addon config/deploy controllers. Maybe that's why things are stuck?
I0130 14:08:36.107056 1 base_controller.go:78] Starting #1 worker of addon-deploy-controller controller ...
I0130 14:08:36.107059 1 base_controller.go:78] Starting #1 worker of addon-config-controller controller ...
I0130 14:08:36.107131 1 base_controller.go:40] Caches are synced for CSRSignController
I0130 14:08:36.107145 1 base_controller.go:78] Starting #1 worker of CSRSignController controller ...
I0130 14:08:36.107384 1 base_controller.go:40] Caches are synced for addon-registration-controller
I0130 14:08:36.107407 1 base_controller.go:78] Starting #1 worker of addon-registration-controller controller ...
I0130 14:08:36.108300 1 base_controller.go:40] Caches are synced for CSRApprovingController
I0130 14:08:36.108319 1 base_controller.go:78] Starting #1 worker of CSRApprovingController controller ...
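As a rough sanity check on the single-worker theory, here is a back-of-the-envelope Python sketch. It assumes approvals are handled strictly one at a time per worker and each takes about the same time; neither is confirmed from the controller code. The roughly 1.5 seconds per approval comes from the timestamps quoted earlier (2674 lines over about 68 minutes). The function name `estimated_drain_seconds` is hypothetical, and I don't know whether the worker count is actually configurable.

```python
import math

def estimated_drain_seconds(items, seconds_per_item, workers=1):
    """Estimate how long a queue takes to drain under a simple model.

    Assumption (not verified against the addon manager): each worker
    processes one item at a time and every item costs the same.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Each worker handles ceil(items / workers) items in the worst case.
    return math.ceil(items / workers) * seconds_per_item

if __name__ == "__main__":
    for workers in (1, 10):
        secs = estimated_drain_seconds(2672, 1.5, workers)
        print(f"{workers} worker(s): ~{secs / 60:.1f} min")
```

Under this model, one worker at 1.5 seconds per item needs about 67 minutes for 2672 clusters, which is close to the 68 minutes observed between the first and last "CSR approved" lines. That is consistent with serial processing, but it does not prove it.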
Attaching an example klusterlet-agent log from a managed cluster that has the issue: klusterlet-agent-vm00002.log
And, for reference, an example klusterlet-agent log from a managed cluster that has no MCOA install issue: klusterlet-agent-vm00168.log
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
- ...