  1. Red Hat Advanced Cluster Management
  2. ACM-29337

It takes 20 hours to enable MCOA on 2673 managed clusters


    • Bug
    • Resolution: Unresolved
    • Critical
    • None
    • ACM 2.16.0
    • Observability
    • Critical
    • None

      Description of problem:

      I'm trying to follow https://docs.google.com/document/d/1joCg1kN4yd6HZ_GzJTRYXjNPvJGz_p4jhw5r26zYCI4/edit?tab=t.0 to enable MCOA from a hub with 2672 managed SNOs. After about 3 hours, only 18 managed clusters have MCOA installed. They appear to have been installed at the beginning, and everything else then seems to be stuck.
      The attached multicluster-observability-addon-manager-58bdb9b87d-7ngd8-manager.log shows one "CSR approved" line every couple of seconds. In total there are 2672 of them, which matches the number of managed clusters, but the line has no managed cluster name, so I don't know when each managed cluster was approved or which one it was. I've opened a usability bug, ACM-29306, for that.
      As shown below, the first CSR was approved at 14:14:44 and the last one at 15:22:49, which is more than 1 hour later.

      # oc logs -n open-cluster-management-observability              multicluster-observability-addon-manager-58bdb9b87d-7ngd8 |grep "CSR approved" |wc
         2674   16044  168462
      # oc logs -n open-cluster-management-observability              multicluster-observability-addon-manager-58bdb9b87d-7ngd8 |grep "CSR approved" |head -1
      I0130 14:14:44.686282       1 csr_helpers.go:180] CSR approved
      # oc logs -n open-cluster-management-observability              multicluster-observability-addon-manager-58bdb9b87d-7ngd8 |grep "CSR approved" |tail -1
      I0130 15:22:49.725100       1 csr_helpers.go:180] CSR approved
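
The two timestamps and the `wc` line count above are enough to estimate the approval rate. The following is a minimal sketch, not part of the product: `klog_time` is a hypothetical helper, and because klog headers carry no year, the `year` argument is a placeholder assumption.

```python
from datetime import datetime


def klog_time(line: str, year: int = 2024) -> datetime:
    """Parse the 'Immdd hh:mm:ss.uuuuuu' header of a klog line.

    klog does not record the year, so `year` is only a placeholder.
    """
    severity_and_date, clock = line.split()[:2]
    stamp = f"{year} {severity_and_date[1:]} {clock}"
    return datetime.strptime(stamp, "%Y %m%d %H:%M:%S.%f")


# First and last "CSR approved" lines, copied from the output above.
first = "I0130 14:14:44.686282       1 csr_helpers.go:180] CSR approved"
last = "I0130 15:22:49.725100       1 csr_helpers.go:180] CSR approved"
approved_lines = 2674  # line count reported by `wc` above

elapsed = (klog_time(last) - klog_time(first)).total_seconds()
# N lines span N - 1 intervals.
interval = elapsed / (approved_lines - 1)
print(f"elapsed: {elapsed:.0f} s, mean interval: {interval:.2f} s per approval")
```

This works out to roughly 68 minutes in total and about 1.5 seconds between approvals, consistent with one approval "every couple of seconds".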

      When I check one of the managed clusters that has the MCOA install issue, the klusterlet-agent log shows a gap of about 50 minutes between the line reporting that the old addon was removed and the line reporting that installation of the new agent started.

       

      I0130 14:13:56.047172       1 helpers.go:201] "Resource is removed successfully" gvr="/v1, Resource=namespaces" resourceNamespace="" resourceName="open-cluster-management-addon-observability"
      I0130 15:04:26.470242       1 base_controller.go:83] "Starting worker of controller ..." logger="ClientCertController@addon:multicluster-observability-addon:signer:kubernetes.io/kube-apiserver-client" worker-ID=1
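
To locate such a stall in a full klusterlet-agent log without reading it by eye, one option is to scan for the widest gap between consecutive lines. This is a rough sketch under stated assumptions: `klog_time` and `largest_gap` are hypothetical helpers, every input line is assumed to start with a klog header, and the `year` default is a placeholder because klog omits the year.

```python
from datetime import datetime


def klog_time(line: str, year: int = 2024) -> datetime:
    """Parse the 'Immdd hh:mm:ss.uuuuuu' header; `year` is a placeholder."""
    severity_and_date, clock = line.split()[:2]
    stamp = f"{year} {severity_and_date[1:]} {clock}"
    return datetime.strptime(stamp, "%Y %m%d %H:%M:%S.%f")


def largest_gap(lines):
    """Return (gap_seconds, earlier_line, later_line) for the widest gap
    between consecutive lines. Needs at least two lines."""
    stamped = [(klog_time(line), line) for line in lines]
    pairs = zip(stamped, stamped[1:])
    (t0, l0), (t1, l1) = max(pairs, key=lambda p: p[1][0] - p[0][0])
    return (t1 - t0).total_seconds(), l0, l1


# The two klusterlet-agent lines quoted above.
sample = [
    'I0130 14:13:56.047172       1 helpers.go:201] "Resource is removed successfully" gvr="/v1, Resource=namespaces" resourceNamespace="" resourceName="open-cluster-management-addon-observability"',
    'I0130 15:04:26.470242       1 base_controller.go:83] "Starting worker of controller ..." logger="ClientCertController@addon:multicluster-observability-addon:signer:kubernetes.io/kube-apiserver-client" worker-ID=1',
]

gap_seconds, before, after = largest_gap(sample)
print(f"largest gap: {gap_seconds / 60:.1f} min")
```

For the two lines quoted above the gap is about 50.5 minutes. On a real log file, the lines could be read with `open(path).read().splitlines()` after filtering out any lines without a klog header.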

      Also, the multicluster-observability-addon-manager-58bdb9b87d-7ngd8-manager.log shows that there is only one worker each for the CSRApprovingController and the addon config/deploy controllers. Maybe that's why things are stuck?

      I0130 14:08:36.107056       1 base_controller.go:78] Starting #1 worker of addon-deploy-controller controller ...
      I0130 14:08:36.107059       1 base_controller.go:78] Starting #1 worker of addon-config-controller controller ...
      I0130 14:08:36.107131       1 base_controller.go:40] Caches are synced for CSRSignController 
      I0130 14:08:36.107145       1 base_controller.go:78] Starting #1 worker of CSRSignController controller ...
      I0130 14:08:36.107384       1 base_controller.go:40] Caches are synced for addon-registration-controller 
      I0130 14:08:36.107407       1 base_controller.go:78] Starting #1 worker of addon-registration-controller controller ...
      I0130 14:08:36.108300       1 base_controller.go:40] Caches are synced for CSRApprovingController 
      I0130 14:08:36.108319       1 base_controller.go:78] Starting #1 worker of CSRApprovingController controller ...
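
The worker count per controller can be pulled out of the addon-manager log mechanically. The sketch below is an illustration only: `workers_per_controller` is a hypothetical helper, and the regular expression is written against the "Starting #N worker of X controller" wording seen in the lines above, so it will not match the differently worded klusterlet-agent lines.

```python
import re
from collections import Counter

# Matches e.g. "Starting #1 worker of CSRApprovingController controller ..."
WORKER_RE = re.compile(r"Starting #(\d+) worker of (\S+) controller")


def workers_per_controller(lines):
    """Map controller name -> highest worker number seen in the log."""
    counts = Counter()
    for line in lines:
        match = WORKER_RE.search(line)
        if match:
            name = match.group(2)
            counts[name] = max(counts[name], int(match.group(1)))
    return dict(counts)


# The startup lines quoted above.
sample = """\
I0130 14:08:36.107056       1 base_controller.go:78] Starting #1 worker of addon-deploy-controller controller ...
I0130 14:08:36.107059       1 base_controller.go:78] Starting #1 worker of addon-config-controller controller ...
I0130 14:08:36.107131       1 base_controller.go:40] Caches are synced for CSRSignController
I0130 14:08:36.107145       1 base_controller.go:78] Starting #1 worker of CSRSignController controller ...
I0130 14:08:36.107384       1 base_controller.go:40] Caches are synced for addon-registration-controller
I0130 14:08:36.107407       1 base_controller.go:78] Starting #1 worker of addon-registration-controller controller ...
I0130 14:08:36.108300       1 base_controller.go:40] Caches are synced for CSRApprovingController
I0130 14:08:36.108319       1 base_controller.go:78] Starting #1 worker of CSRApprovingController controller ...
""".splitlines()

workers = workers_per_controller(sample)
for name, count in sorted(workers.items()):
    print(f"{name}: {count}")
```

For the quoted lines, every controller reports a single worker, which matches the observation above.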

      Attaching an example klusterlet-agent log from a managed cluster that has the issue: klusterlet-agent-vm00002.log

      and, for reference, an example klusterlet-agent log from a managed cluster that has no MCOA install issue: klusterlet-agent-vm00168.log

       

      Version-Release number of selected component (if applicable):

      How reproducible:

      Steps to Reproduce:

      1.  
      2.  
      3. ...

      Actual results:

      Expected results:

      Additional info:

              rh-ee-tmange Thibault Mange
              rhn-support-txue Ting Xue
              ACM QE Team