  1. Red Hat Advanced Cluster Management
  2. ACM-5842

performance issue due to tons of git clone actions


    • Generate cluster secrets with RBAC SA/User used in ArgoCD push model
    • False
    • None
    • False
    • Green
    • To Do

      Epic Goal

       

      In the current app subscription implementation, we clean up and re-clone the Git repo for each subscription on every reconciliation. This is an expensive operation, and it caused the performance issue reported in the following support case:

       

      https://access.redhat.com/support/cases/#/case/03530250

       

      Here are some key findings:

       

      5. Connecting to those two pods in a terminal session, I then found that the cloned repository is stored in /tmp. This is ephemeral storage. As mentioned above, it is located on the sda disk of the node (in a container-specific directory).
      Each clone of the repository takes up 118 MB.
       
      6. Based on 795 clones within the last 60 minutes (60 times 60 seconds = 3600 seconds) and a size of 118 MB per clone, this results in the following amount of data written to the storage per second:
      795 * 118MB / 3600s = 26.06 MB/s
      This is an average; in reality there will be spikes.
      
      7. ocplab4 is configured as a hub cluster. For managed clusters, the number of clones per hour is lower at 480, and the amount of data written comes to 15.7 MB/s.
      
      8. The reason that the number of clones per hour is so high is that one clone is done for each RHACM subscription. This is repeated every 3 minutes.  With the 27 subscriptions we have, that’s 60 minutes / 3 minutes * 27 = 540, which roughly matches the count that Splunk gives us (528).
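The arithmetic in findings 6 to 8 can be reproduced with a short script. This is only a sketch that uses the figures quoted above (795 and 480 clones per hour, 118 MB per clone, 27 subscriptions, a 3-minute loop); the function names are illustrative and do not come from the case.

```python
def write_rate_mb_per_s(clones_per_hour: int, clone_size_mb: float) -> float:
    """Average data written per second for a given clone count and clone size."""
    return clones_per_hour * clone_size_mb / 3600


def expected_clones_per_hour(subscriptions: int, interval_minutes: float) -> float:
    """One clone per subscription per reconciliation interval."""
    return 60 / interval_minutes * subscriptions


hub_rate = write_rate_mb_per_s(795, 118)       # hub cluster (ocplab4)
managed_rate = write_rate_mb_per_s(480, 118)   # managed clusters
predicted = expected_clones_per_hour(27, 3)    # 27 subscriptions, 3-minute loop

print(f"hub: {hub_rate:.2f} MB/s")
print(f"managed: {managed_rate:.2f} MB/s")
print(f"predicted clones/hour: {predicted:.0f}")
```

The predicted 540 clones per hour is close to the 528 counted in Splunk, which supports the one-clone-per-subscription-per-loop explanation.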
      

       

      Here is the workaround suggested to the customer:

       

      Change the reconcile rate to Low in the RHACM channel. This changes the reconciliation frequency from 3 to 60 minutes, and will reduce the amount of I/O by a factor of 20.
      This change is easy to implement: it is a one-line change in the Channel resources on each hub cluster.
      If a configuration change needs to be applied before the 60 minutes are over, a reconciliation can be triggered in the RHACM Web UI. 
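As a rough illustration of the one-line change, the sketch below builds a JSON merge patch for a Channel and checks the factor-of-20 claim. The ticket does not name the field, so the annotation key `apps.open-cluster-management.io/reconcile-rate` is an assumption that should be verified against the RHACM documentation for the installed version; the helper name is hypothetical.

```python
import json

# Assumption: the reconcile rate is set through this Channel annotation.
# The ticket does not name the field; check the RHACM documentation.
RECONCILE_RATE_ANNOTATION = "apps.open-cluster-management.io/reconcile-rate"

# Reconciliation intervals in minutes, as stated in the ticket.
DEFAULT_INTERVAL_MIN = 3
LOW_INTERVAL_MIN = 60


def reconcile_rate_patch(rate: str) -> dict:
    """Build a JSON merge patch that sets the reconcile rate on a Channel."""
    return {"metadata": {"annotations": {RECONCILE_RATE_ANNOTATION: rate}}}


patch = reconcile_rate_patch("low")
io_reduction = LOW_INTERVAL_MIN / DEFAULT_INTERVAL_MIN

print(json.dumps(patch))
print(f"clone I/O reduced by a factor of {io_reduction:.0f}")
```

The printed JSON could then be passed to something like `oc patch channel <name> -n <namespace> --type merge -p '<json>'`, where the channel name and namespace are placeholders.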
      

       

      There is another workaround:

       

      Create a dedicated small Git repo for each hub appsub, containing only the YAML files of a single application. Each appsub then clones only a small Git repo on each reconciliation.
      

       

      To resolve the performance issue, we can enhance the Git repo handling as follows:

      1. Get the Git repo root URL defined in channel.spec.pathname, then compute its SHA-256 checksum and use it as the local folder name. Each channel therefore gets a unique folder name derived from the 32-byte digest.
      2. On the first reconciliation, clone the Git repo into that local folder.
      3. In later loops, only run git fetch and merge.

      This has two benefits:

      1. git clone happens only once, when the local folder for the repo does not yet exist. If there are no new commits, each later loop only runs git fetch and merge, which should be much less expensive.
      2. Each channel has just one local repo. Multiple subscriptions may pull different branches or folders from the same Git repo channel; in that case, all of those subscriptions share the same local Git repo.
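The three steps proposed above can be sketched as follows. This is a minimal illustration, not the actual subscription controller code: the base directory, the function names, and the use of a hex-encoded digest as the folder name are all assumptions, and the sketch does not cover how subscriptions on different branches would share one working tree.

```python
import hashlib
import subprocess
from pathlib import Path
from typing import Callable, List

# Hypothetical base directory; the ticket only says clones currently live in /tmp.
BASE_DIR = Path("/tmp/git-channels")


def repo_folder_name(repo_url: str) -> str:
    """Hex-encode the 32-byte SHA-256 digest of the repo URL (64 characters)."""
    return hashlib.sha256(repo_url.encode("utf-8")).hexdigest()


def run_git(args: List[str]) -> None:
    """Run a git command and raise if it fails."""
    subprocess.run(["git", *args], check=True)


def sync_repo(repo_url: str, base_dir: Path = BASE_DIR,
              git: Callable[[List[str]], None] = run_git) -> Path:
    """Clone on first use; afterwards only fetch and merge new commits."""
    local = base_dir / repo_folder_name(repo_url)
    if not (local / ".git").is_dir():
        # Step 2: no local copy for this channel yet, so clone once.
        base_dir.mkdir(parents=True, exist_ok=True)
        git(["clone", repo_url, str(local)])
    else:
        # Step 3: reuse the existing copy and bring in new commits only.
        git(["-C", str(local), "fetch", "origin"])
        git(["-C", str(local), "merge", "--ff-only"])
    return local
```

Because the folder name depends only on the repo URL, every subscription that points at the same channel resolves to the same local path, so calling `sync_repo` for a second subscription on that channel triggers a fetch rather than another clone.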

      Why is this important?

      ...

      Scenarios

      ...

      Acceptance Criteria

      ...

      Dependencies (internal and external)

      1. ...

      Previous Work (Optional):

      1. ...

      Open questions:

      Done Checklist

      • CI - CI is running, tests are automated and merged.
      • Release Enablement <link to Feature Enablement Presentation>
      • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub
        Issue>
      • DEV - Upstream documentation merged: <link to meaningful PR or GitHub
        Issue>
      • DEV - Downstream build attached to advisory: <link to errata>
      • QE - Test plans in Polarion: <link or reference to Polarion>
      • QE - Automated tests merged: <link or reference to automated tests>
      • DOC - Downstream documentation merged: <link to meaningful PR>

      ACM Epic Done Checklist

      See presentation and details.

      Update with "Y" if Epic meets the requirement, "N" if it does not, or "N/A" if not applicable.

      • _ FIPS Readiness
      • _ Works in Disconnected
      • _ Global Proxy Support
      • _ Installable to Infrastructure Nodes
      • _ No impacts to Performance and Scalability
      • _ Backup and Restorable

              xiangli@redhat.com Xiangjing Li
              xiangli@redhat.com Xiangjing Li
              David Huynh David Huynh