OpenShift Bugs / OCPBUGS-59376

GCP scaling is slow when scaling large volumes of nodes


    • Quality / Stability / Reliability
    • Done
    • Bug Fix
      * Before this update, scaling large numbers of nodes was slow because scaling required reconciling each machine several times and each machine was reconciled individually. With this release, up to ten machines can be reconciled concurrently. This change improves the processing speed for machines during scaling. (link:https://issues.redhat.com/browse/OCPBUGS-59376[OCPBUGS-59376])

      Description of problem:

          In a recent hackathon with Amadeus, we found that scaling nodes on GCP (from 0 to 400) was bottlenecked by sequential processing of reconcile requests in the Machine API provider for GCP.

      Making the number of concurrent reconciles configurable, and then scaling the nodes with 10 reconciles running in parallel, significantly improved performance.
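
      To illustrate why the worker count matters, here is a minimal shell sketch. It is not the provider's code (the provider is a controller, not a script): `run_reconciles` is a hypothetical helper, and the 0.1 second sleep is an assumed stand-in for the latency of one reconcile pass.

```shell
# run_reconciles WORKERS COUNT
# Simulate COUNT reconcile requests with at most WORKERS in flight at once.
# The sleep is an assumed placeholder for the cloud API latency of one pass.
run_reconciles() {
  seq 1 "$2" | xargs -P "$1" -I{} sh -c 'sleep 0.1; echo "reconciled machine-$1"' _ {}
}

# Compare wall-clock time by hand:
#   time run_reconciles 1 40     # one at a time
#   time run_reconciles 10 40    # ten in parallel
```

      With one worker the total time grows linearly with the number of requests; with ten workers the same requests are spread across ten slots, so the queue drains in roughly a tenth of the time when each request spends most of its time waiting.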

      Version-Release number of selected component (if applicable):

          4.20 and below

      How reproducible:

          100%

      Steps to Reproduce:

          1. Create an OCP cluster on GCP
          2. Scale several machinesets to a total of around 400 nodes
          3. Observe machines take approximately 20 minutes to join the cluster
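
          One way to observe step 3 is to count Ready nodes over time. This is a sketch: `count_ready` is a hypothetical helper that parses the default column layout of `oc get nodes --no-headers` (NAME, STATUS, ...), and it assumes a cordoned node (`Ready,SchedulingDisabled`) still counts as joined.

```shell
# count_ready: read `oc get nodes --no-headers` output on stdin and print
# how many nodes have a STATUS column that starts with "Ready".
# "NotReady" is excluded; "Ready,SchedulingDisabled" is included.
count_ready() {
  awk '$2 ~ /^Ready/ { n++ } END { print n + 0 }'
}

# Example polling loop (run manually against a live cluster):
#   while true; do date; oc get nodes --no-headers | count_ready; sleep 30; done
```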
      
      
      The steps below were shared by Zhaohua Sun:
      
      1. Set up a cluster with flexy-install. You can rebuild this job; update INSTANCE_NAME_PREFIX and LAUNCHER_VARS, adding the following to LAUNCHER_VARS:
      vm_type_masters: 'n2-standard-16'
      vm_type_workers: 'n2-standard-2'
      
      2. Create infra nodes. You can rebuild this job, updating BUILD_NUMBER with your flexy job ID.
      
      3. Once the above are done, scale up the machinesets. As described in the bug, I scaled to 400 nodes in three rounds.
         First round (150 nodes in total):
      oc scale machineset zhsungcp-djlkm-worker-b --replicas 50
      oc scale machineset zhsungcp-djlkm-worker-c --replicas 50
      oc scale machineset zhsungcp-djlkm-worker-d --replicas 50
         Second round (300 nodes in total):
      oc scale machineset zhsungcp-djlkm-worker-b --replicas 100
      oc scale machineset zhsungcp-djlkm-worker-c --replicas 100
      oc scale machineset zhsungcp-djlkm-worker-d --replicas 100
         Third round (400 nodes in total):
      oc scale machineset zhsungcp-djlkm-worker-b --replicas 130
      oc scale machineset zhsungcp-djlkm-worker-c --replicas 130
      oc scale machineset zhsungcp-djlkm-worker-d --replicas 140
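
      The three rounds above can be wrapped in a small helper. This is a sketch: `scale_round` and the `OC` override are hypothetical names, not part of `oc`. Like the commands above, it assumes the current project is the one that holds the machinesets.

```shell
# scale_round PREFIX B C D
# Set the replica counts of the worker-b, worker-c and worker-d machinesets.
# OC is a hypothetical override: set OC=echo to print the commands
# instead of running them.
scale_round() {
  prefix=$1
  shift
  for zone in b c d; do
    "${OC:-oc}" scale machineset "${prefix}-worker-${zone}" --replicas "$1"
    shift
  done
}

# The three rounds from the report:
#   scale_round zhsungcp-djlkm 50 50 50
#   scale_round zhsungcp-djlkm 100 100 100
#   scale_round zhsungcp-djlkm 130 130 140
```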

      Actual results:

          Nodes take a significant time to join the cluster (approximately 20 minutes when scaling to around 400 nodes)

      Expected results:

          Nodes should join the cluster quickly

      Additional info:

          

              joelspeed Joel Speed
              Meha Bhalodiya