  1. OpenShift Virtualization
  2. CNV-17235

[2069098] Large-scale VM migration is slow due to low migration parallelism


Details

    • CNV Virtualization Sprint 225, CNV Virtualization Sprint 226, CNV Virtualization Sprint 227, CNV Virtualization Sprint 228, CNV Virtualization Sprint 229
    • High

    Description

      Some background:
      -------------------------
      I'm running a scale OpenShift setup with 100 OpenShift nodes, with 47 RHCS 5.0 hosts as an external storage cluster, as preparation for an environment that was requested by a customer. The migration was on the slow side, so I experimented with the settings listed below to find the fastest and safest config
      for completing such migrations. No matter what I changed, the total migration time from start to finish stayed very similar, which was odd.

      This setup is currently running 3000 VMs:
      1500 RHEL 8.5 persistent storage VMs
      500 Windows 10 persistent storage VMs
      1000 Fedora ephemeral storage VMs

      The workers are divided into 3 zones:
      worker000 - worker031 = Zone0
      worker032 - worker062 = Zone1
      worker063 - worker096 = Zone2

      I start the migration by applying an empty MachineConfig to zone-2,
      which causes the nodes in that zone to start draining.

      This is my migration config:
      --------------------
      liveMigrationConfig:
        completionTimeoutPerGiB: 800
        parallelMigrationsPerCluster: 20
        parallelOutboundMigrationsPerNode: 4
        progressTimeout: 150
      workloads: {}
      --------------------

      This is the zone-2 config:
      --------------------
      maxUnavailable: 10
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/zone-2: ""
      --------------------

      Another thing worth mentioning is that I'm running a custom KubeletConfig, which is required because of the additional 21,400 pods on the cluster:
      --------------------
      spec:
        kubeletConfig:
          kubeAPIBurst: 200
          kubeAPIQPS: 100
          maxPods: 500
        machineConfigPoolSelector:
          matchLabels:
            custom-kubelet: enabled
      --------------------

      So, to sum up the above config: we are allowed to take down up to 10 nodes at a time, migrate up to 4 VMs out of each node in parallel, and run up to 20 migrations in parallel across the cluster.

      As for taking down 10 nodes at a time, there are no issues:
      --------------------
      worker064 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
      worker065 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
      worker069 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
      worker072 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
      worker073 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
      worker075 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
      worker077 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
      worker083 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
      worker088 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
      worker090 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
      --------------------
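
      For reference, the ceiling implied by the settings above can be worked out with a small sketch. It assumes that the only limits in play are the per-cluster and per-node outbound settings and that every draining node still has VMs left to migrate; the function name is illustrative and not part of any CNV API.

```python
def max_parallel_migrations(per_cluster: int, per_node_outbound: int, draining_nodes: int) -> int:
    """Upper bound on concurrent migrations when only draining nodes act as sources.

    Assumption: nothing else (bandwidth, scheduling, inbound limits) throttles migrations.
    """
    return min(per_cluster, per_node_outbound * draining_nodes)


# Values taken from the configs above: 20 per cluster, 4 outbound per node,
# and 10 nodes draining at once (maxUnavailable).
ceiling = max_parallel_migrations(per_cluster=20, per_node_outbound=4, draining_nodes=10)
print(ceiling)  # 20
```

      With 10 nodes draining, the per-node limit alone would allow 40 outbound migrations, so under these assumptions the cluster-wide limit of 20 is the expected ceiling.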

      However, when it comes to the migration itself, I measured how many VMs were actually migrating at any given time during the migration of 1000 VMs:

      ----------------------------------------------------
      VMs migrating    Time spent    Time spent (%)
      ----------------------------------------------------
      1                952           10.33%
      2                748           8.11%
      3                1250          13.56%
      4                874           9.48%
      5                886           9.61%
      6                413           4.48%
      7                361           3.92%
      8                189           2.05%
      9                143           1.55%
      10               109           1.18%
      11               38            0.41%
      12               7             0.08%
      13               24            0.26%
      Doing nothing    3226          34.99%
      ----------------------------------------------------

      As you can see, about 55% of the time we migrate only 1 to 6 VMs in parallel, and we never exceed 13, even though the config allows up to 20.
      (The 35% "Doing nothing" is mostly, but not only, due to nodes going down for reboots.)
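
      The percentages and the average parallelism can be re-derived from the measured times with a short sketch. The numbers are copied from the measurements above; key 0 is my own encoding of the "Doing nothing" row, and the time unit is whatever the measurement used.

```python
# Time spent at each parallelism level (number of VMs migrating at once).
# Key 0 represents the "Doing nothing" row.
time_at_level = {
    1: 952, 2: 748, 3: 1250, 4: 874, 5: 886, 6: 413, 7: 361,
    8: 189, 9: 143, 10: 109, 11: 38, 12: 7, 13: 24, 0: 3226,
}


def summarize(times: dict[int, int]) -> tuple[int, dict[int, float], float, float]:
    """Return total time, percentage share per level, and the time-weighted
    average parallelism overall and while at least one VM is migrating."""
    total = sum(times.values())
    share = {level: 100 * t / total for level, t in times.items()}
    weighted = sum(level * t for level, t in times.items())
    active = total - times.get(0, 0)
    return total, share, weighted / total, weighted / active


total, share, avg_overall, avg_active = summarize(time_at_level)
low_parallelism = sum(share[level] for level in range(1, 7))

print(f"total time: {total}")
print(f"share at 1-6 VMs: {low_parallelism:.2f}%")
print(f"average parallelism overall: {avg_overall:.2f}")
print(f"average parallelism while migrating: {avg_active:.2f}")
```

      On these figures the time-weighted average is about 4 concurrent migrations while anything is migrating at all, and about 2.6 once the idle time is included.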

      Versions of all relevant components:
      CNV 4.9.2
      RHCS 5.0
      OCP 4.9.15

      CNV must-gather:
      -----------------
      http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/must-gather.migration-misbehaving.tar.gz

          People

            ibezukh Igor Bezukh
            bbenshab Boaz Ben Shabat
            Sarah Bennert Sarah Bennert