Type: Bug
Resolution: Done-Errata
Priority: Major
Status: CLOSED
Sprints: CNV Virtualization Sprint 225, CNV Virtualization Sprint 226, CNV Virtualization Sprint 227, CNV Virtualization Sprint 228, CNV Virtualization Sprint 229
Severity: High
Some background:
-------------------------
I'm running a scale OpenShift setup with 100 OpenShift nodes, in preparation for an environment requested by a customer, with 47 RHCS 5.0 hosts as an external storage cluster. Live migration was on the slow side, so I played around with the tweaks mentioned below to find the fastest and safest config for completing such migrations, but no matter what I changed, the total migration time from start to finish stayed roughly the same, which was odd.
This setup is currently running 3000 VMs:
1500 RHEL 8.5 VMs with persistent storage
500 Windows 10 VMs with persistent storage
1000 Fedora VMs with ephemeral storage
The workers are divided into 3 zones:
worker000 - worker031 = zone-0
worker032 - worker062 = zone-1
worker063 - worker096 = zone-2
I start the migration by applying an empty MachineConfig to zone-2, which causes those nodes to start draining.
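For reference, the trigger object is just a no-op MachineConfig targeted at the pool; a minimal sketch (the name is mine, and I assume the pool selects configs by the usual machineconfiguration.openshift.io/role label):
--------------------
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-zone-2-empty
  labels:
    machineconfiguration.openshift.io/role: zone-2
spec:
  config:
    ignition:
      version: 3.2.0   # no actual payload; applying it still rolls the pool
--------------------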
This is my migration config:
--------------------
liveMigrationConfig:
  completionTimeoutPerGiB: 800
  parallelMigrationsPerCluster: 20
  parallelOutboundMigrationsPerNode: 4
  progressTimeout: 150
workloads: {}
--------------------
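These values can be applied with a single patch; a sketch, assuming the default HyperConverged CR (kubevirt-hyperconverged in the openshift-cnv namespace):
--------------------
oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type merge -p \
  '{"spec":{"liveMigrationConfig":{"completionTimeoutPerGiB":800,"parallelMigrationsPerCluster":20,"parallelOutboundMigrationsPerNode":4,"progressTimeout":150}}}'
--------------------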
This is the zone-2 MachineConfigPool config:
--------------------
maxUnavailable: 10
nodeSelector:
  matchLabels:
    node-role.kubernetes.io/zone-2: ""
--------------------
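For context, the full pool object looks roughly like this; a sketch, and the machineConfigSelector is my assumption (it is not shown above) based on the usual custom-pool convention:
--------------------
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: zone-2
spec:
  maxUnavailable: 10
  machineConfigSelector:
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values: [worker, zone-2]
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/zone-2: ""
--------------------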
Another thing worth mentioning is that I'm running a custom KubeletConfig, required because of the additional 21,400 pods on the cluster:
--------------------
spec:
  kubeletConfig:
    kubeAPIBurst: 200
    kubeAPIQPS: 100
    maxPods: 500
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: enabled
--------------------
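Rough arithmetic behind maxPods (my own back-of-the-envelope, assuming the extra pods spread evenly): 21,400 pods across ~96 workers is about 223 pods per node, on top of ~31 virt-launcher pods per node for the 3000 VMs plus the usual system pods, which is well above the OpenShift default maxPods of 250.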
So to sum up the above config: we are allowed to take down up to 10 nodes at a time, migrate up to 4 VMs out of each node, and run up to 20 parallel migrations in total.
As far as 10 nodes at a time goes, no issues there:
--------------------
worker064 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
worker065 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
worker069 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
worker072 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
worker073 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
worker075 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
worker077 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
worker083 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
worker088 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
worker090 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
--------------------
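(The listing above is just the zone-2 slice of the node list; something like oc get nodes -l node-role.kubernetes.io/zone-2 | grep SchedulingDisabled while the pool is rolling.)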
However, when it comes to the migrations themselves, I measured how many VMs were actually migrating at any given time over the course of migrating 1000 VMs:
---------------------------------------------
VMs migrating | Time spent | % of total time
---------------------------------------------
1             |        952 |          10.33%
2             |        748 |           8.11%
3             |       1250 |          13.56%
4             |        874 |           9.48%
5             |        886 |           9.61%
6             |        413 |           4.48%
7             |        361 |           3.92%
8             |        189 |           2.05%
9             |        143 |           1.55%
10            |        109 |           1.18%
11            |         38 |           0.41%
12            |          7 |           0.08%
13            |         24 |           0.26%
Doing nothing |       3226 |          34.99%
---------------------------------------------
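For reference, a minimal sketch of how a sample like this can be collected (an illustrative approach, counting VirtualMachineInstanceMigration (vmim) objects in the Running phase once per second; the log path is arbitrary):
--------------------
#!/bin/bash
# Once per second, record how many live migrations are in flight.
while true; do
  n=$(oc get vmim -A -o json | \
      jq '[.items[] | select(.status.phase=="Running")] | length')
  echo "$(date +%s) $n" >> /tmp/migrations-in-flight.log
  sleep 1
done
--------------------
Afterwards, awk '{c[$2]++} END {for (n in c) print n, c[n] "s"}' /tmp/migrations-in-flight.log gives the time spent at each concurrency level.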
As you can see, roughly 55% of the time we are migrating no more than 6 VMs in parallel, even though the config allows far more (10 draining nodes x 4 outbound migrations per node = 40 candidates, capped at 20 by parallelMigrationsPerCluster; the observed peak was only 13).
The remaining ~35% of the time nothing is migrating at all, mostly, but not only, because nodes are down for reboots.
Versions of all relevant components:
CNV 4.9.2
RHCS 5.0
OCP 4.9.15
CNV must-gather:
-----------------
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/must-gather.migration-misbehaving.tar.gz