Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: CNV v4.12.0
Affects Version/s: None
Component/s: CNV Virtualization
Labels:
- Scale
- UpgradeBlocker
- cnv-4+
- cnvbugsm
- devel_ack+
- pm_ack+
- qa_ack+

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Ready:
False
BZ Status:
CLOSED
BZ URL:
https://bugzilla.redhat.com/show_bug.cgi?id=2069098
Bugzilla Bug:
RHBZ: 2069098

Sprint:
CNV Virtualization Sprint 225, CNV Virtualization Sprint 226, CNV Virtualization Sprint 227, CNV Virtualization Sprint 228, CNV Virtualization Sprint 229
Severity:
Important

Regression:
None

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Some background:
-------------------------
I'm running a scale OpenShift setup with 100 OpenShift nodes and 47 RHCS 5.0 hosts as an external storage cluster, in preparation for an environment that was requested by a customer. Migration was on the slow side, so I played around with the tweaks mentioned below to find the fastest and safest config for completing such migrations. No matter what I did, the total migration time from start to finish was very similar, which was odd.

This setup is currently running 3000 VMs:
1500 RHEL 8.5 persistent storage VMs
500 Windows 10 persistent storage VMs
1000 Fedora ephemeral storage VMs

The workers are divided into 3 zones:
worker000 - worker031 = Zone0
worker032 - worker062 = Zone1
worker063 - worker096 = Zone2

I start the migration by applying an empty MachineConfig to zone-2, which then causes the nodes to start draining.

This is my migration config:
--------------------
liveMigrationConfig:
  completionTimeoutPerGiB: 800
  parallelMigrationsPerCluster: 20
  parallelOutboundMigrationsPerNode: 4
  progressTimeout: 150
workloads: {}
--------------------

This is the zone-2 config:
--------------------
maxUnavailable: 10
nodeSelector:
  matchLabels:
    node-role.kubernetes.io/zone-2: ""
--------------------

Another thing worth mentioning is that I'm running a custom KubeletConfig, which is required due to the additional 21,400 pods on the cluster:
--------------------
spec:
  kubeletConfig:
    kubeAPIBurst: 200
    kubeAPIQPS: 100
    maxPods: 500
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: enabled
--------------------

To sum up the above config: we are allowed to take down up to 10 nodes at a time, migrate up to 4 VMs out of each node in parallel, and run up to 20 migrations in parallel across the cluster.
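For reference, the parallelism these limits should allow can be worked out with a few lines of Python. This is only a back-of-the-envelope sketch: the numbers are copied from the configs above, and it assumes that every draining node has at least 4 VMs waiting to move and that nothing else (scheduling, storage, network) throttles the migrations.

```python
# Limits copied from the configs quoted in this report.
max_unavailable_nodes = 10          # zone-2 pool: maxUnavailable
parallel_outbound_per_node = 4      # parallelOutboundMigrationsPerNode
parallel_per_cluster = 20           # parallelMigrationsPerCluster

# Assumption: all draining nodes have at least 4 VMs queued for migration.
node_limited = max_unavailable_nodes * parallel_outbound_per_node

# The cluster-wide limit applies on top of the per-node limit.
expected_cap = min(parallel_per_cluster, node_limited)

print(f"per-node limits alone would allow {node_limited} parallel migrations")
print(f"the cluster-wide limit caps this at {expected_cap}")
```

So with 10 nodes draining, the cluster-wide limit of 20 is the one that should be reached first, and that is the number to compare the measurements against.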

As for draining 10 nodes at a time, there are no issues:
--------------------
worker064 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
worker065 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
worker069 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
worker072 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
worker073 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
worker075 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
worker077 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
worker083 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
worker088 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
worker090 Ready,SchedulingDisabled worker,zone-2 19d v1.22.3+e790d7f
--------------------

However, when it comes to the migration itself, I measured how many VMs were actually migrating at any given time during the migration of 1000 VMs:

----------------------------------------------------
VMs migrating    time spent    time spent in percentage
----------------------------------------------------
1                952           10.33%
2                748           8.11%
3                1250          13.56%
4                874           9.48%
5                886           9.61%
6                413           4.48%
7                361           3.92%
8                189           2.05%
9                143           1.55%
10               109           1.18%
11               38            0.41%
12               7             0.08%
13               24            0.26%
Doing nothing    3226          34.99%
----------------------------------------------------

As you can see, about 55% of the time we migrate only 1 to 6 VMs in parallel, and we never go above 13, well short of the configured limit of 20.
(35% of the time we are doing nothing, mostly due to nodes going down for reboots, but not only.)
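The percentages can be re-derived from the raw "time spent" column with a short Python sketch. The data is copied from the table in this report; the time-weighted average concurrency is an extra figure that is not in the original measurements, added here only as one possible way to summarize them. The time unit is whatever the measurement used, since the table does not state it.

```python
# Time spent at each level of parallel migrations, copied from the table.
# Key 0 stands for the "Doing nothing" row.
time_spent = {
    1: 952, 2: 748, 3: 1250, 4: 874, 5: 886, 6: 413, 7: 361,
    8: 189, 9: 143, 10: 109, 11: 38, 12: 7, 13: 24,
    0: 3226,
}

total = sum(time_spent.values())


def share(lo, hi):
    """Percentage of total time spent with lo..hi VMs migrating in parallel."""
    return 100 * sum(t for n, t in time_spent.items() if lo <= n <= hi) / total


idle_pct = share(0, 0)
low_pct = share(1, 6)
high_pct = share(7, 13)
peak = max(time_spent)

# Time-weighted average number of parallel migrations (derived, not measured).
weighted = sum(n * t for n, t in time_spent.items())
avg_overall = weighted / total
avg_while_active = weighted / (total - time_spent[0])

print(f"idle: {idle_pct:.2f}%  1-6 VMs: {low_pct:.2f}%  7-13 VMs: {high_pct:.2f}%")
print(f"peak parallelism: {peak}")
print(f"average parallelism: {avg_overall:.2f} overall, "
      f"{avg_while_active:.2f} while at least one VM is migrating")
```

Even when only the time with at least one migration in flight is counted, the average stays at roughly 4 parallel migrations, which is far below the configured cluster-wide limit of 20.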

Versions of all relevant components:
CNV 4.9.2
RHCS 5.0
OCP 4.9.15

CNV must-gather:
-----------------
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/must-gather.migration-misbehaving.tar.gz