-
Bug
-
Resolution: Duplicate
This is not really a bug report, but I'd like to ask you to review what the product does by default during CNV upgrades.
Yesterday I received my second support case in which the following happens:
- CNV upgrades automatically, at an arbitrary hour the customer did not schedule
- The default workloadUpdateStrategy settings trigger migrations of all VMs
- Migrations get stuck and cannot complete because the VMs are busy with live workloads
Don't get me wrong, the automatic upgrades and the logic around workloadUpdateStrategy are excellent on their own. The problem is that the migrations are triggered by the defaults we ship, with no customer "ack" required to start them. This is causing unwanted side effects for customers.
In yesterday's case, the customer shut down the workload on the VMs so that the migrations could complete and the migration storm would subside. Until then, the migrations were looping, consuming network bandwidth without ever completing.
At the moment, with default settings, VM performance is not really impacted; only the network is flooded. But if, for example, post-copy or some other tunable is enabled in the future, the VMs could see substantially lower performance during migration, which would disrupt the workloads.
I'm afraid this mechanism as it stands will cause us more problems, and more severe ones, than those seen so far.
It's easy to prevent this from happening, but I'd guess every customer will have to hit it once before they change updates to manual or tune workloadUpdateStrategy.
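For reference, here is a rough sketch of the two workarounds mentioned above, written as a small Python helper that only builds and prints `oc patch` commands (it does not touch a cluster). The resource names are my assumptions for a default install: the HyperConverged CR `kubevirt-hyperconverged` and the OLM Subscription `hco-operatorhub`, both in namespace `openshift-cnv`. The batch size and interval values are purely illustrative, not recommendations, and the field names should be checked against the installed version before use.

```python
import json

# Assumption: default namespace of a CNV install; adjust if yours differs.
NAMESPACE = "openshift-cnv"


def workload_update_patch(methods, batch_size, batch_interval):
    """Build a JSON merge patch for spec.workloadUpdateStrategy.

    An empty `methods` list is meant to stop automatic workload updates,
    so VMs are no longer migrated right after an upgrade.
    """
    return {
        "spec": {
            "workloadUpdateStrategy": {
                "workloadUpdateMethods": list(methods),
                "batchEvictionSize": batch_size,
                "batchEvictionInterval": batch_interval,
            }
        }
    }


def manual_approval_patch():
    """Build a JSON merge patch that makes the operator upgrade wait for approval."""
    return {"spec": {"installPlanApproval": "Manual"}}


def oc_patch_command(kind, name, patch):
    """Render an `oc patch` command line for the given resource and merge patch."""
    return "oc patch {} {} -n {} --type merge -p '{}'".format(
        kind, name, NAMESPACE, json.dumps(patch)
    )


if __name__ == "__main__":
    # Option 1: keep live migration but throttle it (illustrative values).
    throttled = workload_update_patch(["LiveMigrate"], 2, "5m0s")
    print(oc_patch_command("hyperconverged", "kubevirt-hyperconverged", throttled))

    # Option 2: switch operator updates to manual approval.
    # Assumption: the Subscription is named "hco-operatorhub".
    print(
        oc_patch_command(
            "subscriptions.operators.coreos.com",
            "hco-operatorhub",
            manual_approval_patch(),
        )
    )
```

Calling `workload_update_patch([], 1, "1m0s")` instead produces a patch with an empty method list, which should leave running VMs untouched after an upgrade until they are migrated or restarted by hand.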
Can you please take a second look at this and consider some improvement (a change of defaults, scheduled upgrade times, an additional knob, customer confirmation, etc.)? I'm afraid this will someday cause more issues.
- causes: CNV-35883 Enable defining schedule/acks/tuning for workloadUpdateStrategy (New)
- links to