-
Bug
-
Resolution: Done-Errata
-
Critical
-
4.12.z
-
Important
-
None
-
3
-
WINC - Sprint 243, WINC - Sprint 244, WINC - Sprint 245
-
3
-
Rejected
-
Unspecified
-
-
Bug Fix
Description of problem:
When WMCO is upgraded, a reconciliation workflow is triggered to ensure the existing Windows Nodes are up to date with the new version. Because the upgrade process makes a Node unschedulable for a period, only one Node should be upgraded at a time, to keep the availability of Windows workloads as high as possible.
There are two related issues occurring here:
First, when there are multiple Machine Nodes to upgrade, the WMCO Machine controller upgrades them sequentially. However, if an error occurs during the upgrade process, the upgrade of that Machine stops and the Machine is moved to the end of the queue. This can continue until all Machines are partially upgraded and unusable.
The second issue is that if a cluster has BYOH nodes and Machine nodes, it is possible for a BYOH node and a Machine node to go through the upgrade process at the same time, as the two controllers run concurrently.
Both of these issues are caused by WMCO not tracking whether a Node is currently mid-upgrade.
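To illustrate the missing bookkeeping, here is a minimal sketch of a gate that both controllers could consult before starting an upgrade. This is not WMCO's actual implementation: the `upgradeGate` type, its methods, and the node names are hypothetical. It also keeps state in memory only, so a real operator would probably need to persist the marker (for example on a cluster object) to survive restarts.

```go
package main

import (
	"fmt"
	"sync"
)

// upgradeGate is a hypothetical helper, not WMCO's actual code. It records
// which node, if any, is currently mid-upgrade so that independent
// controllers cannot upgrade two nodes at once.
type upgradeGate struct {
	mu      sync.Mutex
	current string // name of the node being upgraded; empty when idle
}

// tryAcquire reports whether nodeName may proceed with its upgrade. It
// succeeds when no upgrade is in progress, or when nodeName already holds
// the gate, so a node that failed partway is retried before any other node.
func (g *upgradeGate) tryAcquire(nodeName string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.current == "" || g.current == nodeName {
		g.current = nodeName
		return true
	}
	return false
}

// release clears the gate, but only if nodeName is the current holder.
func (g *upgradeGate) release(nodeName string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.current == nodeName {
		g.current = ""
	}
}

func main() {
	gate := &upgradeGate{}
	fmt.Println(gate.tryAcquire("machine-node-1")) // Machine controller starts an upgrade
	fmt.Println(gate.tryAcquire("byoh-node-1"))    // BYOH controller has to wait
	fmt.Println(gate.tryAcquire("machine-node-1")) // retry after an error keeps the gate
	gate.release("machine-node-1")                 // upgrade finished
	fmt.Println(gate.tryAcquire("byoh-node-1"))    // BYOH node may now proceed
}
```

Letting the current holder re-acquire the gate addresses the first issue (a failed Machine is retried instead of being left half-upgraded behind others), and sharing one gate between controllers addresses the second.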
Version-Release number of selected component (if applicable):
OCP 4.12
How reproducible:
Always
Steps to Reproduce:
1. Install a previous version of WMCO
2. Create a Windows MachineSet
3. Add a BYOH Windows Node to the cluster
4. Allow WMCO to configure both Windows machines as nodes
5. Upgrade WMCO to the latest version
Actual results:
Multiple Nodes are upgraded at the same time, with several Nodes having their desired version annotation changed simultaneously.
Expected results:
WMCO upgrades one node at a time.
QE Instructions:
- Install a previous version of WMCO
- Create BYOH and Machine instances, at least 3 in total. Ideally, create 4 in total: two (2) BYOH instances and two (2) Windows Machines from different MachineSets.
- Upgrade WMCO with:
- operator-sdk
- or uninstall the previous version and install a new version. Note: Do not delete the WMCO namespace, otherwise the resources in that namespace are lost, e.g. the windows-instances ConfigMap.
- Check that Windows nodes are reconciled one at a time, i.e. at most one Windows node should be marked as not ready at any point during the upgrade.
- Check and collect WMCO logs
In addition, you can deploy a DaemonSet with a Windows web server, expose it as a Service, and continuously curl the Service to detect disruption: the curl command fails if no Windows node is available to serve the request.
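As an alternative to a shell curl loop, the disruption check could be sketched in Go as below. The function names `countFailures` and `newStubServer` are hypothetical, and the stub server only stands in for the exposed Windows web server so the example runs without a cluster. Against a real cluster, the Service or Route URL would be passed instead.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// countFailures requests url `attempts` times, sleeping `interval` between
// requests, and returns how many requests errored or returned a non-200 status.
func countFailures(client *http.Client, url string, attempts int, interval time.Duration) int {
	failures := 0
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err != nil {
			failures++
		} else {
			if resp.StatusCode != http.StatusOK {
				failures++
			}
			resp.Body.Close()
		}
		time.Sleep(interval)
	}
	return failures
}

// newStubServer starts a local test server that always answers with the
// given status. It is a stand-in for the Windows web server Service.
func newStubServer(status int) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
	}))
}

func main() {
	srv := newStubServer(http.StatusOK)
	defer srv.Close()

	// A short timeout keeps a hung request from hiding an outage.
	client := &http.Client{Timeout: 2 * time.Second}
	failed := countFailures(client, srv.URL, 5, 10*time.Millisecond)
	fmt.Println("failed requests:", failed)
}
```

Any non-zero failure count during the upgrade window suggests that no Windows node was available to serve traffic at some point.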
QE notes:
- To test this scenario, the minimum recommended number of Windows nodes is 3. Only one Windows node should be upgrading at any given time.
- Expect a longer overall elapsed time in the upgrade since the process is now serial.
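The "one node at a time" check can be expressed as a small assertion over node status samples collected during the upgrade. The sketch below assumes the samples have already been gathered (for example by listing nodes periodically). The `nodeStatus` type, the function names, and the node names are hypothetical.

```go
package main

import "fmt"

// nodeStatus is a hypothetical, simplified view of one Windows node at the
// moment a sample was taken.
type nodeStatus struct {
	Name  string
	Ready bool
}

// notReadyNodes returns the names of the nodes that are not Ready in one sample.
func notReadyNodes(snapshot []nodeStatus) []string {
	var names []string
	for _, n := range snapshot {
		if !n.Ready {
			names = append(names, n.Name)
		}
	}
	return names
}

// firstViolation returns the index of the first sample in which more than
// one node is not Ready, or -1 if every sample satisfies the expectation.
func firstViolation(snapshots [][]nodeStatus) int {
	for i, s := range snapshots {
		if len(notReadyNodes(s)) > 1 {
			return i
		}
	}
	return -1
}

func main() {
	snapshots := [][]nodeStatus{
		{{Name: "byoh-1", Ready: true}, {Name: "machine-1", Ready: false}, {Name: "machine-2", Ready: true}},
		{{Name: "byoh-1", Ready: false}, {Name: "machine-1", Ready: false}, {Name: "machine-2", Ready: true}},
	}
	if i := firstViolation(snapshots); i >= 0 {
		fmt.Printf("sample %d: several nodes not ready at once: %v\n", i, notReadyNodes(snapshots[i]))
	} else {
		fmt.Println("at most one node was not ready in every sample")
	}
}
```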
- is cloned by
-
OCPBUGS-22984 WMCO upgrade strategy fails to upgrade one node at a time
- Closed
- is depended on by
-
OCPBUGS-22984 WMCO upgrade strategy fails to upgrade one node at a time
- Closed
- links to
-
RHBA-2023:120235 Red Hat OpenShift for Windows Containers 10.15.0 product release