Type: Bug
Resolution: Done-Errata
Priority: Critical
Fix Version/s: 4.15.0
Affects Version/s: 4.12.z
Component/s: Windows Containers
Labels:
- migrated_from_bz

Severity:
Important
Regression:
None
Story Points:
3
Sprint:
WINC - Sprint 243, WINC - Sprint 244, WINC - Sprint 245
sprint_count:
3
Release Blocker:
Rejected
Architecture:

Unspecified
Release Note Text:

*Cause*: Machine and BYOH instance reconciliation events were not synchronized.
*Consequence*: Machine and BYOH instances could reconcile simultaneously, affecting running workloads.
*Fix*: A locking mechanism was introduced so that only one node is allowed to upgrade at a time.
*Result*: Machine and BYOH instances reconcile one at a time.
Release Note Type:
Bug Fix
Target Version:

4.15.0

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:
When WMCO is upgraded, a reconciliation workflow is triggered to ensure the existing Windows Nodes are up to date with the new version. Because the upgrade process makes a Node unschedulable for a period, only one Node should be upgraded at a time, to keep availability for Windows workloads as high as possible.

There are two related issues occurring here:

First, when there are multiple Machine Nodes to upgrade, the WMCO Machine controller upgrades them sequentially. However, if an error occurs during the upgrade process, the upgrade of that Machine stops and the Machine is moved to the end of the queue. This can continue until all Machines are partially upgraded and unusable.

The second issue is that if a cluster has BYOH nodes and Machine nodes, it is possible for a BYOH node and a Machine node to go through the upgrade process at the same time, as the two controllers run concurrently.

Both of these issues are caused by WMCO not keeping track of when a Node is mid-upgrade.
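The tracking described above can be illustrated with a minimal sketch of a label-based lock, in the spirit of the `upgrading` label mentioned in the linked pull request. This is not WMCO's implementation: the label key, the `Node` and `Cluster` types, and the `TryLock`/`Unlock` names are all hypothetical, and an in-memory mutex stands in for the API server. The sketch assumes a node that already holds the lock may re-acquire it, so a Machine whose upgrade failed is retried before any other node starts.

```go
package main

import (
	"fmt"
	"sync"
)

// upgradingLabel is a hypothetical label key; the key used by WMCO may differ.
const upgradingLabel = "example.com/upgrading"

// Node is a simplified stand-in for a Kubernetes Node object.
type Node struct {
	Name   string
	Labels map[string]string
}

// Cluster holds the nodes shared by both controllers. The mutex stands in
// for the serialization of writes that a real API server would provide.
type Cluster struct {
	mu    sync.Mutex
	nodes map[string]*Node
}

// NewCluster creates a cluster with the given node names and no labels.
func NewCluster(names ...string) *Cluster {
	c := &Cluster{nodes: make(map[string]*Node)}
	for _, n := range names {
		c.nodes[n] = &Node{Name: n, Labels: map[string]string{}}
	}
	return c
}

// TryLock marks the named node as upgrading unless a different node already
// carries the label. It returns true if the named node holds the lock after
// the call. Re-acquiring by the current holder succeeds.
func (c *Cluster) TryLock(name string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	target, ok := c.nodes[name]
	if !ok {
		return false
	}
	for _, n := range c.nodes {
		if _, held := n.Labels[upgradingLabel]; held && n.Name != name {
			return false
		}
	}
	target.Labels[upgradingLabel] = "true"
	return true
}

// Unlock removes the label once the node's upgrade has completed.
func (c *Cluster) Unlock(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if n, ok := c.nodes[name]; ok {
		delete(n.Labels, upgradingLabel)
	}
}

func main() {
	c := NewCluster("machine-0", "byoh-0")
	fmt.Println(c.TryLock("machine-0")) // true: no node is upgrading yet
	fmt.Println(c.TryLock("byoh-0"))    // false: machine-0 is mid-upgrade
	fmt.Println(c.TryLock("machine-0")) // true: the holder may retry after an error
	c.Unlock("machine-0")
	fmt.Println(c.TryLock("byoh-0")) // true: the lock is free again
}
```

Because the lock state lives on the node objects rather than inside one controller, both the Machine controller and the BYOH controller would see the same answer, which addresses both issues with one mechanism.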

Version-Release number of selected component (if applicable):
OCP 4.12

How reproducible:
Always

Steps to Reproduce:
1. Install a previous version of WMCO
2. Create a Windows MachineSet
3. Add a BYOH Windows Node to the cluster
4. Allow WMCO to configure both Windows machines as nodes
5. Upgrade WMCO to the latest version

Actual results:
Multiple nodes are upgraded at the same time: the desired version annotation is changed on several Nodes at once.

Expected results:
WMCO upgrades one node at a time.

QE Instructions:

Install a previous version of WMCO
Create BYOH and Machine instances, at least 3 in total. Ideally, create 4: two (2) BYOH instances and two (2) Windows Machines from different MachineSets.
Upgrade WMCO with:
- operator-sdk
- or uninstall the previous version and install a new version. Note: Do not delete the WMCO namespace to avoid losing the NS resources, e.g. the windows-instance configMap.
Check Windows nodes are reconciled one at a time, i.e. there should be only one Windows node marked as not ready during the upgrade.
Check and collect WMCO logs
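The "one not-ready node at a time" check can be automated. The following is a minimal sketch, assuming node readiness is sampled periodically during the upgrade and recorded as snapshots; the `NodeStatus` type, the function names, and the sample data are hypothetical and only illustrate the check itself.

```go
package main

import "fmt"

// NodeStatus is a simplified record of one Windows node at one sample time.
type NodeStatus struct {
	Name  string
	Ready bool
}

// notReady returns the names of the nodes that are not ready in a snapshot.
func notReady(snapshot []NodeStatus) []string {
	var names []string
	for _, n := range snapshot {
		if !n.Ready {
			names = append(names, n.Name)
		}
	}
	return names
}

// firstViolation returns the index of the first snapshot in which more than
// one node is not ready, or -1 if every snapshot satisfies the expectation.
func firstViolation(snapshots [][]NodeStatus) int {
	for i, s := range snapshots {
		if len(notReady(s)) > 1 {
			return i
		}
	}
	return -1
}

func main() {
	// Hypothetical samples taken while WMCO upgrades three nodes.
	samples := [][]NodeStatus{
		{{"machine-0", false}, {"machine-1", true}, {"byoh-0", true}},
		{{"machine-0", true}, {"machine-1", true}, {"byoh-0", false}},
		{{"machine-0", true}, {"machine-1", false}, {"byoh-0", false}},
	}
	fmt.Println(firstViolation(samples)) // 2: two nodes are not ready in the third sample
}
```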

In addition, you can deploy a DaemonSet with a Windows web server, expose it as a service, and curl the service continuously to detect disruption: the curl command fails if no Windows node is available to serve the request.
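As an alternative to a shell loop around curl, the disruption check can be sketched as a small Go probe. This is an illustration under assumptions: the service URL is supplied by the tester as a command-line argument, and the names `probeOnce`, `countFailures`, and `ErrUnavailable`, the 2-second timeout, and the 30 attempts are all hypothetical choices, not part of the QE procedure.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"os"
	"time"
)

// ErrUnavailable marks a probe that did not get a successful response.
var ErrUnavailable = errors.New("service unavailable")

// probeOnce issues one GET request and treats any transport error or
// non-2xx status as a disruption.
func probeOnce(client *http.Client, url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return fmt.Errorf("%w: %v", ErrUnavailable, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("%w: HTTP %d", ErrUnavailable, resp.StatusCode)
	}
	return nil
}

// countFailures runs check the given number of times, waiting interval
// between attempts, and returns how many attempts failed.
func countFailures(check func() error, attempts int, interval time.Duration) int {
	failures := 0
	for i := 0; i < attempts; i++ {
		if err := check(); err != nil {
			failures++
			fmt.Println("probe failed:", err)
		}
		if i < attempts-1 {
			time.Sleep(interval)
		}
	}
	return failures
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: probe <service-url>")
		return
	}
	client := &http.Client{Timeout: 2 * time.Second}
	check := func() error { return probeOnce(client, os.Args[1]) }
	failures := countFailures(check, 30, time.Second)
	fmt.Println("failed probes:", failures)
}
```

With a serial upgrade and at least two Windows nodes backing the service, the failure count would be expected to stay at zero for the duration of the upgrade.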

QE notes:

To test this scenario, the minimum recommended number of Windows nodes is 3. Only one Windows node should be upgrading at any time.
Expect a longer overall elapsed time in the upgrade since the process is now serial.

is cloned by

OCPBUGS-22984 WMCO upgrade strategy fails to upgrade one node at a time

Closed

is depended on by

OCPBUGS-22984 WMCO upgrade strategy fails to upgrade one node at a time

Closed

links to

openshift/windows-machine-config-operator#1901: OCPBUGS-8996: Introduce `upgrading` label to block concurrent upgrades

RHBA-2023:120235 Red Hat OpenShift for Windows Containers 10.15.0 product release

mentioned on

Merge request - Updated US source to: d76978f Merge pull request #1901 from jrvaldes/OCPBUGS-8996-unhealthy

Assignee: Jose Valdes

Reporter: Jose Valdes

QA Contact: Aharon Rasouli

Contributing Groups: Red Hat Employee

Votes: 0

Watchers: 10

Created: 2021/10/06 5:19 AM

Updated: 2024/02/27 3:16 PM

Resolved: 2024/02/27 3:16 PM
