OpenShift Bugs / OCPBUGS-8996

WMCO upgrade strategy fails to upgrade one node at a time


    • Release Note Text:
      *Cause*: Lack of synchronization between Machine and BYOH instance reconciliation events.
      *Consequence*: Machine and BYOH instances can reconcile simultaneously, affecting running workloads.
      *Fix*: Introduce a locking mechanism so that only one node is allowed to upgrade at a time.
      *Result*: Machine and BYOH instances reconcile one at a time.
    • Release Note Type: Bug Fix

      Description of problem:
      When WMCO is upgraded, a reconciliation workflow is triggered to ensure the existing Windows Nodes are up to date with the new version. Because the upgrade process makes each Node unschedulable for a period, only one node should be upgraded at a time, in order to keep availability for Windows workloads as high as possible.

      There are two related issues occurring here:

      The first issue occurs when there are multiple Machine Nodes to upgrade: the WMCO Machine controller upgrades them sequentially, but if an error occurs while upgrading a Machine, that upgrade stops and the Machine is moved to the end of the queue. This can continue until all Machines are partially upgraded and unusable.

      The second issue is that if a cluster has BYOH nodes and Machine nodes, it is possible for a BYOH node and a Machine node to go through the upgrade process at the same time, as the two controllers run concurrently.

      Both of these issues are caused by WMCO not keeping track of when a Node is currently mid-upgrade.
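
      The fix direction is a locking mechanism shared by the two controllers. Below is a minimal sketch of the idea only, not the actual WMCO implementation: all names are hypothetical, and a process-wide mutex stands in for whatever synchronization primitive the operator actually uses.

      package main

      import (
          "fmt"
          "sync"
      )

      // nodeUpgradeLock is a hypothetical process-wide lock shared by the Machine
      // and BYOH controllers. Holding it for the duration of a node upgrade
      // guarantees that only one node is mid-upgrade at any time, even though the
      // two controllers reconcile concurrently.
      var nodeUpgradeLock sync.Mutex

      func upgradeNode(name string) {
          nodeUpgradeLock.Lock() // blocks until no other node is mid-upgrade
          defer nodeUpgradeLock.Unlock()
          fmt.Printf("upgrading node %s\n", name)
          // ... drain, reconfigure, and uncordon the node here ...
      }

      func main() {
          var wg sync.WaitGroup
          // Simulate the Machine and BYOH controllers reconciling at the same time.
          for _, n := range []string{"byoh-0", "machine-0", "machine-1"} {
              wg.Add(1)
              go func(n string) {
                  defer wg.Done()
                  upgradeNode(n)
              }(n)
          }
          wg.Wait()
      }

      In a real controller the lock would be held across the full drain/upgrade/uncordon cycle, which is also why the overall upgrade takes longer once reconciliation is serialized (see the QE notes below).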

      Version-Release number of selected component (if applicable):
      OCP 4.12

      How reproducible:
      Always

      Steps to Reproduce:
      1. Install a previous version of WMCO
      2. Create a Windows MachineSet
      3. Add a BYOH Windows Node to the cluster
      4. Allow WMCO to configure both Windows machines as nodes
      5. Upgrade WMCO to the latest version

      Actual results:
      Multiple Nodes are upgraded at the same time, with more than one Node having its desired version annotation changed simultaneously.

      Expected results:
      WMCO upgrades one node at a time.

       

      QE Instructions:

      • Install a previous version of WMCO
      • Create BYOH and Machine instances, at least three (3) in total. Ideally four (4) total: two (2) BYOH instances and two (2) Windows Machines from different MachineSets.
      • Upgrade WMCO with:
        • operator-sdk
        • or uninstall the previous version and install the new version. Note: do not delete the WMCO namespace, to avoid losing the namespace resources, e.g. the windows-instances ConfigMap.
      • Check that Windows nodes are reconciled one at a time, i.e. only one Windows node should be marked NotReady at any point during the upgrade (see the sketch after this list).
      • Check and collect WMCO logs
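
      A minimal sketch of that check, assuming default kubeconfig access and using client-go; it polls the Windows nodes and reports how many are NotReady at each sample:

      package main

      import (
          "context"
          "fmt"
          "time"

          corev1 "k8s.io/api/core/v1"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          // Load credentials from the default kubeconfig (~/.kube/config).
          cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
          if err != nil {
              panic(err)
          }
          client := kubernetes.NewForConfigOrDie(cfg)
          for {
              // kubernetes.io/os=windows is the standard label on Windows nodes.
              nodes, err := client.CoreV1().Nodes().List(context.TODO(),
                  metav1.ListOptions{LabelSelector: "kubernetes.io/os=windows"})
              if err != nil {
                  panic(err)
              }
              notReady := 0
              for _, n := range nodes.Items {
                  for _, c := range n.Status.Conditions {
                      if c.Type == corev1.NodeReady && c.Status != corev1.ConditionTrue {
                          notReady++
                      }
                  }
              }
              fmt.Printf("%s: %d/%d Windows nodes NotReady\n",
                  time.Now().Format(time.RFC3339), notReady, len(nodes.Items))
              time.Sleep(10 * time.Second)
          }
      }

      Run it for the duration of the upgrade; any sample reporting more than one NotReady node reproduces the bug.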

      In addition, you can deploy a DaemonSet with a Windows webserver, expose it as a service, and continuously curl the service to check for disruption; the curl command fails whenever no Windows node is available to serve the request. A small prober like the sketch below can automate this.
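
      A minimal prober sketch; the URL is a placeholder for however the service is actually exposed (route, LoadBalancer, or NodePort) in your cluster:

      package main

      import (
          "fmt"
          "net/http"
          "time"
      )

      func main() {
          const url = "http://win-webserver.example.com/" // placeholder endpoint
          client := &http.Client{Timeout: 5 * time.Second}
          for {
              resp, err := client.Get(url)
              if err != nil {
                  // A failure here means no Windows node served the request.
                  fmt.Printf("%s: request failed: %v\n", time.Now().Format(time.RFC3339), err)
              } else {
                  resp.Body.Close()
                  fmt.Printf("%s: HTTP %d\n", time.Now().Format(time.RFC3339), resp.StatusCode)
              }
              time.Sleep(2 * time.Second)
          }
      }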

      QE notes:

      • To test this scenario, the minimum recommended number of Windows nodes is 3; only one Windows node should perform the upgrade at a time.
      • Expect a longer overall elapsed time for the upgrade, since the process is now serial.
