OCPBUGS-23020 (OpenShift Bugs)

WMCO upgrade strategy fails to upgrade one node at a time


Details

    • Important
    • No
    • 0
    • WINC - Sprint 244, WINC - Sprint 245, WINC - Sprint 246, WINC - Sprint 248, WINC - Sprint 249, WINC - Sprint 250
    • 6
    • Rejected
    • False
    • *Cause*: Machine and BYOH instance reconciliation events are not synchronized.
      *Consequence*: Machine and BYOH instances reconcile simultaneously, affecting running workloads.
      *Fix*: Introduce a locking mechanism so that only one node is allowed to upgrade at a time.
      *Result*: Machine and BYOH instances reconcile one at a time.
    • Bug Fix
    • In Progress

    Description

      This is a clone of issue OCPBUGS-23016. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-22984. The following is the description of the original issue:

      Description of problem:
      When WMCO is upgraded, a reconciliation workflow is triggered to ensure the existing Windows Nodes are up to date with the new version. Because the upgrade process makes Nodes unschedulable for a period, only one Node should be upgraded at a time, to keep Windows workloads as available as possible.

      There are two related issues occurring here:

      The first issue occurs when there are multiple Machine Nodes to upgrade. The WMCO Machine controller upgrades them sequentially; however, if an error occurs during the upgrade process, the upgrade of that Machine stops and the Machine is moved to the end of the queue. This can continue until all Machines are partially upgraded and unusable.

      The second issue is that if a cluster has BYOH nodes and Machine nodes, it is possible for a BYOH node and a Machine node to go through the upgrade process at the same time, as the two controllers run concurrently.
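The first issue (requeue on error) can be illustrated with a toy model. This is a minimal sketch in Python, not WMCO code (WMCO itself is a Go operator): the `simulate` function, its parameters, and the "fails once, then succeeds" behaviour are all assumptions made for illustration. It counts how many nodes are left partially upgraded at the same time when a failed node goes to the back of the queue, compared with retrying the in-progress node first.

```python
from collections import deque
from typing import List


def simulate(nodes: List[str], fail_first_attempt: bool, pin_in_progress: bool) -> int:
    """Return the peak number of nodes that are partially upgraded at once.

    Toy model (assumption): every upgrade fails on its first attempt when
    fail_first_attempt is True, and succeeds on the retry.
    """
    queue = deque(nodes)
    attempts = {name: 0 for name in nodes}
    partial = set()  # nodes whose upgrade has started but not finished
    peak = 0
    while queue:
        node = queue.popleft()
        attempts[node] += 1
        partial.add(node)  # upgrade started, node is unschedulable
        peak = max(peak, len(partial))
        if fail_first_attempt and attempts[node] == 1:
            if pin_in_progress:
                queue.appendleft(node)  # retry the same node before any other
            else:
                queue.append(node)  # reported behaviour: move to end of queue
            continue
        partial.discard(node)  # upgrade finished, node is usable again
    return peak
```

With three nodes and one transient error each, requeueing at the end leaves all three nodes partially upgraded at the peak, while retrying the in-progress node first keeps the peak at one.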

      Both of these issues are caused by WMCO not keeping track of when a Node is mid-upgrade.
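One way to track the mid-upgrade Node is a single gate shared by both controllers. The sketch below is written in Python for brevity and is not the actual fix; `UpgradeGate`, its methods, and the node names are hypothetical. The gate stays held by the same node across failed attempts, so a requeued node resumes its own upgrade instead of letting another node start.

```python
import threading
from typing import Optional


class UpgradeGate:
    """Records which node, if any, is currently mid-upgrade (hypothetical helper)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._current: Optional[str] = None

    def try_acquire(self, node: str) -> bool:
        """Return True if `node` may upgrade now. Re-acquiring by the holder succeeds."""
        with self._lock:
            if self._current is None or self._current == node:
                self._current = node
                return True
            return False

    def release(self, node: str) -> None:
        """Clear the marker, but only when called for the current holder."""
        with self._lock:
            if self._current == node:
                self._current = None

    def current(self) -> Optional[str]:
        """Return the node that holds the gate, or None."""
        with self._lock:
            return self._current
```

In this model, both the Machine controller and the BYOH controller would call `try_acquire` before starting an upgrade and requeue the node when it returns False; `release` is called only after the upgrade completes, not after an error.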

      Version-Release number of selected component (if applicable):
      OCP 4.12

      How reproducible:
      Always

      Steps to Reproduce:
      1. Install a previous version of WMCO
      2. Create a Windows MachineSet
      3. Add a BYOH Windows Node to the cluster
      4. Allow WMCO to configure both Windows machines as nodes
      5. Upgrade WMCO to the latest version

      Actual results:
      The Nodes are upgraded concurrently: multiple Nodes have their desired version annotation changed at the same time.

      Expected results:
      WMCO upgrades one node at a time.

       

      QE notes:

      • To test this scenario, the minimum recommended number of Windows nodes is 3; only one Windows node should perform the upgrade at a time.
      • Expect a longer overall elapsed time for the upgrade, since the process is now serial.
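The QE expectation above could be checked mechanically against periodic snapshots of the Windows nodes. The following is a minimal sketch, assuming a node counts as mid-upgrade while its current version differs from its desired version; the `Snapshot` shape, the function names, and the version strings are illustrative and not taken from WMCO.

```python
from typing import Dict, List, Tuple

# node name -> (current version, desired version); both fields are
# illustrative stand-ins for the node's version annotations.
Snapshot = Dict[str, Tuple[str, str]]


def nodes_mid_upgrade(snapshot: Snapshot) -> List[str]:
    """Return the nodes whose desired version differs from their current one."""
    return sorted(
        name for name, (current, desired) in snapshot.items() if current != desired
    )


def find_violations(snapshots: List[Snapshot]) -> List[Tuple[int, List[str]]]:
    """Return (snapshot index, nodes) for every snapshot with more than one node mid-upgrade."""
    violations = []
    for index, snapshot in enumerate(snapshots):
        upgrading = nodes_mid_upgrade(snapshot)
        if len(upgrading) > 1:
            violations.append((index, upgrading))
    return violations
```

An empty result from `find_violations` over the whole upgrade window is consistent with a serial upgrade; any entry names the nodes that were observed upgrading together.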

            People

              jvaldes@redhat.com Jose Valdes
              openshift-crt-jira-prow OpenShift Prow Bot
              Aharon Rasouli Aharon Rasouli
              Red Hat Employee