OpenShift Bugs / OCPBUGS-5803

Windows nodes do not get drained (deconfigure) during the upgrade process


    • Bug
    • Resolution: Done
    • Critical
    • 4.12.0
    • 4.12
    • Windows Containers
    • None
    • 0
    • WINC - Sprint 230
    • 1
    • Rejected
    • False

      This is a clone of issue OCPBUGS-5732. The following is the description of the original issue:

      Description of problem:

      
      During the validation of [OCPBUGS-4247|https://issues.redhat.com/browse/OCPBUGS-4247], I observed that all the workloads remained on the same Windows worker node during the whole upgrade process. However, an upgrade impacts the kubelet, so the workloads need to be moved away from that node before the node is reconfigured.
      This was confirmed by Mohammad:
      _I think I found the cause btw. For machine nodes, we don't try to find an instance associated with the machine being reconciled, instead just initializing the instanceInfo with a nil node. So when we check if an upgrade is required (i.e. should we deconfigure), we get false_
      
      This behavior was included as part of the bug [OCPBUGS-3506|https://issues.redhat.com/browse/OCPBUGS-3506], which got cherry-picked into 4.12 and 4.11 too; therefore this bug impacts all those versions.
      
      Adding WMCO logs as well as the traces, which confirm that none of the workloads are moved off the nodes to be reconfigured.
      
      

      Version-Release number of selected component (if applicable):

      
      

      How reproducible:

      
      

      Steps to Reproduce:

      1. Deploy an IPI cluster with Windows workers. Create some workloads for those Windows workers.
      2. Perform an upgrade or simply modify the version annotation of each of the worker nodes.
      3. Wait for WMCO to reconfigure (or upgrade) all the Windows workers. Keep track of where the workloads are landing; you can use the following snippet for it:
      
      # Poll once a minute for an hour, logging node status, pod placement and service reachability.
      lb=$(oc get svc -l app=win-webserver -n winc-test -o=jsonpath="{.items[0].status.loadBalancer.ingress[0].hostname}")
      file=/tmp/35707_AWS_412.log
      for i in {1..60}; do
        time=$(date)
        echo -e "\n#######ATTEMPT #${i} ${time}  ######" &>> "$file"
        oc get nodes -l=node.openshift.io/os_id="Windows" &>> "$file"
        oc get pods -n winc-test -o wide &>> "$file"
        curl --connect-timeout 60 "$lb" &>> "$file"
        sleep 60
      done
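
      Once the polling loop has finished, the resulting log can be summarized to see whether any Windows node was ever cordoned. The following is only a rough sketch: the function name drain_summary is hypothetical, and it relies on the assumption that a cordoned node shows "SchedulingDisabled" in the STATUS column of the oc get nodes output. Because the loop samples once per minute, a very short drain could still be missed.

```shell
#!/usr/bin/env bash
# Sketch: summarize a log written by the polling loop from step 3.
# Assumption: cordoned nodes appear as "Ready,SchedulingDisabled" in the
# `oc get nodes` output captured in the log.

# drain_summary FILE (hypothetical helper)
# Prints the number of polling attempts and the number of logged node lines
# that were cordoned at sampling time.
drain_summary() {
  local file=$1
  local attempts cordoned
  # Attempt headers start with a run of '#' followed by the attempt label.
  attempts=$(grep -c '^#######ATTEM' "$file" || true)
  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
  cordoned=$(grep -c 'SchedulingDisabled' "$file" || true)
  echo "attempts=${attempts} cordoned_node_lines=${cordoned}"
}
```

      For example, drain_summary /tmp/35707_AWS_412.log reporting cordoned_node_lines=0 would be consistent with the behavior described in this bug, although it is not proof on its own: the pod placement in the same log still needs to be checked.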
      
      

      Actual results:

      None of the Windows nodes get drained during the upgrade. Workloads remain on the same node that was reconfigured.
      

      Expected results:

      The Windows nodes get drained during the upgrade, right before WMCO reconfigures them.
      

      Additional info:

      
      

            mohashai Mohammad Shaikh
            openshift-crt-jira-prow OpenShift Prow Bot
            Aharon Rasouli Aharon Rasouli