OpenShift Bugs / OCPBUGS-36305

Windows Node does not become ready after a node drain and reboot


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • 4.18.z
    • 4.14
    • Windows Containers
    • None
    • Quality / Stability / Reliability
    • False
    • 3
    • None
    • No
    • None
    • WINC - Sprint 257, WINC - Sprint 258, WINC - Sprint 259
    • 3

      Description of problem:

      This issue was found while testing node drain followed by a reboot. After the drain and reboot, the Windows node did not return to the Ready state because the kubelet failed to start. The Windows Instance Config Daemon (WICD) error is shared at the end of this description.

      OCP 4.14 AWS IPI WMCO 9.0.2
      Node created via machinesets
      Windows 2022 build: 10.0.20348.2527
      AWS Region: eu-west-2 (just in case)

      Sequence of events

      1. Two Windows Server 2022 nodes, with one pod running on node ip-10-0-93-39.eu-west-2.compute.internal

       

      [azure@ceph1 openshift-4.14]$ oc get nodes -o wide |grep 2022
      ip-10-0-89-241.eu-west-2.compute.internal   Ready    worker   112s   v1.27.12+7bee54d   10.0.89.241   <none>  Windows Server 2022 Datacenter    10.0.20348.2527  containerd://1.7.6
      ip-10-0-93-39.eu-west-2.compute.internal    Ready    worker   28m    v1.27.12+7bee54d   10.0.93.39    <none>  Windows Server 2022 Datacenter    10.0.20348.2527  containerd://1.7.6
      [azure@ceph1 openshift-4.14]$ oc get pods -o wide
      NAME                                    READY   STATUS    RESTARTS   AGE   IP           NODE                                       
      win-webserverlog-2022-db4748d47-m7z5k   1/1     Running   0          32m   10.132.3.2   ip-10-0-93-39.eu-west-2.compute.internal   
      

       

      2. Draining the node ip-10-0-93-39.eu-west-2.compute.internal where the pod is running

      [azure@ceph1 openshift-4.14]$ oc adm drain ip-10-0-93-39.eu-west-2.compute.internal
      node/ip-10-0-93-39.eu-west-2.compute.internal cordoned
      evicting pod default/win-webserverlog-2022-db4748d47-m7z5k
      pod/win-webserverlog-2022-db4748d47-m7z5k evicted
      node/ip-10-0-93-39.eu-west-2.compute.internal drained

      3. The pod is rescheduled on the other node, and node ip-10-0-93-39.eu-west-2.compute.internal is marked Ready,SchedulingDisabled as expected

       

      [azure@ceph1 openshift-4.14]$ oc get nodes -o wide |grep 2022
      ip-10-0-89-241.eu-west-2.compute.internal   Ready                      worker  5m52s   v1.27.12+7bee54d   10.0.89.241 Windows Server 2022 Datacenter 10.0.20348.2527 containerd://1.7.6
      ip-10-0-93-39.eu-west-2.compute.internal    Ready,SchedulingDisabled   worker  32m     v1.27.12+7bee54d   10.0.93.39  Windows Server 2022 Datacenter 10.0.20348.2527 containerd://1.7.6
      [azure@ceph1 openshift-4.14]$ oc get pods -o wide
      NAME                                    READY   STATUS    RESTARTS   AGE   IP           NODE                                        NOMINATED NODE   READINESS GATES
      win-webserverlog-2022-db4748d47-24s5q   1/1     Running   0          19m   10.132.4.2   ip-10-0-89-241.eu-west-2.compute.internal   <none>           <none>
      

       

      4. The Windows node ip-10-0-93-39.eu-west-2.compute.internal is rebooted using shutdown /r

      sh-5.1# date
      Fri Jun 28 05:31:14 UTC 2024
      --
      --
      administrator@EC2AMAZ-7IEUS97 C:\Users\Administrator>shutdown /r

      5. After the reboot the node is up, but even after more than an hour it does not return to Ready,SchedulingDisabled

      sh-5.1# date
      Fri Jun 28 06:43:15 UTC 2024
      [azure@ceph1 ~]$ oc get nodes -o wide |grep 2022
      ip-10-0-93-39.eu-west-2.compute.internal    NotReady,SchedulingDisabled   worker  125m   v1.27.12+7bee54d   10.0.93.39 Windows Server 2022 Datacenter  10.0.20348.2527                containerd://Unknown

      Uncordoning the node makes no difference:

      ip-10-0-93-39.eu-west-2.compute.internal    NotReady   worker  126m   v1.27.12+7bee54d   10.0.93.39 Windows Server 2022 Datacenter  10.0.20348.2527    containerd://Unknown

      6. The WICD log on the node shows the following error message repeating in a loop

       

      > controller="node" controllerGroup="" controllerKind="Node" Node="ip-10-0-93-39.eu-west-2.compute.internal" namespace="" name="ip-10-0-93-39.eu-west-2.compute.internal" reconcileID="c2794713-c8b6-44f2-89d2-bd8384163f23"
      E0628 06:44:53.663316    2124 controller.go:324] "Reconciler error" err=<
              could not resolve PowerShell variable HOSTNAME_OVERRIDE: error running command with output Invoke-RestMethod : Unable to connect to the remote server
              At line:1 char:1
              + Invoke-RestMethod -UseBasicParsing -Uri http://169.254.169.254/latest ...
              + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                  + CategoryInfo          : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-RestMethod], WebExc
                 eption
                  + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeRestMethodCommand
              : exit status 1
       > controller="node" controllerGroup="" controllerKind="Node" Node="ip-10-0-93-39.eu-west-2.compute.internal" namespace="" name="ip-10-0-93-39.eu-west-2.compute.internal" reconcileID="73331b93-c171-4a47-b71f-ee45466087d9"
      

       

      Manually starting the services (kubelet, kube-proxy, etc.) brings the node back to the Ready state.
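To tell a transient post-boot failure from a persistent one, a bounded probe of the metadata endpoint can help. The following is a minimal bash sketch, not WICD's own logic: `probe_until_ok` and `METADATA_URL` are hypothetical names, and the full metadata path is truncated in the log above, so it is left as a placeholder. On the Windows node itself the equivalent would be a PowerShell loop around the same `Invoke-RestMethod` call.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of WMCO/WICD).
# probe_until_ok ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to ATTEMPTS times, sleeping DELAY_SECONDS between failures.
# Returns 0 on the first success, 1 if every attempt fails.
probe_until_ok() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      echo "succeeded on attempt ${i}" >&2
      return 0
    fi
    echo "attempt ${i}/${attempts} failed" >&2
    # Do not sleep after the final failed attempt.
    if ((i < attempts)); then
      sleep "${delay}"
    fi
  done
  return 1
}

# Usage: METADATA_URL is a placeholder; the full path is truncated in the log.
#   probe_until_ok 5 10 curl -fsS --max-time 5 "${METADATA_URL}"
```

In this report the error was still looping more than an hour after the reboot, so a probe like this would be expected to exhaust its attempts, which points to a persistent connectivity problem rather than a slow start.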

      Version-Release number of selected component (if applicable):

          OCP 4.14/WMCO 9.0.2/AWS

      How reproducible:

          Always

      Steps to Reproduce:

          1. Drain a Windows node and reboot it
          2. Check whether the node returns to the Ready state
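The readiness check in step 2 can be scripted. This is a minimal sketch, assuming `oc` is already logged in to the cluster; `wait_for_node_ready` is a hypothetical helper name and the default timeout is arbitrary. The reboot itself (`shutdown /r` on the Windows node) is still done by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the name and default timeout are assumptions.
# wait_for_node_ready NODE [TIMEOUT]
# Returns 0 if the node reports condition Ready within TIMEOUT,
# non-zero otherwise (oc wait exits non-zero on timeout).
wait_for_node_ready() {
  local node="$1" timeout="${2:-15m}"
  oc wait --for=condition=Ready "node/${node}" --timeout="${timeout}"
}

# Usage, following the sequence in this report:
#   oc adm drain ip-10-0-93-39.eu-west-2.compute.internal
#   (reboot the Windows node with `shutdown /r`)
#   wait_for_node_ready ip-10-0-93-39.eu-west-2.compute.internal 30m \
#     || echo "node did not return to Ready"
```

With the behaviour described here, the wait would time out, since the node stays NotReady until the services are started manually.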
          

      Actual results:

          The Windows services are not started, so the node does not return to the Ready state

      Expected results:

          The node should rejoin the cluster without any manual steps

      Additional info:

          

              jvaldes@redhat.com Jose Valdes
              rhn-support-rrajaram Ranjith Rajaram
              None
              None
              Aharon Rasouli Aharon Rasouli
              None