Bug
Resolution: Done
Major
4.14
None
Description of problem:
Ran into this issue while testing a node drain followed by a reboot. After the drain and reboot, the Windows node did not return to the Ready state. The Windows instance config daemon (WICD) error is shared at the end; the daemon fails to start kubelet.
OCP 4.14, AWS IPI, WMCO 9.0.2
Node created via machinesets
Windows 2022 build: 10.0.20348.2527
AWS Region: eu-west-2 (just in case)
Sequence of events
1. Two Windows 2022 nodes, with one pod running on the node ip-10-0-93-39.eu-west-2.compute.internal
[azure@ceph1 openshift-4.14]$ oc get nodes -o wide |grep 2022
ip-10-0-89-241.eu-west-2.compute.internal Ready worker 112s v1.27.12+7bee54d 10.0.89.241 <none> Windows Server 2022 Datacenter 10.0.20348.2527 containerd://1.7.6
ip-10-0-93-39.eu-west-2.compute.internal Ready worker 28m v1.27.12+7bee54d 10.0.93.39 <none> Windows Server 2022 Datacenter 10.0.20348.2527 containerd://1.7.6
[azure@ceph1 openshift-4.14]$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
win-webserverlog-2022-db4748d47-m7z5k 1/1 Running 0 32m 10.132.3.2 ip-10-0-93-39.eu-west-2.compute.internal
2. Draining the node ip-10-0-93-39.eu-west-2.compute.internal where the pod is running
[azure@ceph1 openshift-4.14]$ oc adm drain ip-10-0-93-39.eu-west-2.compute.internal
node/ip-10-0-93-39.eu-west-2.compute.internal cordoned
evicting pod default/win-webserverlog-2022-db4748d47-m7z5k
pod/win-webserverlog-2022-db4748d47-m7z5k evicted
node/ip-10-0-93-39.eu-west-2.compute.internal drained
3. The pod is rescheduled on the other node, and the node ip-10-0-93-39.eu-west-2.compute.internal is marked as Ready,SchedulingDisabled, as expected
[azure@ceph1 openshift-4.14]$ oc get nodes -o wide |grep 2022
ip-10-0-89-241.eu-west-2.compute.internal Ready worker 5m52s v1.27.12+7bee54d 10.0.89.241 Windows Server 2022 Datacenter 10.0.20348.2527 containerd://1.7.6
ip-10-0-93-39.eu-west-2.compute.internal Ready,SchedulingDisabled worker 32m v1.27.12+7bee54d 10.0.93.39 Windows Server 2022 Datacenter 10.0.20348.2527 containerd://1.7.6
[azure@ceph1 openshift-4.14]$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
win-webserverlog-2022-db4748d47-24s5q 1/1 Running 0 19m 10.132.4.2 ip-10-0-89-241.eu-west-2.compute.internal <none> <none>
4. Windows node ip-10-0-93-39.eu-west-2.compute.internal is rebooted using shutdown /r
sh-5.1# date
Fri Jun 28 05:31:14 UTC 2024
administrator@EC2AMAZ-7IEUS97 C:\Users\Administrator>shutdown /r
5. After the reboot the node is up, but even after more than an hour it does not return to Ready,SchedulingDisabled
sh-5.1# date
Fri Jun 28 06:43:15 UTC 2024
[azure@ceph1 ~]$ oc get nodes -o wide |grep 2022
ip-10-0-93-39.eu-west-2.compute.internal NotReady,SchedulingDisabled worker 125m v1.27.12+7bee54d 10.0.93.39 Windows Server 2022 Datacenter 10.0.20348.2527 containerd://Unknown
Uncordoning the node does not make any difference
ip-10-0-93-39.eu-west-2.compute.internal NotReady worker 126m v1.27.12+7bee54d 10.0.93.39 Windows Server 2022 Datacenter 10.0.20348.2527 containerd://Unknown
6. In the WICD log on the node, the following error message repeats in a loop
> controller="node" controllerGroup="" controllerKind="Node" Node="ip-10-0-93-39.eu-west-2.compute.internal" namespace="" name="ip-10-0-93-39.eu-west-2.compute.internal" reconcileID="c2794713-c8b6-44f2-89d2-bd8384163f23"
E0628 06:44:53.663316 2124 controller.go:324] "Reconciler error" err=<
could not resolve PowerShell variable HOSTNAME_OVERRIDE: error running command with output Invoke-RestMethod : Unable to connect to the remote server
At line:1 char:1
+ Invoke-RestMethod -UseBasicParsing -Uri http://169.254.169.254/latest ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-RestMethod], WebException
+ FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeRestMethodCommand
: exit status 1
> controller="node" controllerGroup="" controllerKind="Node" Node="ip-10-0-93-39.eu-west-2.compute.internal" namespace="" name="ip-10-0-93-39.eu-west-2.compute.internal" reconcileID="73331b93-c171-4a47-b71f-ee45466087d9"
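To tell whether a node is stuck in this particular loop rather than failing for some other reason, the repeated reconciler error can be counted in a saved copy of the WICD log. The following is only a minimal shell sketch: the function name count_hostname_override_errors and the file name wicd.log are hypothetical, and the match string is taken from the error message above.

```shell
#!/bin/sh
# Count log lines that report the HOSTNAME_OVERRIDE resolution failure.
# Reads the log text from stdin. The function name is illustrative only.
count_hostname_override_errors() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the function's status at success.
    grep -c 'could not resolve PowerShell variable HOSTNAME_OVERRIDE' || true
}
```

Assuming the log was copied off the node to wicd.log, it could be used as: count_hostname_override_errors < wicd.log. A count that keeps growing between two copies of the log would suggest the reconcile loop is still failing at the same step.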
We can manually start the services such as kubelet and kube-proxy, after which the node gets to the Ready state
Version-Release number of selected component (if applicable):
OCP 4.14/WMCO 9.0.2/AWS
How reproducible:
Always
Steps to Reproduce:
1. Drain a Windows node and reboot it
2. Check whether the node returns to the Ready state
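The check in the last step can be scripted so the wait is repeatable. This is a rough sketch rather than a tested reproducer: it assumes oc is already logged in to the cluster, and the function names status_is_ready and wait_for_node_ready, the 15-second poll interval, and the default timeout are all made up for illustration.

```shell
#!/bin/sh
# Return success (0) when an "oc get nodes" STATUS value means the node is Ready.
# STATUS may carry extra flags, for example "Ready,SchedulingDisabled".
status_is_ready() {
    case "$1" in
        Ready|Ready,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Poll a node until it reports Ready or the timeout (in seconds) expires.
# Assumes "oc" is on PATH and logged in; names and intervals are illustrative.
wait_for_node_ready() {
    node=$1
    timeout=${2:-600}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # Second column of "oc get node" output is STATUS.
        status=$(oc get node "$node" --no-headers | awk '{print $2}')
        if status_is_ready "$status"; then
            return 0
        fi
        sleep 15
        elapsed=$((elapsed + 15))
    done
    return 1
}
```

After draining and rebooting the node, something like wait_for_node_ready ip-10-0-93-39.eu-west-2.compute.internal 3600 would return non-zero in the failure case described here, because the status stays at NotReady,SchedulingDisabled.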
Actual results:
The Windows services are not started after the reboot, so the node does not return to the Ready state
Expected results:
The node should rejoin the cluster without any manual steps
Additional info: