Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: 4.18.z
Affects Version/s: 4.14
Component/s: Windows Containers
Labels:
None

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
3
Severity:
None
Regression:
No

Target Backport Versions:

4.17.0
Target Version:

4.17.z
Release Blocker:
None
Sprint:
WINC - Sprint 257, WINC - Sprint 258, WINC - Sprint 259
sprint_count:
3

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

Hit this issue while testing node drain followed by a reboot. After the drain and reboot, the Windows node did not return to the Ready state: the Windows Instance Config Daemon (WICD) fails to start kubelet. The WICD error is included at the end of this description.

OCP 4.14, AWS IPI, WMCO 9.0.2
Nodes created via MachineSets
Windows 2022 build: 10.0.20348.2527
AWS Region: eu-west-2 (just in case)

Sequence of events

1. Two Windows 2022 nodes, with one pod running on node ip-10-0-93-39.eu-west-2.compute.internal

[azure@ceph1 openshift-4.14]$ oc get nodes -o wide |grep 2022
ip-10-0-89-241.eu-west-2.compute.internal   Ready    worker   112s   v1.27.12+7bee54d   10.0.89.241   <none>  Windows Server 2022 Datacenter    10.0.20348.2527  containerd://1.7.6
ip-10-0-93-39.eu-west-2.compute.internal    Ready    worker   28m    v1.27.12+7bee54d   10.0.93.39    <none>  Windows Server 2022 Datacenter    10.0.20348.2527  containerd://1.7.6
[azure@ceph1 openshift-4.14]$ oc get pods -o wide
NAME                                    READY   STATUS    RESTARTS   AGE   IP           NODE                                       
win-webserverlog-2022-db4748d47-m7z5k   1/1     Running   0          32m   10.132.3.2   ip-10-0-93-39.eu-west-2.compute.internal

2. Draining the node ip-10-0-93-39.eu-west-2.compute.internal where the pod is running

[azure@ceph1 openshift-4.14]$ oc adm drain ip-10-0-93-39.eu-west-2.compute.internal
node/ip-10-0-93-39.eu-west-2.compute.internal cordoned
evicting pod default/win-webserverlog-2022-db4748d47-m7z5k
pod/win-webserverlog-2022-db4748d47-m7z5k evicted
node/ip-10-0-93-39.eu-west-2.compute.internal drained

3. The pod is rescheduled on the other node, and node ip-10-0-93-39.eu-west-2.compute.internal is marked Ready,SchedulingDisabled as expected

[azure@ceph1 openshift-4.14]$ oc get nodes -o wide |grep 2022
ip-10-0-89-241.eu-west-2.compute.internal   Ready                      worker  5m52s   v1.27.12+7bee54d   10.0.89.241 Windows Server 2022 Datacenter 10.0.20348.2527 containerd://1.7.6
ip-10-0-93-39.eu-west-2.compute.internal    Ready,SchedulingDisabled   worker  32m     v1.27.12+7bee54d   10.0.93.39  Windows Server 2022 Datacenter 10.0.20348.2527 containerd://1.7.6
[azure@ceph1 openshift-4.14]$ oc get pods -o wide
NAME                                    READY   STATUS    RESTARTS   AGE   IP           NODE                                        NOMINATED NODE   READINESS GATES
win-webserverlog-2022-db4748d47-24s5q   1/1     Running   0          19m   10.132.4.2   ip-10-0-89-241.eu-west-2.compute.internal   <none>           <none>

3. Windows node ip-10-0-93-39.eu-west-2.compute.internal is rebooted using shutdown /r

sh-5.1# date
Fri Jun 28 05:31:14 UTC 2024
--
--
administrator@EC2AMAZ-7IEUS97 C:\Users\Administrator>shutdown /r

4. After the reboot the node is up, but even after more than 1 hour it does not return to Ready,SchedulingDisabled

sh-5.1# date
Fri Jun 28 06:43:15 UTC 2024

[azure@ceph1 ~]$ oc get nodes -o wide |grep 2022
ip-10-0-93-39.eu-west-2.compute.internal    NotReady,SchedulingDisabled   worker  125m   v1.27.12+7bee54d   10.0.93.39 Windows Server 2022 Datacenter  10.0.20348.2527                containerd://Unknown

Uncordoning the node does not make any difference:

ip-10-0-93-39.eu-west-2.compute.internal    NotReady   worker  126m   v1.27.12+7bee54d   10.0.93.39 Windows Server 2022 Datacenter  10.0.20348.2527    containerd://Unknown
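
To spot the affected node without reading the table by eye, the status column can be filtered. This is a minimal sketch, assuming the default `oc get nodes` column layout shown above (NAME first, STATUS second); the function name `not_ready_nodes` is hypothetical and not part of `oc`.

```shell
# Print the names of nodes whose STATUS column includes "NotReady".
# Reads `oc get nodes` style output on stdin (NAME in column 1, STATUS in column 2).
not_ready_nodes() {
    awk '$1 != "NAME" {
        # STATUS can be a comma-separated list, e.g. NotReady,SchedulingDisabled
        n = split($2, parts, ",")
        for (i = 1; i <= n; i++) {
            if (parts[i] == "NotReady") { print $1; break }
        }
    }'
}
```

Usage: `oc get nodes -o wide | grep 2022 | not_ready_nodes`. Matching on the exact token avoids treating "Ready" as a substring hit of "NotReady".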

5. In the WICD log on the node, the following error message repeats in a loop

> controller="node" controllerGroup="" controllerKind="Node" Node="ip-10-0-93-39.eu-west-2.compute.internal" namespace="" name="ip-10-0-93-39.eu-west-2.compute.internal" reconcileID="c2794713-c8b6-44f2-89d2-bd8384163f23"
E0628 06:44:53.663316    2124 controller.go:324] "Reconciler error" err=<
        could not resolve PowerShell variable HOSTNAME_OVERRIDE: error running command with output Invoke-RestMethod : Unable to connect to the remote server
        At line:1 char:1
        + Invoke-RestMethod -UseBasicParsing -Uri http://169.254.169.254/latest ...
        + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            + CategoryInfo          : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-RestMethod], WebExc
           eption
            + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeRestMethodCommand
        : exit status 1
 > controller="node" controllerGroup="" controllerKind="Node" Node="ip-10-0-93-39.eu-west-2.compute.internal" namespace="" name="ip-10-0-93-39.eu-west-2.compute.internal" reconcileID="73331b93-c171-4a47-b71f-ee45466087d9"
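
To confirm that a node is hitting this specific failure rather than another reconcile error, the WICD log can be scanned for the message above. A minimal sketch, assuming the log text has already been collected from the node and is piped in on stdin; the function name `count_hostname_override_errors` is hypothetical.

```shell
# Count log lines reporting the failed HOSTNAME_OVERRIDE lookup.
# Reads WICD log text on stdin and prints the number of matching lines.
count_hostname_override_errors() {
    # grep -c exits non-zero when the count is 0; "|| true" keeps the exit status clean.
    grep -c 'could not resolve PowerShell variable HOSTNAME_OVERRIDE' || true
}
```

A count that keeps growing between two samples of the same log suggests the reconcile loop is still failing on the metadata request to 169.254.169.254.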

If we manually start the services (kubelet, kube-proxy, etc.), the node returns to the Ready state

Version-Release number of selected component (if applicable):

    OCP 4.14/WMCO 9.0.2/AWS

How reproducible:

    Always

Steps to Reproduce:

    1. Drain a Windows node and reboot it
    2. Check whether the node returns to the Ready state
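
The reproduction steps can be sketched as two small helpers around the `oc` commands. This is an illustration under stated assumptions, not the procedure used in the report: the function names and the 60m default timeout are hypothetical, and the reboot itself (`shutdown /r` on the Windows instance) still happens manually between the two calls.

```shell
# Step 1: cordon and drain the node, as done in the report.
# $1 = node name
drain_node() {
    oc adm drain "$1"
}

# Step 2: after the manual reboot, block until the node reports Ready
# or the timeout expires (non-zero exit on timeout).
# $1 = node name, $2 = timeout (assumed default: 60m)
wait_for_ready() {
    oc wait --for=condition=Ready "node/$1" --timeout="${2:-60m}"
}
```

Usage: run `drain_node ip-10-0-93-39.eu-west-2.compute.internal`, reboot the instance, then run `wait_for_ready ip-10-0-93-39.eu-west-2.compute.internal`. With this bug present, the second call is expected to time out.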

Actual results:

    The Windows services are not started, so the node does not return to the Ready state

Expected results:

    The node should rejoin the cluster without any manual steps

Additional info:

links to

openshift/windows-machine-config-operator#2291: OCPBUGS-36305: Document grateful Node reboot and add workaround for issue in AWS

openshift/windows-machine-config-operator#2433: OCPBUGS-36305: Remove HNS networks before rebooting in AWS

Assignee: J V

Reporter: Ranjith Rajaram

Need Info From: None

Contributors: None

QA Contact: Aharon Rasouli

Doc Contact: None

Votes: 0

Watchers: 7

Created: 2024/06/28 6:53 AM

Updated: 2025/07/22 11:31 AM

Resolved: 2025/03/19 2:41 PM
