-
Bug
-
Resolution: Done
-
Critical
-
4.12.z, 4.12
-
None
-
No
-
3
-
WINC - Sprint 237, WINC - Sprint 238
-
2
-
False
-
Description of problem:
When an OCP cluster with Windows BYOH nodes is on the 4.12.0 GA release, the seamless upgrade from WMCO 7.0.1 to WMCO 7.1.0 does not work. It hangs while trying to upgrade the BYOH nodes, and WMCO logs the following errors:

{"level":"info","ts":"2023-05-30T10:49:09Z","logger":"nc 10.0.128.6","msg":"Unable to mark node as NotReady","error":"error running powershell.exe -NonInteractive -ExecutionPolicy Bypass \"C:\\k\\windows-instance-config-daemon.exe cleanup --api-server https://api-int.jfrancoa-3005.qe.gcp.devcluster.openshift.com:6443 --sa-ca C:\\k\\sa-ca.crt --sa-token C:\\k\\sa-token --namespace openshift-windows-machine-config-operator\": Process exited with status 1"}
{"level":"error","ts":"2023-05-30T10:49:09Z","msg":"Reconciler error","controller":"configmap","controllerGroup":"","controllerKind":"ConfigMap","configMap":{"name":"windows-instances","namespace":"openshift-windows-machine-config-operator"},"namespace":"openshift-windows-machine-config-operator","name":"windows-instances","reconcileID":"08536e66-5946-49ce-b225-fd5887089d63","error":"error configuring host with address 10.0.128.6: error waiting for proper windowsmachineconfig.openshift.io/version annotation for node byoh-winc-1.c.openshift-qe.internal: timeout waiting for windowsmachineconfig.openshift.io/version and windowsmachineconfig.openshift.io/desired-version annotations to match on node byoh-winc-1.c.openshift-qe.internal: timed out waiting for the condition","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}

The BYOH nodes get stuck in NotReady,SchedulingDisabled and the upgrade does not move on:

$ oc get nodes -o wide
NAME                                                        STATUS                       ROLES                 AGE    VERSION          INTERNAL-IP  EXTERNAL-IP  OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
byoh-winc-1.c.openshift-qe.internal                         NotReady,SchedulingDisabled  worker                82m    v1.25.8+37a9a08  10.0.128.6   <none>       Windows Server 2022 Datacenter                                 10.0.20348.1726               containerd://1.6.19-4-geab1c5444
jfrancoa-3005-dvj5w-master-0.c.openshift-qe.internal        Ready                        control-plane,master  5h21m  v1.25.4+77bec7a  10.0.0.5     <none>       Red Hat Enterprise Linux CoreOS 412.86.202301061548-0 (Ootpa)  4.18.0-372.40.1.el8_6.x86_64  cri-o://1.25.1-5.rhaos4.12.git6005903.el8
jfrancoa-3005-dvj5w-master-1.c.openshift-qe.internal        Ready                        control-plane,master  5h20m  v1.25.4+77bec7a  10.0.0.3     <none>       Red Hat Enterprise Linux CoreOS 412.86.202301061548-0 (Ootpa)  4.18.0-372.40.1.el8_6.x86_64  cri-o://1.25.1-5.rhaos4.12.git6005903.el8
jfrancoa-3005-dvj5w-master-2.c.openshift-qe.internal        Ready                        control-plane,master  5h20m  v1.25.4+77bec7a  10.0.0.4     <none>       Red Hat Enterprise Linux CoreOS 412.86.202301061548-0 (Ootpa)  4.18.0-372.40.1.el8_6.x86_64  cri-o://1.25.1-5.rhaos4.12.git6005903.el8
jfrancoa-3005-dvj5w-windows-worker-a-7qjgp                  Ready                        worker                80m    v1.25.8+37a9a08  10.0.128.8   <none>       Windows Server 2022 Datacenter                                 10.0.20348.1726               containerd://1.6.19-4-geab1c5444
jfrancoa-3005-dvj5w-windows-worker-a-s9n4d                  Ready                        worker                71m    v1.25.8+37a9a08  10.0.128.9   <none>       Windows Server 2022 Datacenter                                 10.0.20348.1726               containerd://1.6.19-4-geab1c5444
jfrancoa-3005-dvj5w-worker-a-f9bwh.c.openshift-qe.internal  Ready                        worker                5h10m  v1.25.4+77bec7a  10.0.128.2   <none>       Red Hat Enterprise Linux CoreOS 412.86.202301061548-0 (Ootpa)  4.18.0-372.40.1.el8_6.x86_64  cri-o://1.25.1-5.rhaos4.12.git6005903.el8
jfrancoa-3005-dvj5w-worker-b-wf59d.c.openshift-qe.internal  Ready                        worker                5h10m  v1.25.4+77bec7a  10.0.128.3   <none>       Red Hat Enterprise Linux CoreOS 412.86.202301061548-0 (Ootpa)  4.18.0-372.40.1.el8_6.x86_64  cri-o://1.25.1-5.rhaos4.12.git6005903.el8

When the OCP payload is upgraded to 4.12.18 or 4.12.19, the error disappears and the upgrade succeeds (whether the payload upgrade is done before the WMCO upgrade or while the cluster is already hanging in this state).
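For anyone triaging the same hang, the annotation mismatch named in the reconciler error can be checked directly on the node. The following is a minimal bash sketch, not part of WMCO: the function names `wmco_annotation` and `check_wmco_versions` are hypothetical, and only the two annotation keys are taken from the error message above. It assumes a logged-in `oc` client with permission to read nodes.

```shell
# Sketch only: helper names are hypothetical; the annotation keys come from
# the WMCO error message. Requires bash (uses `local`) and a logged-in `oc`.

# Print the value of windowsmachineconfig.openshift.io/<suffix> on a node.
# $1 = node name, $2 = annotation suffix ("version" or "desired-version")
wmco_annotation() {
  # Dots inside the annotation key are escaped for the jsonpath parser.
  oc get node "$1" -o "jsonpath={.metadata.annotations.windowsmachineconfig\.openshift\.io/$2}"
}

# Compare the two annotations WMCO waits on; return non-zero on mismatch.
check_wmco_versions() {
  local node="$1" current desired
  current="$(wmco_annotation "$node" version)"
  desired="$(wmco_annotation "$node" desired-version)"
  if [ -n "$current" ] && [ "$current" = "$desired" ]; then
    echo "match: $current"
  else
    echo "mismatch: version='$current' desired-version='$desired'"
    return 1
  fi
}
```

Usage on the affected node would look like `check_wmco_versions byoh-winc-1.c.openshift-qe.internal`. A persistent mismatch is consistent with the timeout reported by the reconciler, but it does not by itself explain why the cleanup command on the node exits with status 1.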
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0    True        False         4h54m   Cluster version is 4.12.0

$ oc get cm -n openshift-windows-machine-config-operator
NAME                                   DATA   AGE
kube-root-ca.crt                       1      4h55m
openshift-service-ca.crt               1      4h55m
windows-instances                      2      122m
windows-machine-config-operator-lock   0      88m
windows-services-7.1.0-3eaf2b6         2      88m

Windows nodes kernel version:
byoh-winc-0.c.openshift-qe.internal                          10.0.20348.1726
byoh-winc-1.c.openshift-qe.internal                          10.0.20348.1726
jfrancoa-3005-dvj5w-windows-worker-a-7qjgp                   10.0.20348.1726
jfrancoa-3005-dvj5w-worker-a-f9bwh.c.openshift-qe.internal   10.0.20348.1726
How reproducible:
Always. It was reproduced on Platform:None and GCP.
Steps to Reproduce:
1. Deploy an OCP 4.12.0 (https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4-stable/release/4.12.0) cluster. Install WMCO 7.0.1 and create a BYOH Windows node.
2. Upgrade WMCO from 7.0.1 to WMCO 7.1.0.
3. Wait for the upgrade to finish.
Actual results:
The upgrade doesn't succeed, leaving the BYOH nodes in NotReady,SchedulingDisabled. No workloads can run on the BYOH nodes.
Expected results:
Upgrade to WMCO 7.1.0 succeeds
Additional info: