Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: 4.12.z
Affects Version/s: 4.12.z, 4.12
Component/s: Windows Containers
Labels:
None

Regression:
No
Story Points:
3
Sprint:
WINC - Sprint 237, WINC - Sprint 238
sprint_count:
2
Blocked:
False
Blocked Reason:

None
Target Version:

4.12.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

On an OCP cluster running the 4.12.0 GA release with Windows BYOH nodes, the seamless upgrade from WMCO 7.0.1 to WMCO 7.1.0 does not work. The upgrade hangs while WMCO tries to upgrade the BYOH nodes, and WMCO logs the following errors:
{"level":"info","ts":"2023-05-30T10:49:09Z","logger":"nc 10.0.128.6","msg":"Unable to mark node as NotReady","error":"error running powershell.exe -NonInteractive -ExecutionPolicy Bypass \"C:\\k\\windows-instance-config-daemon.exe cleanup --api-server https://api-int.jfrancoa-3005.qe.gcp.devcluster.openshift.com:6443 --sa-ca C:\\k\\sa-ca.crt --sa-token C:\\k\\sa-token --namespace openshift-windows-machine-config-operator\": Process exited with status 1"}
{"level":"error","ts":"2023-05-30T10:49:09Z","msg":"Reconciler error","controller":"configmap","controllerGroup":"","controllerKind":"ConfigMap","configMap":{"name":"windows-instances","namespace":"openshift-windows-machine-config-operator"},"namespace":"openshift-windows-machine-config-operator","name":"windows-instances","reconcileID":"08536e66-5946-49ce-b225-fd5887089d63","error":"error configuring host with address 10.0.128.6: error waiting for proper windowsmachineconfig.openshift.io/version annotation for node byoh-winc-1.c.openshift-qe.internal: timeout waiting for windowsmachineconfig.openshift.io/version and windowsmachineconfig.openshift.io/desired-version annotations to match on node byoh-winc-1.c.openshift-qe.internal: timed out waiting for the condition","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}
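When triaging logs like the two entries above, it can help to reduce each JSON line to its level, message, and the outermost and innermost parts of the wrapped error chain. The following is a minimal sketch, not part of WMCO; the helper name `summarize_wmco_log` is hypothetical and the sample line is an abbreviated version of the error entry above.

```python
import json


def summarize_wmco_log(line):
    """Summarize one JSON-formatted operator log line (hypothetical helper)."""
    entry = json.loads(line)
    error = entry.get("error", "")
    return {
        "level": entry.get("level"),
        "msg": entry.get("msg"),
        # Wrapped errors are chained with ": "; the first part is the
        # outermost context and the last part is the root cause.
        "error_head": error.split(": ", 1)[0],
        "error_tail": error.rsplit(": ", 1)[-1],
    }


# Abbreviated sample based on the "Reconciler error" entry in this report.
sample = (
    '{"level":"error","ts":"2023-05-30T10:49:09Z","msg":"Reconciler error",'
    '"error":"error configuring host with address 10.0.128.6: '
    'timed out waiting for the condition"}'
)
summary = summarize_wmco_log(sample)
print(summary["error_head"])
print(summary["error_tail"])
```

For the sample, the head identifies the affected host (10.0.128.6) and the tail shows that the reconcile ended in a timeout.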

The BYOH nodes get stuck in NotReady,SchedulingDisabled and the upgrade does not progress:

$ oc get nodes -o wide
NAME                                                         STATUS                        ROLES                  AGE     VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
byoh-winc-1.c.openshift-qe.internal                          NotReady,SchedulingDisabled   worker                 82m     v1.25.8+37a9a08   10.0.128.6    <none>        Windows Server 2022 Datacenter                                  10.0.20348.1726                containerd://1.6.19-4-geab1c5444
jfrancoa-3005-dvj5w-master-0.c.openshift-qe.internal         Ready                         control-plane,master   5h21m   v1.25.4+77bec7a   10.0.0.5      <none>        Red Hat Enterprise Linux CoreOS 412.86.202301061548-0 (Ootpa)   4.18.0-372.40.1.el8_6.x86_64   cri-o://1.25.1-5.rhaos4.12.git6005903.el8
jfrancoa-3005-dvj5w-master-1.c.openshift-qe.internal         Ready                         control-plane,master   5h20m   v1.25.4+77bec7a   10.0.0.3      <none>        Red Hat Enterprise Linux CoreOS 412.86.202301061548-0 (Ootpa)   4.18.0-372.40.1.el8_6.x86_64   cri-o://1.25.1-5.rhaos4.12.git6005903.el8
jfrancoa-3005-dvj5w-master-2.c.openshift-qe.internal         Ready                         control-plane,master   5h20m   v1.25.4+77bec7a   10.0.0.4      <none>        Red Hat Enterprise Linux CoreOS 412.86.202301061548-0 (Ootpa)   4.18.0-372.40.1.el8_6.x86_64   cri-o://1.25.1-5.rhaos4.12.git6005903.el8
jfrancoa-3005-dvj5w-windows-worker-a-7qjgp                   Ready                         worker                 80m     v1.25.8+37a9a08   10.0.128.8    <none>        Windows Server 2022 Datacenter                                  10.0.20348.1726                containerd://1.6.19-4-geab1c5444
jfrancoa-3005-dvj5w-windows-worker-a-s9n4d                   Ready                         worker                 71m     v1.25.8+37a9a08   10.0.128.9    <none>        Windows Server 2022 Datacenter                                  10.0.20348.1726                containerd://1.6.19-4-geab1c5444
jfrancoa-3005-dvj5w-worker-a-f9bwh.c.openshift-qe.internal   Ready                         worker                 5h10m   v1.25.4+77bec7a   10.0.128.2    <none>        Red Hat Enterprise Linux CoreOS 412.86.202301061548-0 (Ootpa)   4.18.0-372.40.1.el8_6.x86_64   cri-o://1.25.1-5.rhaos4.12.git6005903.el8
jfrancoa-3005-dvj5w-worker-b-wf59d.c.openshift-qe.internal   Ready                         worker                 5h10m   v1.25.4+77bec7a   10.0.128.3    <none>        Red Hat Enterprise Linux CoreOS 412.86.202301061548-0 (Ootpa)   4.18.0-372.40.1.el8_6.x86_64   cri-o://1.25.1-5.rhaos4.12.git6005903.el8
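To pick the stuck nodes out of a listing like the one above, the STATUS column can be checked for both conditions. This is a rough sketch that assumes the text output of `oc get nodes` (name in the first column, status in the second); the function name `stuck_nodes` is hypothetical and the sample rows are shortened copies of the output above.

```python
def stuck_nodes(oc_get_nodes_output):
    """Return names of nodes whose STATUS has NotReady and SchedulingDisabled."""
    stuck = []
    for line in oc_get_nodes_output.splitlines():
        fields = line.split()
        # Skip the header, blank lines, and wrapped fragments of a row.
        if len(fields) < 2 or fields[0] == "NAME":
            continue
        name, status = fields[0], fields[1]
        conditions = status.split(",")
        if "NotReady" in conditions and "SchedulingDisabled" in conditions:
            stuck.append(name)
    return stuck


# Shortened sample: only the first few columns of the listing are kept.
sample_output = """\
NAME                                         STATUS                        ROLES    AGE
byoh-winc-1.c.openshift-qe.internal          NotReady,SchedulingDisabled   worker   82m
jfrancoa-3005-dvj5w-windows-worker-a-7qjgp   Ready                         worker   80m
"""
print(stuck_nodes(sample_output))
```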

When the OCP payload is upgraded to 4.12.18 or 4.12.19, the error disappears and the WMCO upgrade succeeds. This holds whether the OCP upgrade is done before the WMCO upgrade or while the cluster is already hanging in this state.
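The observation above suggests gating the WMCO upgrade on a minimum OCP version. The sketch below only illustrates such a comparison and is not the operator's implementation; the threshold `ASSUMED_MIN_OCP` is an assumption taken from the lowest version reported to work in this issue, and the exact minimum may differ.

```python
# Assumption: 4.12.18 is the lowest version this report mentions as working.
ASSUMED_MIN_OCP = "4.12.18"


def parse_version(version):
    """Turn 'X.Y.Z' into a tuple of ints so comparison is numeric."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(cluster_version, minimum=ASSUMED_MIN_OCP):
    """Return True if cluster_version is at or above the assumed minimum."""
    return parse_version(cluster_version) >= parse_version(minimum)


print(meets_minimum("4.12.0"))
print(meets_minimum("4.12.19"))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating 4.12.9 as newer than 4.12.18.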

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0    True        False         4h54m   Cluster version is 4.12.0
$ oc get cm -n openshift-windows-machine-config-operator 
NAME                                   DATA   AGE
kube-root-ca.crt                       1      4h55m
openshift-service-ca.crt               1      4h55m
windows-instances                      2      122m
windows-machine-config-operator-lock   0      88m
windows-services-7.1.0-3eaf2b6         2      88m

Windows nodes kernel version:
byoh-winc-0.c.openshift-qe.internal  10.0.20348.1726
byoh-winc-1.c.openshift-qe.internal  10.0.20348.1726
jfrancoa-3005-dvj5w-windows-worker-a-7qjgp 10.0.20348.1726
jfrancoa-3005-dvj5w-worker-a-f9bwh.c.openshift-qe.internal 10.0.20348.1726

How reproducible:

Always. It was reproduced on Platform:None and GCP.

Steps to Reproduce:

1. Deploy an OCP 4.12.0 (https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4-stable/release/4.12.0) cluster. Install WMCO 7.0.1 and create a BYOH Windows node.
2. Upgrade WMCO from 7.0.1 to WMCO 7.1.0
3. Wait for the upgrade to finish
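For step 3, one way to tell whether a node has finished upgrading is to compare the two annotations named in the reconciler error. The following is a minimal sketch that assumes the node object was fetched as JSON (for example with `oc get node <name> -o json`) and loaded into a dict; the function name and the annotation values in the sample are hypothetical.

```python
VERSION_ANNOTATION = "windowsmachineconfig.openshift.io/version"
DESIRED_ANNOTATION = "windowsmachineconfig.openshift.io/desired-version"


def upgrade_finished(node):
    """Return True when the version annotation matches the desired version."""
    annotations = node.get("metadata", {}).get("annotations", {})
    current = annotations.get(VERSION_ANNOTATION)
    desired = annotations.get(DESIRED_ANNOTATION)
    # A missing annotation means the node has not been configured yet.
    return current is not None and current == desired


# Hypothetical node object; the version strings are placeholders.
stuck_node = {
    "metadata": {
        "name": "byoh-winc-1.c.openshift-qe.internal",
        "annotations": {
            VERSION_ANNOTATION: "old-version",
            DESIRED_ANNOTATION: "new-version",
        },
    }
}
print(upgrade_finished(stuck_node))
```

In the failure described here, this check would keep returning False until the timeout, which matches the "timeout waiting for ... annotations to match" error in the WMCO log.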

Actual results:

The upgrade doesn't succeed, leaving the BYOH nodes in NotReady,SchedulingDisabled. No workloads can run on the BYOH nodes.

Expected results:

Upgrade to WMCO 7.1.0 succeeds

Additional info:

links to

openshift/windows-machine-config-operator#1644: OCPBUGS-14260: Check for minimum OCP version

mentioned on

Merge request - Updated US source to: cfb3eef Merge pull request #1644 from aravindhp/OCPBUGS-14260

Assignee: Aravindh Puthiyaparambil (Inactive)

Reporter: Jose Luis Franco Arza (Inactive)

QA Contact: Aharon Rasouli

Votes: 0

Watchers: 6

Created: 2023/05/30 11:17 AM

Updated: 2024/04/29 5:09 PM

Resolved: 2023/07/18 12:18 AM
