OpenShift Bugs / OCPBUGS-14260

Upgrade from WMCO 7.0.1 to 7.1.0 not working on Windows BYOH nodes: error waiting for proper windowsmachineconfig.openshift.io/version annotation for node


    • Bug
    • Resolution: Done
    • Critical
    • 4.12.z
    • 4.12.z, 4.12
    • Windows Containers
    • None
    • No
    • 3
    • WINC - Sprint 237, WINC - Sprint 238
    • 2
    • False

      Description of problem:

      On an OCP cluster running the 4.12.0 GA release with Windows BYOH nodes, the seamless upgrade from WMCO 7.0.1 to WMCO 7.1.0 does not work. The upgrade hangs while trying to upgrade the BYOH nodes, and WMCO logs the following errors:
      {"level":"info","ts":"2023-05-30T10:49:09Z","logger":"nc 10.0.128.6","msg":"Unable to mark node as NotReady","error":"error running powershell.exe -NonInteractive -ExecutionPolicy Bypass \"C:\\k\\windows-instance-config-daemon.exe cleanup --api-server https://api-int.jfrancoa-3005.qe.gcp.devcluster.openshift.com:6443 --sa-ca C:\\k\\sa-ca.crt --sa-token C:\\k\\sa-token --namespace openshift-windows-machine-config-operator\": Process exited with status 1"}
      {"level":"error","ts":"2023-05-30T10:49:09Z","msg":"Reconciler error","controller":"configmap","controllerGroup":"","controllerKind":"ConfigMap","configMap":{"name":"windows-instances","namespace":"openshift-windows-machine-config-operator"},"namespace":"openshift-windows-machine-config-operator","name":"windows-instances","reconcileID":"08536e66-5946-49ce-b225-fd5887089d63","error":"error configuring host with address 10.0.128.6: error waiting for proper windowsmachineconfig.openshift.io/version annotation for node byoh-winc-1.c.openshift-qe.internal: timeout waiting for windowsmachineconfig.openshift.io/version and windowsmachineconfig.openshift.io/desired-version annotations to match on node byoh-winc-1.c.openshift-qe.internal: timed out waiting for the condition","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}
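The second log entry reports a timeout while waiting for the two version annotations on the node to match. A minimal sketch of that comparison, assuming the node is available as the parsed JSON of `oc get node <name> -o json`; the helper names and the sample values are hypothetical, and this is not WMCO's own code:

```python
# Annotation keys copied from the WMCO error message above.
VERSION = "windowsmachineconfig.openshift.io/version"
DESIRED = "windowsmachineconfig.openshift.io/desired-version"


def version_annotations(node: dict) -> tuple:
    """Return (version, desired_version) from a Node object; missing keys give None."""
    annotations = node.get("metadata", {}).get("annotations") or {}
    return annotations.get(VERSION), annotations.get(DESIRED)


def is_upgrade_pending(node: dict) -> bool:
    """True when either annotation is absent or the two values differ."""
    current, desired = version_annotations(node)
    return current is None or desired is None or current != desired


# Hypothetical node object; the annotation values are illustrative only.
example_node = {
    "metadata": {
        "name": "byoh-winc-1.c.openshift-qe.internal",
        "annotations": {VERSION: "old-version", DESIRED: "new-version"},
    }
}
print(is_upgrade_pending(example_node))
```

To run this against a live cluster, the dictionary can be built with `json.loads` from the output of `oc get node <name> -o json`.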
      
      The BYOH nodes get stuck in NotReady,SchedulingDisabled and the upgrade does not progress:
      
      $ oc get nodes -o wide
      NAME                                                         STATUS                        ROLES                  AGE     VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
      byoh-winc-1.c.openshift-qe.internal                          NotReady,SchedulingDisabled   worker                 82m     v1.25.8+37a9a08   10.0.128.6    <none>        Windows Server 2022 Datacenter                                  10.0.20348.1726                containerd://1.6.19-4-geab1c5444
      jfrancoa-3005-dvj5w-master-0.c.openshift-qe.internal         Ready                         control-plane,master   5h21m   v1.25.4+77bec7a   10.0.0.5      <none>        Red Hat Enterprise Linux CoreOS 412.86.202301061548-0 (Ootpa)   4.18.0-372.40.1.el8_6.x86_64   cri-o://1.25.1-5.rhaos4.12.git6005903.el8
      jfrancoa-3005-dvj5w-master-1.c.openshift-qe.internal         Ready                         control-plane,master   5h20m   v1.25.4+77bec7a   10.0.0.3      <none>        Red Hat Enterprise Linux CoreOS 412.86.202301061548-0 (Ootpa)   4.18.0-372.40.1.el8_6.x86_64   cri-o://1.25.1-5.rhaos4.12.git6005903.el8
      jfrancoa-3005-dvj5w-master-2.c.openshift-qe.internal         Ready                         control-plane,master   5h20m   v1.25.4+77bec7a   10.0.0.4      <none>        Red Hat Enterprise Linux CoreOS 412.86.202301061548-0 (Ootpa)   4.18.0-372.40.1.el8_6.x86_64   cri-o://1.25.1-5.rhaos4.12.git6005903.el8
      jfrancoa-3005-dvj5w-windows-worker-a-7qjgp                   Ready                         worker                 80m     v1.25.8+37a9a08   10.0.128.8    <none>        Windows Server 2022 Datacenter                                  10.0.20348.1726                containerd://1.6.19-4-geab1c5444
      jfrancoa-3005-dvj5w-windows-worker-a-s9n4d                   Ready                         worker                 71m     v1.25.8+37a9a08   10.0.128.9    <none>        Windows Server 2022 Datacenter                                  10.0.20348.1726                containerd://1.6.19-4-geab1c5444
      jfrancoa-3005-dvj5w-worker-a-f9bwh.c.openshift-qe.internal   Ready                         worker                 5h10m   v1.25.4+77bec7a   10.0.128.2    <none>        Red Hat Enterprise Linux CoreOS 412.86.202301061548-0 (Ootpa)   4.18.0-372.40.1.el8_6.x86_64   cri-o://1.25.1-5.rhaos4.12.git6005903.el8
      jfrancoa-3005-dvj5w-worker-b-wf59d.c.openshift-qe.internal   Ready                         worker                 5h10m   v1.25.4+77bec7a   10.0.128.3    <none>        Red Hat Enterprise Linux CoreOS 412.86.202301061548-0 (Ootpa)   4.18.0-372.40.1.el8_6.x86_64   cri-o://1.25.1-5.rhaos4.12.git6005903.el8
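For a quick check of which nodes are in the stuck state, the table printed by `oc get nodes` can be filtered on its STATUS column. The sketch below assumes the default column layout shown above (NAME first, STATUS second, neither containing spaces); `stuck_nodes` is a hypothetical helper and the sample is an abbreviated copy of the listing:

```python
def stuck_nodes(table: str) -> list:
    """Return names of nodes whose STATUS has both NotReady and SchedulingDisabled."""
    stuck = []
    # Skip the header row; ignore blank or wrapped lines with fewer than two columns.
    for line in table.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) < 2:
            continue
        name, status = fields[0], fields[1]
        conditions = status.split(",")
        if "NotReady" in conditions and "SchedulingDisabled" in conditions:
            stuck.append(name)
    return stuck


# Abbreviated rows taken from the listing above.
SAMPLE = """\
NAME                                                   STATUS                        ROLES                  AGE     VERSION
byoh-winc-1.c.openshift-qe.internal                    NotReady,SchedulingDisabled   worker                 82m     v1.25.8+37a9a08
jfrancoa-3005-dvj5w-master-0.c.openshift-qe.internal   Ready                         control-plane,master   5h21m   v1.25.4+77bec7a
jfrancoa-3005-dvj5w-windows-worker-a-7qjgp             Ready                         worker                 80m     v1.25.8+37a9a08
"""

print(stuck_nodes(SAMPLE))
```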
      
      Upgrading the OCP payload to 4.12.18 or 4.12.19 makes the error disappear and the upgrade succeeds. This holds whether the payload upgrade is done before the WMCO upgrade or while the cluster is already hanging in this state.
      
      

      Version-Release number of selected component (if applicable):

      $ oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.0    True        False         4h54m   Cluster version is 4.12.0
      $ oc get cm -n openshift-windows-machine-config-operator 
      NAME                                   DATA   AGE
      kube-root-ca.crt                       1      4h55m
      openshift-service-ca.crt               1      4h55m
      windows-instances                      2      122m
      windows-machine-config-operator-lock   0      88m
      windows-services-7.1.0-3eaf2b6         2      88m
      
      Windows nodes kernel version:
      byoh-winc-0.c.openshift-qe.internal  10.0.20348.1726
      byoh-winc-1.c.openshift-qe.internal  10.0.20348.1726
      jfrancoa-3005-dvj5w-windows-worker-a-7qjgp 10.0.20348.1726
      jfrancoa-3005-dvj5w-worker-a-f9bwh.c.openshift-qe.internal 10.0.20348.1726
      
      

      How reproducible:

      Always. It was reproduced on Platform:None and GCP.
      
      

      Steps to Reproduce:

      1. Deploy an OCP 4.12.0 (https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4-stable/release/4.12.0) cluster. Install WMCO 7.0.1 and create a BYOH Windows node.
      2. Upgrade WMCO from 7.0.1 to WMCO 7.1.0
      3. Wait for the upgrade to finish
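When scripting step 3, it helps to bound the wait so that a hang like the one described here surfaces as a failure instead of blocking forever. A minimal sketch of such a bounded poll; `wait_for` and its parameters are hypothetical, and the condition callable is left to the caller (for example, a check that no Windows node is NotReady):

```python
import time


def wait_for(condition, timeout_s: float, interval_s: float = 5.0) -> bool:
    """Poll condition() until it returns True or timeout_s elapses.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Illustrative condition that becomes true on the third poll.
results = iter([False, False, True])
print(wait_for(lambda: next(results), timeout_s=1.0, interval_s=0.01))
```

The condition is always evaluated at least once, so a zero timeout still reports an upgrade that has already finished.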
      

      Actual results:

      The upgrade doesn't succeed, leaving the BYOH nodes in NotReady,SchedulingDisabled. No workloads can run on the BYOH nodes.
      
      

      Expected results:

      Upgrade to WMCO 7.1.0 succeeds
      

      Additional info:

      
      

            paravindh Aravindh Puthiyaparambil
            rhn-engineering-jfrancoa Jose Luis Franco Arza (Inactive)
            Aharon Rasouli Aharon Rasouli