OCPBUGS-20082

Modifying Windows node's annotation windowsmachineconfig.openshift.io/version ends up in Ready,SchedulingDisabled


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • 4.12.z
    • 4.13, 4.14
    • Windows Containers
    • None
    • Critical
    • No
    • 0
    • Rejected
    • False

      This is a clone of issue OCPBUGS-13780. The following is the description of the original issue:

      Description of problem:

      
      On a 4.14 OCP cluster with BYOH and Machine Windows Containers nodes, modify the version annotation windowsmachineconfig.openshift.io/version on a Windows node and set it to a version that does not exist, for example:
      oc annotate node ip-10-0-148-206.us-east-2.compute.internal --overwrite windowsmachineconfig.openshift.io/version=invalidVersion
      
      The impacted BYOH/Machine node hangs in Ready,SchedulingDisabled and cannot leave that state, because WMCO is trying to look up the ConfigMap windows-services-invalidVersion, which does not exist:
      
      [jfrancoa@localhost openshift-tests-private]$ oc get cm windows-instances -n openshift-windows-machine-config-operator -o yaml
      apiVersion: v1
      data:
        10.0.148.206: username=Administrator
      kind: ConfigMap
      metadata:
        creationTimestamp: "2023-05-18T07:31:13Z"
        name: windows-instances
        namespace: openshift-windows-machine-config-operator
        resourceVersion: "60960"
        uid: ab745d75-6f53-4919-b930-4edc88e5016d
      [jfrancoa@localhost openshift-tests-private]$ oc get nodes -o wide
      NAME                                         STATUS                     ROLES                  AGE    VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
      ip-10-0-136-0.us-east-2.compute.internal     Ready                      control-plane,master   135m   v1.27.1+20a4409   10.0.136.0     <none>        Red Hat Enterprise Linux CoreOS 414.92.202305162029-0 (Plow)   5.14.0-284.13.1.el9_2.x86_64   cri-o://1.27.0-6.rhaos4.14.git81ac4ce.el9
      ip-10-0-137-97.us-east-2.compute.internal    Ready                      worker                 88m    v1.26.2+0f23833   10.0.137.97    <none>        Windows Server 2019 Datacenter                                 10.0.17763.4252                containerd://1.7.0
      ip-10-0-139-10.us-east-2.compute.internal    Ready                      worker                 126m   v1.27.1+20a4409   10.0.139.10    <none>        Red Hat Enterprise Linux CoreOS 414.92.202305162029-0 (Plow)   5.14.0-284.13.1.el9_2.x86_64   cri-o://1.27.0-6.rhaos4.14.git81ac4ce.el9
      ip-10-0-148-206.us-east-2.compute.internal   Ready,SchedulingDisabled   worker                 23m    v1.26.2+0f23833   10.0.148.206   <none>        Windows Server 2019 Datacenter                                 10.0.17763.4252                containerd://1.7.0
      ip-10-0-149-147.us-east-2.compute.internal   Ready                      worker                 83m    v1.26.2+0f23833   10.0.149.147   <none>        Windows Server 2019 Datacenter                                 10.0.17763.4252                containerd://1.7.0
      ip-10-0-185-94.us-east-2.compute.internal    Ready                      worker                 127m   v1.27.1+20a4409   10.0.185.94    <none>        Red Hat Enterprise Linux CoreOS 414.92.202305162029-0 (Plow)   5.14.0-284.13.1.el9_2.x86_64   cri-o://1.27.0-6.rhaos4.14.git81ac4ce.el9
      ip-10-0-191-102.us-east-2.compute.internal   Ready                      control-plane,master   134m   v1.27.1+20a4409   10.0.191.102   <none>        Red Hat Enterprise Linux CoreOS 414.92.202305162029-0 (Plow)   5.14.0-284.13.1.el9_2.x86_64   cri-o://1.27.0-6.rhaos4.14.git81ac4ce.el9
      ip-10-0-212-18.us-east-2.compute.internal    Ready                      worker                 124m   v1.27.1+20a4409   10.0.212.18    <none>        Red Hat Enterprise Linux CoreOS 414.92.202305162029-0 (Plow)   5.14.0-284.13.1.el9_2.x86_64   cri-o://1.27.0-6.rhaos4.14.git81ac4ce.el9
      ip-10-0-213-56.us-east-2.compute.internal    Ready                      control-plane,master   134m   v1.27.1+20a4409   10.0.213.56    <none>        Red Hat Enterprise Linux CoreOS 414.92.202305162029-0 (Plow)   5.14.0-284.13.1.el9_2.x86_64   cri-o://1.27.0-6.rhaos4.14.git81ac4ce.el9
      
      WMCO logs:
      
      {"level":"info","ts":"2023-05-18T07:44:13Z","logger":"nc 10.0.148.206","msg":"instance has been configured as a worker node","version":"9.0.0-0ecb2e1"}
      {"level":"info","ts":"2023-05-18T07:44:13Z","logger":"metrics","msg":"Prometheus configured","endpoints":"windows-exporter","port":9182,"name":"metrics"}
      {"level":"info","ts":"2023-05-18T07:45:13Z","logger":"controllers.configmap","msg":"processing","instances in":"windows-instances"}
      {"level":"info","ts":"2023-05-18T07:45:13Z","logger":"controllers.configmap","msg":"instance is up to date","node":"ip-10-0-148-206.us-east-2.compute.internal","version":"9.0.0-0ecb2e1"}
      {"level":"info","ts":"2023-05-18T07:45:13Z","logger":"metrics","msg":"Prometheus configured","endpoints":"windows-exporter","port":9182,"name":"metrics"}
      {"level":"info","ts":"2023-05-18T07:46:05Z","logger":"controllers.configmap","msg":"processing","instances in":"windows-instances"}
      {"level":"info","ts":"2023-05-18T07:46:06Z","logger":"controllers.configmap","msg":"instance requires upgrade","node":"ip-10-0-148-206.us-east-2.compute.internal","version":"invalidVersion","expected version":"9.0.0-0ecb2e1"}
      {"level":"info","ts":"2023-05-18T07:46:16Z","logger":"nc 10.0.148.206","msg":"evicting pod winc-42484/win-webserver-768b7bc78d-smjhc\n"}
      {"level":"info","ts":"2023-05-18T07:46:16Z","logger":"nc 10.0.148.206","msg":"evicting pod winc-42484/win-webserver-768b7bc78d-2vwxs\n"}
      {"level":"info","ts":"2023-05-18T07:46:16Z","logger":"nc 10.0.148.206","msg":"evicting pod winc-42484/win-webserver-768b7bc78d-857wc\n"}
      {"level":"info","ts":"2023-05-18T07:46:16Z","logger":"nc 10.0.148.206","msg":"evicting pod winc-42484/win-webserver-768b7bc78d-gq7g5\n"}
      {"level":"info","ts":"2023-05-18T07:46:16Z","logger":"nc 10.0.148.206","msg":"evicting pod winc-42484/win-webserver-768b7bc78d-pzkdd\n"}
      {"level":"info","ts":"2023-05-18T07:46:16Z","logger":"wc 10.0.148.206","msg":"deconfiguring"}
      {"level":"info","ts":"2023-05-18T07:46:47Z","logger":"wc 10.0.148.206","msg":"deconfigured","service":"windows-instance-config-daemon"}
      {"level":"error","ts":"2023-05-18T07:46:52Z","logger":"wc 10.0.148.206","msg":"error running","cmd":"powershell.exe -NonInteractive -ExecutionPolicy Bypass \"C:\\k\\windows-instance-config-daemon.exe cleanup --kubeconfig C:\\k\\wicd-kubeconfig --namespace openshift-windows-machine-config-operator\"","out":"F0518 07:46:52.825267    2756 cleanup.go:51] configmaps \"windows-services-invalidVersion\" not found\n","error":"Process exited with status 1","stacktrace":"github.com/openshift/windows-machine-config-operator/pkg/windows.(*windows).Run\n\t/remote-source/build/windows-machine-config-operator/pkg/windows/windows.go:381\ngithub.com/openshift/windows-machine-config-operator/pkg/windows.(*windows).RunWICDCleanup\n\t/remote-source/build/windows-machine-config-operator/pkg/windows/windows.go:408\ngithub.com/openshift/windows-machine-config-operator/pkg/windows.(*windows).Deconfigure\n\t/remote-source/build/windows-machine-config-operator/pkg/windows/windows.go:418\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).Deconfigure\n\t/remote-source/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:485\ngithub.com/openshift/windows-machine-config-operator/controllers.(*instanceReconciler).ensureInstanceIsUpToDate\n\t/remote-source/build/windows-machine-config-operator/controllers/controllers.go:79\ngithub.com/openshift/windows-machine-config-operator/controllers.(*ConfigMapReconciler).ensureInstancesAreUpToDate\n\t/remote-source/build/windows-machine-config-operator/controllers/configmap_controller.go:314\ngithub.com/openshift/windows-machine-config-operator/controllers.(*ConfigMapReconciler).reconcileNodes\n\t/remote-source/build/windows-machine-config-operator/controllers/configmap_controller.go:279\ngithub.com/openshift/windows-machine-config-operator/controllers.(*ConfigMapReconciler).Reconcile\n\t/remote-source/build/windows-machine-config-operator/controllers/configmap_controller.go:189\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235"}
      {"level":"info","ts":"2023-05-18T07:46:52Z","logger":"wc 10.0.148.206","msg":"failed to cleanup node","command":"C:\\k\\windows-instance-config-daemon.exe cleanup --kubeconfig C:\\k\\wicd-kubeconfig --namespace openshift-windows-machine-config-operator","output":"F0518 07:46:52.825267    2756 cleanup.go:51] configmaps \"windows-services-invalidVersion\" not found\n"}
      {"level":"error","ts":"2023-05-18T07:46:52Z","msg":"Reconciler error","controller":"configmap","controllerGroup":"","controllerKind":"ConfigMap","ConfigMap":{"name":"windows-instances","namespace":"openshift-windows-machine-config-operator"},"namespace":"openshift-windows-machine-config-operator","name":"windows-instances","reconcileID":"6cf1733d-f108-43b9-832e-ee4400e00eef","error":"error configuring host with address 10.0.148.206: error deconfiguring instance: unable to cleanup the Windows instance: error running powershell.exe -NonInteractive -ExecutionPolicy Bypass \"C:\\k\\windows-instance-config-daemon.exe cleanup --kubeconfig C:\\k\\wicd-kubeconfig --namespace openshift-windows-machine-config-operator\": Process exited with status 1","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235"}
      {"level":"info","ts":"2023-05-18T07:46:52Z","logger":"controllers.configmap","msg":"processing","instances in":"windows-instances"}
      {"level":"info","ts":"2023-05-18T07:46:53Z","logger":"controllers.configmap","msg":"instance requires upgrade","node":"ip-10-0-148-206.us-east-2.compute.internal","version":"invalidVersion","expected version":"9.0.0-0ecb2e1"}
      
      

      Version-Release number of selected component (if applicable):

      [jfrancoa@localhost openshift-tests-private]$ oc get cm -n openshift-windows-machine-config-operator 
      NAME                                   DATA   AGE
      kube-root-ca.crt                       1      102m
      openshift-service-ca.crt               1      102m
      windows-instances                      1      35m
      windows-machine-config-operator-lock   0      101m
      windows-services-9.0.0-0ecb2e1         2      101m
      [jfrancoa@localhost openshift-tests-private]$ oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.14.0-0.nightly-2023-05-18-040905   True        False         113m    Cluster version is 4.14.0-0.nightly-2023-05-18-040905
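
The ConfigMap listing above contains windows-services-9.0.0-0ecb2e1 but nothing for invalidVersion, which matches the "not found" error in the WMCO logs. A minimal sketch of that lookup as a pre-check follows. It assumes the ConfigMap name is always "windows-services-" followed by the annotation value, which is inferred only from the two names that appear in this report. The function names are hypothetical and are not part of WMCO or oc.

```shell
# Hypothetical helpers; the names are illustrative, not part of WMCO or oc.

# Build the services ConfigMap name for version string $1.
# The pattern is an assumption inferred from the names seen in this report.
services_cm_name() {
  printf 'windows-services-%s\n' "$1"
}

# Read ConfigMap names from stdin (one per line) and succeed only if the
# services ConfigMap for version $1 is among them (exact, whole-line match).
version_has_services_cm() {
  grep -Fxq -- "$(services_cm_name "$1")"
}
```

Against a live cluster this could be fed from the same namespace as above, for example: oc get cm -n openshift-windows-machine-config-operator -o name | sed 's|^configmap/||' | version_has_services_cm invalidVersion || echo "no services ConfigMap for that version"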
      
      

      How reproducible:

      Always
      

      Steps to Reproduce:

      1. Deploy an OCP 4.14 cluster. Install WMCO on it.
      2. Create a BYOH or Machine Windows node.
      3. Modify the version annotation and set it to invalidVersion: oc annotate node ip-10-0-148-206.us-east-2.compute.internal --overwrite windowsmachineconfig.openshift.io/version=invalidVersion
      4. Wait for the node to reconcile
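
Steps 3 and 4 above can be sketched as a small script. This is an illustration under stated assumptions, not a verified test case: the function names, the retry count, and the delay are hypothetical, and "recovered" is taken to mean that the annotation no longer holds the injected value and the node is no longer cordoned. It uses only oc annotate (as in step 3) and oc get with a JSONPath output.

```shell
# Sketch of steps 3 and 4; function names and defaults are hypothetical.
VERSION_KEY='windowsmachineconfig.openshift.io/version'

# Step 3: overwrite the version annotation on node $1 with value $2.
set_version_annotation() {
  oc annotate node "$1" --overwrite "${VERSION_KEY}=$2"
}

# Succeed once node $1 no longer carries version $2 and is not cordoned.
# Dots inside the annotation key are escaped for JSONPath.
node_recovered() {
  current=$(oc get node "$1" -o jsonpath='{.metadata.annotations.windowsmachineconfig\.openshift\.io/version}') || return 1
  cordoned=$(oc get node "$1" -o jsonpath='{.spec.unschedulable}') || return 1
  [ -n "$current" ] && [ "$current" != "$2" ] && [ "$cordoned" != "true" ]
}

# Step 4: poll node_recovered up to $3 times (default 30), $4 seconds apart (default 10).
wait_for_recovery() {
  tries=${3:-30}
  delay=${4:-10}
  while [ "$tries" -gt 0 ]; do
    node_recovered "$1" "$2" && return 0
    tries=$((tries - 1))
    sleep "$delay"
  done
  return 1
}
```

A run would call set_version_annotation with the node name and invalidVersion, then wait_for_recovery with the same two arguments. On the build described in this report the wait would be expected to time out, since the node stays in Ready,SchedulingDisabled.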
      

      Actual results:

      The node does not reconcile and hangs in Ready,SchedulingDisabled forever.
      

      Expected results:

      The node reconciles and the invalid version annotation is restored to the expected version.
      

      Additional info:

      This is a regression. This functionality was working in all previous versions.
      

              rh-ee-mankulka Mansi Kulkarni
              openshift-crt-jira-prow OpenShift Prow Bot
              Aharon Rasouli Aharon Rasouli