OpenShift Bugs: OCPBUGS-13780

Modifying Windows node's annotation windowsmachineconfig.openshift.io/version ends up in Ready,SchedulingDisabled


    • Bug
    • Resolution: Done-Errata
    • Critical
    • 4.14.0
    • 4.13, 4.14
    • Windows Containers
    • None
    • Critical
    • No
    • 3
    • WINC - Sprint 237, WINC - Sprint 238
    • 2
    • False
    • None
    • Fixes an issue that occurred when the node annotation windowsmachineconfig.openshift.io/version was unexpectedly changed, which caused the node to enter an unusable state. The dependency on a valid version annotation was removed, allowing the annotation to be corrected and the node to return to the proper state.
    • Bug Fix

      Description of problem:

      
      On a 4.14 OCP cluster with BYOH and Machine Windows Containers nodes, modify the version annotation windowsmachineconfig.openshift.io/version and set it to a version that does not exist, for example:
      oc annotate node ip-10-0-148-206.us-east-2.compute.internal --overwrite windowsmachineconfig.openshift.io/version=invalidVersion
      
      The impacted BYOH/Machine node hangs in Ready,SchedulingDisabled and cannot leave that state, because the cleanup invoked by WMCO looks for the ConfigMap windows-services-invalidVersion, which does not exist:
      
      [jfrancoa@localhost openshift-tests-private]$ oc get cm windows-instances -n openshift-windows-machine-config-operator -o yaml
      apiVersion: v1
      data:
        10.0.148.206: username=Administrator
      kind: ConfigMap
      metadata:
        creationTimestamp: "2023-05-18T07:31:13Z"
        name: windows-instances
        namespace: openshift-windows-machine-config-operator
        resourceVersion: "60960"
        uid: ab745d75-6f53-4919-b930-4edc88e5016d
      [jfrancoa@localhost openshift-tests-private]$ oc get nodes -o wide
      NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
      ip-10-0-136-0.us-east-2.compute.internal Ready control-plane,master 135m v1.27.1+20a4409 10.0.136.0 <none> Red Hat Enterprise Linux CoreOS 414.92.202305162029-0 (Plow) 5.14.0-284.13.1.el9_2.x86_64 cri-o://1.27.0-6.rhaos4.14.git81ac4ce.el9
      ip-10-0-137-97.us-east-2.compute.internal Ready worker 88m v1.26.2+0f23833 10.0.137.97 <none> Windows Server 2019 Datacenter 10.0.17763.4252 containerd://1.7.0
      ip-10-0-139-10.us-east-2.compute.internal Ready worker 126m v1.27.1+20a4409 10.0.139.10 <none> Red Hat Enterprise Linux CoreOS 414.92.202305162029-0 (Plow) 5.14.0-284.13.1.el9_2.x86_64 cri-o://1.27.0-6.rhaos4.14.git81ac4ce.el9
      ip-10-0-148-206.us-east-2.compute.internal Ready,SchedulingDisabled worker 23m v1.26.2+0f23833 10.0.148.206 <none> Windows Server 2019 Datacenter 10.0.17763.4252 containerd://1.7.0
      ip-10-0-149-147.us-east-2.compute.internal Ready worker 83m v1.26.2+0f23833 10.0.149.147 <none> Windows Server 2019 Datacenter 10.0.17763.4252 containerd://1.7.0
      ip-10-0-185-94.us-east-2.compute.internal Ready worker 127m v1.27.1+20a4409 10.0.185.94 <none> Red Hat Enterprise Linux CoreOS 414.92.202305162029-0 (Plow) 5.14.0-284.13.1.el9_2.x86_64 cri-o://1.27.0-6.rhaos4.14.git81ac4ce.el9
      ip-10-0-191-102.us-east-2.compute.internal Ready control-plane,master 134m v1.27.1+20a4409 10.0.191.102 <none> Red Hat Enterprise Linux CoreOS 414.92.202305162029-0 (Plow) 5.14.0-284.13.1.el9_2.x86_64 cri-o://1.27.0-6.rhaos4.14.git81ac4ce.el9
      ip-10-0-212-18.us-east-2.compute.internal Ready worker 124m v1.27.1+20a4409 10.0.212.18 <none> Red Hat Enterprise Linux CoreOS 414.92.202305162029-0 (Plow) 5.14.0-284.13.1.el9_2.x86_64 cri-o://1.27.0-6.rhaos4.14.git81ac4ce.el9
      ip-10-0-213-56.us-east-2.compute.internal Ready control-plane,master 134m v1.27.1+20a4409 10.0.213.56 <none> Red Hat Enterprise Linux CoreOS 414.92.202305162029-0 (Plow) 5.14.0-284.13.1.el9_2.x86_64 cri-o://1.27.0-6.rhaos4.14.git81ac4ce.el9
      
      WMCO logs:
      
      {"level":"info","ts":"2023-05-18T07:44:13Z","logger":"nc 10.0.148.206","msg":"instance has been configured as a worker node","version":"9.0.0-0ecb2e1"}
      {"level":"info","ts":"2023-05-18T07:44:13Z","logger":"metrics","msg":"Prometheus configured","endpoints":"windows-exporter","port":9182,"name":"metrics"}
      {"level":"info","ts":"2023-05-18T07:45:13Z","logger":"controllers.configmap","msg":"processing","instances in":"windows-instances"}
      {"level":"info","ts":"2023-05-18T07:45:13Z","logger":"controllers.configmap","msg":"instance is up to date","node":"ip-10-0-148-206.us-east-2.compute.internal","version":"9.0.0-0ecb2e1"}
      {"level":"info","ts":"2023-05-18T07:45:13Z","logger":"metrics","msg":"Prometheus configured","endpoints":"windows-exporter","port":9182,"name":"metrics"}
      {"level":"info","ts":"2023-05-18T07:46:05Z","logger":"controllers.configmap","msg":"processing","instances in":"windows-instances"}
      {"level":"info","ts":"2023-05-18T07:46:06Z","logger":"controllers.configmap","msg":"instance requires upgrade","node":"ip-10-0-148-206.us-east-2.compute.internal","version":"invalidVersion","expected version":"9.0.0-0ecb2e1"}
      {"level":"info","ts":"2023-05-18T07:46:16Z","logger":"nc 10.0.148.206","msg":"evicting pod winc-42484/win-webserver-768b7bc78d-smjhc\n"}
      {"level":"info","ts":"2023-05-18T07:46:16Z","logger":"nc 10.0.148.206","msg":"evicting pod winc-42484/win-webserver-768b7bc78d-2vwxs\n"}
      {"level":"info","ts":"2023-05-18T07:46:16Z","logger":"nc 10.0.148.206","msg":"evicting pod winc-42484/win-webserver-768b7bc78d-857wc\n"}
      {"level":"info","ts":"2023-05-18T07:46:16Z","logger":"nc 10.0.148.206","msg":"evicting pod winc-42484/win-webserver-768b7bc78d-gq7g5\n"}
      {"level":"info","ts":"2023-05-18T07:46:16Z","logger":"nc 10.0.148.206","msg":"evicting pod winc-42484/win-webserver-768b7bc78d-pzkdd\n"}
      {"level":"info","ts":"2023-05-18T07:46:16Z","logger":"wc 10.0.148.206","msg":"deconfiguring"}
      {"level":"info","ts":"2023-05-18T07:46:47Z","logger":"wc 10.0.148.206","msg":"deconfigured","service":"windows-instance-config-daemon"}
      {"level":"error","ts":"2023-05-18T07:46:52Z","logger":"wc 10.0.148.206","msg":"error running","cmd":"powershell.exe -NonInteractive -ExecutionPolicy Bypass \"C:\\k\\windows-instance-config-daemon.exe cleanup --kubeconfig C:\\k\\wicd-kubeconfig --namespace openshift-windows-machine-config-operator\"","out":"F0518 07:46:52.825267 2756 cleanup.go:51] configmaps \"windows-services-invalidVersion\" not found\n","error":"Process exited with status 1","stacktrace":"github.com/openshift/windows-machine-config-operator/pkg/windows.(*windows).Run\n\t/remote-source/build/windows-machine-config-operator/pkg/windows/windows.go:381\ngithub.com/openshift/windows-machine-config-operator/pkg/windows.(*windows).RunWICDCleanup\n\t/remote-source/build/windows-machine-config-operator/pkg/windows/windows.go:408\ngithub.com/openshift/windows-machine-config-operator/pkg/windows.(*windows).Deconfigure\n\t/remote-source/build/windows-machine-config-operator/pkg/windows/windows.go:418\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).Deconfigure\n\t/remote-source/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:485\ngithub.com/openshift/windows-machine-config-operator/controllers.(*instanceReconciler).ensureInstanceIsUpToDate\n\t/remote-source/build/windows-machine-config-operator/controllers/controllers.go:79\ngithub.com/openshift/windows-machine-config-operator/controllers.(*ConfigMapReconciler).ensureInstancesAreUpToDate\n\t/remote-source/build/windows-machine-config-operator/controllers/configmap_controller.go:314\ngithub.com/openshift/windows-machine-config-operator/controllers.(*ConfigMapReconciler).reconcileNodes\n\t/remote-source/build/windows-machine-config-operator/controllers/configmap_controller.go:279\ngithub.com/openshift/windows-machine-config-operator/controllers.(*ConfigMapReconciler).Reconcile\n\t/remote-source/build/windows-machine-config-operator/controllers/configmap_controller.go:189\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235"}
      {"level":"info","ts":"2023-05-18T07:46:52Z","logger":"wc 10.0.148.206","msg":"failed to cleanup node","command":"C:\\k\\windows-instance-config-daemon.exe cleanup --kubeconfig C:\\k\\wicd-kubeconfig --namespace openshift-windows-machine-config-operator","output":"F0518 07:46:52.825267 2756 cleanup.go:51] configmaps \"windows-services-invalidVersion\" not found\n"}
      {"level":"error","ts":"2023-05-18T07:46:52Z","msg":"Reconciler error","controller":"configmap","controllerGroup":"","controllerKind":"ConfigMap","ConfigMap":{"name":"windows-instances","namespace":"openshift-windows-machine-config-operator"},"namespace":"openshift-windows-machine-config-operator","name":"windows-instances","reconcileID":"6cf1733d-f108-43b9-832e-ee4400e00eef","error":"error configuring host with address 10.0.148.206: error deconfiguring instance: unable to cleanup the Windows instance: error running powershell.exe -NonInteractive -ExecutionPolicy Bypass \"C:\\k\\windows-instance-config-daemon.exe cleanup --kubeconfig C:\\k\\wicd-kubeconfig --namespace openshift-windows-machine-config-operator\": Process exited with status 1","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235"}
      {"level":"info","ts":"2023-05-18T07:46:52Z","logger":"controllers.configmap","msg":"processing","instances in":"windows-instances"}
      {"level":"info","ts":"2023-05-18T07:46:53Z","logger":"controllers.configmap","msg":"instance requires upgrade","node":"ip-10-0-148-206.us-east-2.compute.internal","version":"invalidVersion","expected version":"9.0.0-0ecb2e1"}
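
      When triaging logs like the ones above, it can help to pull out which services ConfigMap the cleanup was looking for. The following is a minimal sketch, not part of WMCO: the function name missing_configmaps is hypothetical, and it assumes each log entry is a single unwrapped JSON object per line that carries the daemon output in an "out" or "output" field, as in the entries shown here.

```python
import json
import re

# Matches the fatal message printed when the services ConfigMap lookup fails,
# as it appears in the "out"/"output" fields of the log entries above.
MISSING_CM = re.compile(r'configmaps "(windows-services-[^"]+)" not found')


def missing_configmaps(log_lines):
    """Return the set of services ConfigMap names reported as not found.

    log_lines is an iterable of strings, one JSON log entry per line
    (an assumption; wrapped or non-JSON lines are skipped).
    """
    found = set()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a complete JSON entry, ignore it
        if not isinstance(entry, dict):
            continue
        for key in ("out", "output"):
            match = MISSING_CM.search(str(entry.get(key, "")))
            if match:
                found.add(match.group(1))
    return found
```

      Fed the WMCO log above, this would be expected to surface windows-services-invalidVersion, which points straight at the modified annotation.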
      
      

      Version-Release number of selected component (if applicable):

      [jfrancoa@localhost openshift-tests-private]$ oc get cm -n openshift-windows-machine-config-operator
      NAME DATA AGE
      kube-root-ca.crt 1 102m
      openshift-service-ca.crt 1 102m
      windows-instances 1 35m
      windows-machine-config-operator-lock 0 101m
      windows-services-9.0.0-0ecb2e1 2 101m
      [jfrancoa@localhost openshift-tests-private]$ oc get clusterversion
      NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
      version 4.14.0-0.nightly-2023-05-18-040905 True False 113m Cluster version is 4.14.0-0.nightly-2023-05-18-040905
      
      

      How reproducible:

      Always
      

      Steps to Reproduce:

      1. Deploy an OCP 4.14 cluster. Install WMCO on it.
      2. Create a BYOH or Machine Windows node
      3. Modify the version annotation and set it to invalidVersion: oc annotate node ip-10-0-148-206.us-east-2.compute.internal --overwrite windowsmachineconfig.openshift.io/version=invalidVersion
      4. Wait for the node to reconcile
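
      Steps 3 and 4 can be scripted. The sketch below is illustrative only: annotate_command and node_status are hypothetical helper names. It builds the oc annotate invocation from step 3 and derives the status string from a node object (for example, parsed from the output of oc get node <name> -o json), assuming the usual convention that a node with spec.unschedulable set is reported as SchedulingDisabled.

```python
import shlex

VERSION_ANNOTATION = "windowsmachineconfig.openshift.io/version"


def annotate_command(node_name, version):
    """Build the argv for the `oc annotate` call used in step 3."""
    return [
        "oc", "annotate", "node", node_name,
        "--overwrite", f"{VERSION_ANNOTATION}={version}",
    ]


def node_status(node):
    """Summarize a node dict as e.g. 'Ready' or 'Ready,SchedulingDisabled'."""
    conditions = node.get("status", {}).get("conditions", [])
    ready = any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )
    parts = ["Ready" if ready else "NotReady"]
    # A cordoned node carries spec.unschedulable=true.
    if node.get("spec", {}).get("unschedulable"):
        parts.append("SchedulingDisabled")
    return ",".join(parts)


if __name__ == "__main__":
    cmd = annotate_command(
        "ip-10-0-148-206.us-east-2.compute.internal", "invalidVersion"
    )
    print(shlex.join(cmd))
```

      Polling node_status after running the command gives a simple way to tell whether the node ever leaves Ready,SchedulingDisabled.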
      

      Actual results:

      The node does not reconcile and hangs in Ready,SchedulingDisabled forever.
      

      Expected results:

      The node reconciles and the version annotation is restored to a valid value.
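
      One way to picture the expected behavior is a cleanup that does not depend on the annotation being valid. The sketch below is a hypothetical illustration, not the actual WMCO or WICD change: cleanup_configmap is an invented name, and the fallback to the operator's expected version is an assumption. Only the windows-services-<version> naming pattern is taken from the ConfigMap names shown in this report.

```python
SERVICES_PREFIX = "windows-services-"


def cleanup_configmap(annotated_version, expected_version, existing_configmaps):
    """Pick the services ConfigMap to use when deconfiguring a node.

    If the ConfigMap named after the node's version annotation exists, use it.
    Otherwise fall back to the one for the operator's expected version
    (assumed behavior), so a bad annotation cannot block cleanup.
    """
    candidate = SERVICES_PREFIX + annotated_version
    if candidate in existing_configmaps:
        return candidate
    return SERVICES_PREFIX + expected_version
```

      With the values from this report (annotation invalidVersion, expected version 9.0.0-0ecb2e1, and only windows-services-9.0.0-0ecb2e1 present), the lookup would resolve to the existing ConfigMap instead of failing.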
      

      Additional info:

      This is a regression. This functionality was working in all previous versions.
      

              rh-ee-ssoto Sebastian Soto
              rhn-engineering-jfrancoa Jose Luis Franco Arza (Inactive)
              Aharon Rasouli Aharon Rasouli