  1. OpenShift Bugs
  2. OCPBUGS-35285

Version annotation removal results in unusable node

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • None
    • 4.17.0, 4.18
    • Windows Containers
    • None

      Description of problem:

      If the version annotation 'windowsmachineconfig.openshift.io/version' is removed from a node object, the node binaries (WICD, kubelet, etc.) may be stopped and not restarted.

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      Low %

      Steps to Reproduce:

      1. Remove the version annotation from a Windows node
      

      Actual results:

      The version annotation is re-added to the node, but the node is no longer ready and schedulable.
      
      WMCO logs contain a cleanup failure, and further reconciliations do not fix the node state:
      
      2024-06-10T23:32:13Z	INFO	wc 10.0.19.87	failed to cleanup node	{"command": "C:\\k\\windows-instance-config-daemon.exe cleanup --kubeconfig C:\\k\\wicd-kubeconfig --namespace openshift-windows-machine-config-operator", "output": "I0610 23:28:38.536824    7952 cleanup.go:132] error getting services ConfigMap associated with version annotation, falling back to use latest services ConfigMap: node is missing version annotation\nI0610 23:32:13.630884    7952 cleanup.go:197] removed services: [\"csi-proxy\" \"hybrid-overlay-node\" \"windows_exporter\" \"kubelet\" \"containerd\"]\nF0610 23:32:13.630884    7952 cleanup.go:51] []error{(*fmt.wrapError)(0xc000020020)}\n"}
      2024-06-10T23:32:13Z	ERROR	Reconciler error	{"controller": "machine", "controllerGroup": "machine.openshift.io", "controllerKind": "Machine", "Machine": {"name":"ci-op-igcd4qjs-9c393-ndx27-e2e-wm-b4vnt","namespace":"openshift-machine-api"}, "namespace": "openshift-machine-api", "name": "ci-op-igcd4qjs-9c393-ndx27-e2e-wm-b4vnt", "reconcileID": "7ecf387a-befa-48c0-90cf-739f28a2be7d", "error": "unable to configure instance i-057fb07bfc7ed073e: bootstrapping the Windows instance failed: unable to cleanup the Windows instance: error running powershell.exe -NonInteractive -ExecutionPolicy Bypass \"C:\\k\\windows-instance-config-daemon.exe cleanup --kubeconfig C:\\k\\wicd-kubeconfig --namespace openshift-windows-machine-config-operator\": Process exited with status 1"}
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
      	/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:324
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
      	/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:261
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
      	/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:222
      2024-06-10T23:32:13Z	DEBUG	controller.windowsmachine	reconciling	{"windowsmachine": {"name":"ci-op-igcd4qjs-9c393-ndx27-e2e-wm-b4vnt","namespace":"openshift-machine-api"}}
      2024-06-10T23:32:13Z	DEBUG	events	Machine ci-op-igcd4qjs-9c393-ndx27-e2e-wm-b4vnt configuration failure	{"type": "Warning", "object": {"kind":"Machine","namespace":"openshift-machine-api","name":"ci-op-igcd4qjs-9c393-ndx27-e2e-wm-b4vnt","uid":"91c21fe8-154d-40ee-bdba-a34b2235a316","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"49707"}, "reason": "MachineSetupFailure"}
      2024-06-10T23:32:28Z	DEBUG	controller.windowsmachine	reconciling	{"windowsmachine": {"name":"ci-op-igcd4qjs-9c393-ndx27-e2e-wm-b4vnt","namespace":"openshift-machine-api"}}
      2024-06-10T23:32:58Z	DEBUG	controller.windowsmachine	reconciling	{"windowsmachine": {"name":"ci-op-igcd4qjs-9c393-ndx27-e2e-wm-b4vnt","namespace":"openshift-machine-api"}}

      Expected results:

      The version annotation is re-added to the node, and the node maintains functionality.

      Additional info:

      Potential cause:
      - User removes version annotation
      - WMCO decides node is not up to date, and tries to configure it
      - WICD decides node is up to date, and re-applies the version annotation
      - WMCO stops WICD
      - WMCO runs WICD cleanup
      - WICD cleanup fails, resulting in all node binaries being stopped
      - WMCO restarts reconciliation after backoff time
      - WMCO sees that the version annotation is correct, and decides the node does not need to be configured
      - Node binaries remain stopped
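
      The suspected sequence above can be replayed with a small model. This is only a sketch under stated assumptions, not WMCO or WICD code: the `node` type, `needsConfiguration`, `simulate`, and the version value "v1" are hypothetical names introduced for illustration, and the model assumes the operator decides whether to configure a node purely by comparing the version annotation.

```go
package main

import "fmt"

// versionAnnotation is the annotation key named in this report.
const versionAnnotation = "windowsmachineconfig.openshift.io/version"

// node is a hypothetical, simplified stand-in for a Windows node.
// It is not a type from WMCO or Kubernetes.
type node struct {
	annotations     map[string]string
	servicesRunning bool
}

// needsConfiguration models the ASSUMED operator check: the node is
// reconfigured only when the version annotation differs from the wanted value.
func needsConfiguration(n *node, want string) bool {
	return n.annotations[versionAnnotation] != want
}

// simulate replays the suspected sequence and returns the decision made
// at each of the two reconciliations, plus the final node state.
func simulate() (first, second bool, n *node) {
	const want = "v1" // hypothetical version value

	n = &node{
		annotations:     map[string]string{versionAnnotation: want},
		servicesRunning: true,
	}

	// The user removes the version annotation.
	delete(n.annotations, versionAnnotation)

	// First reconciliation: the annotation is missing, so the node
	// is judged out of date.
	first = needsConfiguration(n, want)

	// Meanwhile the on-node daemon considers the node up to date
	// and re-applies the annotation.
	n.annotations[versionAnnotation] = want

	// The operator stops the daemon and runs cleanup. Cleanup removes
	// the services and then fails, so nothing is started again.
	n.servicesRunning = false

	// Second reconciliation after backoff: the annotation now matches,
	// so the node is judged up to date.
	second = needsConfiguration(n, want)
	return first, second, n
}

func main() {
	first, second, n := simulate()
	fmt.Printf("first reconcile configures node:  %t\n", first)
	fmt.Printf("second reconcile configures node: %t\n", second)
	fmt.Printf("services running:                 %t\n", n.servicesRunning)
}
```

      In this model the second reconciliation skips configuration while the services are still stopped, which matches the state described under Actual results. Whether the real operator makes its decision this way is the hypothesis of this report, not something the sketch confirms.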

              jvaldes@redhat.com Jose Valdes
              rh-ee-ssoto Sebastian Soto
              Aharon Rasouli Aharon Rasouli