OpenShift Bugs / OCPBUGS-14700

BYOH node failed to upgrade: Cannot remove item C:\\var\\log\\containerd\\containerd.log: The process cannot access the file \r\n'containerd.log' because it is being used by another process

    • Priority: Critical
    • Sprint: WINC - Sprint 238
    • Release Note Text: Fixes an issue that prevented Windows nodes from being deconfigured because the containerd log file could not be removed. containerd is now stopped properly before the log file is removed.
    • Release Note Type: Bug Fix

      Description of problem:

      Upgrading a BYOH node fails; after the upgrade the node remains in NotReady,SchedulingDisabled:
      
      {"level":"error","ts":"2023-06-07T16:52:46Z","msg":"Reconciler error","controller":"configmap","controllerGroup":"","controllerKind":"ConfigMap","ConfigMap":{"name":"windows-instances","namespace":"openshift-windows-machine-config-operator"},"namespace":"openshift-windows-machine-config-operator","name":"windows-instances","reconcileID":"8e6cc51d-9fd4-4e44-b39e-6b6d678c6422","error":"error configuring host with address 10.0.128.7: error deconfiguring instance: unable to remove created directories: unable to remove directory C:\\var\\log, out: Remove-Item : Cannot remove item C:\\var\\log\\containerd\\containerd.log: The process cannot access the file \r\n'containerd.log' because it is being used by another process.\r\nAt line:1 char:27\r\n+ if(Test-Path C:\\var\\log) {Remove-Item -Recurse -Force C:\\var\\log}\r\n+                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\r\n    + CategoryInfo          : WriteError: (containerd.log:FileInfo) [Remove-Item], IOException\r\n    + FullyQualifiedErrorId : RemoveFileSystemItemIOError,Microsoft.PowerShell.Commands.RemoveItemCommand\r\nRemove-Item : Cannot remove item C:\\var\\log\\containerd: The directory is not empty.\r\nAt line:1 char:27\r\n+ if(Test-Path C:\\var\\log) {Remove-Item -Recurse -Force C:\\var\\log}\r\n+                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\r\n    + CategoryInfo          : WriteError: (containerd:DirectoryInfo) [Remove-Item], IOException\r\n    + FullyQualifiedErrorId : RemoveFileSystemItemIOError,Microsoft.PowerShell.Commands.RemoveItemCommand\r\nRemove-Item : Cannot remove item C:\\var\\log: The directory is not empty.\r\nAt line:1 char:27\r\n+ if(Test-Path C:\\var\\log) {Remove-Item -Recurse -Force C:\\var\\log}\r\n+                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\r\n    + CategoryInfo          : WriteError: (C:\\var\\log:DirectoryInfo) [Remove-Item], IOException\r\n    + FullyQualifiedErrorId : RemoveFileSystemItemIOError,Microsoft.PowerShell.Commands.RemoveItemCommand\r\n, err: error running powershell.exe -NonInteractive -ExecutionPolicy Bypass \"if(Test-Path C:\\var\\log) {Remove-Item -Recurse -Force C:\\var\\log}\": Process exited with status 1","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235"}

      Version-Release number of selected component (if applicable):

      Upgrading from:
      windows-services-7.0.1-bc9473b         2      6h13m
      To:
      windows-services-8.0.1-01a3618         2      53m

      How reproducible:

      Most likely reproducible; the same upgrade passed on AWS.

      Steps to Reproduce:

      1. Install a BYOH Windows Server 2022 node on GCP (not via MachineSet)
      2. Perform the upgrade from 4.12 (windows-services 7.0.1-bc9473b) to 4.13 (windows-services 8.0.1-01a3618)
      3. Wait until the upgrade completes (a monitoring sketch follows below)
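
      One way to monitor step 3 from the cluster side (a sketch; the namespace is WMCO's default and the node label is the standard kubernetes.io/os label):

      # Watch for the new windows-services-<version> ConfigMap to appear and for
      # the Windows nodes to return to Ready after the upgrade.
      oc get cm -n openshift-windows-machine-config-operator -w
      oc get nodes -l kubernetes.io/os=windows -w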
      

      Actual results:

      The MachineSet-created Windows nodes get upgraded, but the BYOH node is stuck in NotReady,SchedulingDisabled.

      Expected results:

      Nodes should be Ready after the upgrade, with the correct kubelet version.

      Additional info:

       oc get nodes -owide
      NAME                                                        STATUS                        ROLES                  AGE     VERSION                       INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
      mgcp-byoh-0.c.openshift-qe.internal                         NotReady,SchedulingDisabled   worker                 4h25m   v1.25.0-2653+a34b9e9499e6c3   10.0.128.7    <none>        Windows Server 2022 Datacenter                                 10.0.20348.1726                containerd://1.19
      rrasouli-397-x7hdb-master-0.c.openshift-qe.internal         Ready                         control-plane,master   6h55m   v1.26.5+7a891f0               10.0.0.3      <none>        Red Hat Enterprise Linux CoreOS 413.92.202306010245-0 (Plow)   5.14.0-284.16.1.el9_2.x86_64   cri-o://1.26.3-8.rhaos4.13.gitec064c9.el9
      rrasouli-397-x7hdb-master-1.c.openshift-qe.internal         Ready                         control-plane,master   6h56m   v1.26.5+7a891f0               10.0.0.5      <none>        Red Hat Enterprise Linux CoreOS 413.92.202306010245-0 (Plow)   5.14.0-284.16.1.el9_2.x86_64   cri-o://1.26.3-8.rhaos4.13.gitec064c9.el9
      rrasouli-397-x7hdb-master-2.c.openshift-qe.internal         Ready                         control-plane,master   6h54m   v1.26.5+7a891f0               10.0.0.4      <none>        Red Hat Enterprise Linux CoreOS 413.92.202306010245-0 (Plow)   5.14.0-284.16.1.el9_2.x86_64   cri-o://1.26.3-8.rhaos4.13.gitec064c9.el9
      rrasouli-397-x7hdb-worker-a-5872s.c.openshift-qe.internal   Ready                         worker                 6h44m   v1.26.5+7a891f0               10.0.128.3    <none>        Red Hat Enterprise Linux CoreOS 413.92.202306010245-0 (Plow)   5.14.0-284.16.1.el9_2.x86_64   cri-o://1.26.3-8.rhaos4.13.gitec064c9.el9
      rrasouli-397-x7hdb-worker-b-fsc8d.c.openshift-qe.internal   Ready                         worker                 6h44m   v1.26.5+7a891f0               10.0.128.2    <none>        Red Hat Enterprise Linux CoreOS 413.92.202306010245-0 (Plow)   5.14.0-284.16.1.el9_2.x86_64   cri-o://1.26.3-8.rhaos4.13.gitec064c9.el9


            OpenShift Jira Automation Bot added a comment -

            Per the announcement sent regarding the removal of "Blocker" as an option in the Priority field, this issue (which was already closed at the time of the bulk update) had Priority = "Blocker." It is being updated to Priority = Critical. No additional fields were changed.

            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Important: Red Hat OpenShift for Windows Containers 9.0.0 security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2023:7515


            Jose Luis Franco Arza added a comment - edited

            Verified by upgrading from 4.13 to 4.14:

            $ oc get cm -n openshift-windows-machine-config-operator 
            NAME                                   DATA   AGE
            kube-root-ca.crt                       1      3h21m
            openshift-service-ca.crt               1      3h21m
            windows-instances                      2      137m
            windows-machine-config-operator-lock   0      15m
            windows-services-9.0.0-f079f3d         2      15m
            
            $ oc get clusterversion
            NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
            version   4.14.0-0.nightly-2023-06-27-082521   True        False         33m     Cluster version is 4.14.0-0.nightly-2023-06-27-082521
            

            All windows nodes (BYOH and non BYOH) upgraded successfully:

            $ oc get nodes
            NAME                                                         STATUS   ROLES                  AGE     VERSION
            byoh-winc-0.c.openshift-qe.internal                          Ready    worker                 124m    v1.27.2+15f19ea
            byoh-winc-1.c.openshift-qe.internal                          Ready    worker                 129m    v1.27.2+15f19ea
            jfrancoa-2706-v2ttp-master-0.c.openshift-qe.internal         Ready    control-plane,master   3h53m   v1.27.3+cb4b47e
            jfrancoa-2706-v2ttp-master-1.c.openshift-qe.internal         Ready    control-plane,master   3h53m   v1.27.3+cb4b47e
            jfrancoa-2706-v2ttp-master-2.c.openshift-qe.internal         Ready    control-plane,master   3h53m   v1.27.3+cb4b47e
            jfrancoa-2706-v2ttp-windows-worker-a-89drp                   Ready    worker                 3h8m    v1.27.2+15f19ea
            jfrancoa-2706-v2ttp-windows-worker-a-8bjq6                   Ready    worker                 3h11m   v1.27.2+15f19ea
            jfrancoa-2706-v2ttp-worker-a-4lctn.c.openshift-qe.internal   Ready    worker                 3h42m   v1.27.3+cb4b47e
            jfrancoa-2706-v2ttp-worker-b-ftkdz.c.openshift-qe.internal   Ready    worker                 3h42m   v1.27.3+cb4b47e
            


            GitLab CEE Bot added a comment -

            CPaaS Service Account mentioned this issue in a merge request of openshift-winc-midstream / openshift-winc-midstream on branch rhaos-4.14-rhel-9_upstream_57c86a95a1394b0788ffe6f791b97365:

            Updated US source to: f079f3d Merge pull request #1654 from sebsoto/betterUpgradeFix


            Aravindh Puthiyaparambil added a comment -

            A correction to the above comment.

            The desired version staying the same has nothing to do with containerd being installed by WMCO and then being removed by WICD during an upgrade. That is an orthogonal issue.

            The issue that we are seeing in this bug is caused by containerd being installed by WMCO and then being removed by WICD upon upgrade and nothing else.

            One approach to fix the issue could potentially be to fix the desired version before deconfiguration. But that is still being discussed.


            Skyler Clark added a comment -

            After testing the upgrade manually, we found that on versions where containerd was installed by WMCO, upgrading to a version where containerd is removed by WICD causes the desired version to stay the same as the current version, which causes any upgrades to fail. We're currently working on a fix.
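
            A hedged way to observe that state from the cluster (a sketch; the deployment name is assumed to be WMCO's usual default, and the matched log line appears verbatim further down in this bug):

            # Tail WMCO's logs and filter for the upgrade decision; if the desired
            # version never diverges from the current one, no upgrade is triggered.
            oc logs -n openshift-windows-machine-config-operator deployment/windows-machine-config-operator -f |
                Select-String 'requires upgrade|expected version'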

            Skyler Clark added a comment -

            We're currently reverting this change in order to get the release working.

            Jose Luis Franco Arza added a comment -

            Confirmed. WMCO 7.1.0 includes the containerd service, while 7.0.1 does not:

            $ oc get cm -n openshift-windows-machine-config-operator  windows-services-7.1.0-d4ffecd -o yaml
            apiVersion: v1
            data:
              files: '[]'
              services: '[{"name":"containerd","path":"C:\\k\\containerd\\containerd.exe --config
                C:\\k\\containerd\\containerd_conf.toml --log-file C:\\var\\log\\containerd\\containerd.log
                --run-service --log-level info","powershellPreScripts":[{"path":"C:\\Temp\\windows-defender-exclusion.ps1
                -BinPath C:\\k\\containerd\\containerd.exe"}],"bootstrap":true,"priority":0},{"name":"kubelet","path":"C:\\k\\kubelet.exe
                --config=C:\\k\\kubelet.conf --bootstrap-kubeconfig=C:\\k\\bootstrap-kubeconfig
                --kubeconfig=C:\\k\\kubeconfig --cert-dir=c:\\var\\lib\\kubelet\\pki\\ --windows-service
                --logtostderr=false --log-file=C:\\var\\log\\kubelet\\kubelet.log --register-with-taints=os=Windows:NoSchedule
                --node-labels=node.openshift.io/os_id=Windows --container-runtime=remote --container-runtime-endpoint=npipe://./pipe/containerd-containerd
                --resolv-conf= --windows-priorityclass=ABOVE_NORMAL_PRIORITY_CLASS --v=2 --cloud-provider=azure
                --cloud-config=C:\\k\\cloud.conf","dependencies":["containerd"],"bootstrap":true,"priority":1},{"name":"windows_exporter","path":"C:\\k\\windows_exporter.exe
                --collectors.enabled cpu,cs,logical_disk,net,os,service,system,textfile,container,memory,cpu_info","bootstrap":false,"priority":2},{"name":"hybrid-overlay-node","path":"C:\\k\\hybrid-overlay-node.exe
                --node NODE_NAME --k8s-kubeconfig C:\\k\\kubeconfig --windows-service --logfile
                C:\\var\\log\\hybrid-overlay\\hybrid-overlay.log","nodeVariablesInCommand":[{"name":"NODE_NAME","nodeObjectJsonPath":"{.metadata.name}"}],"dependencies":["kubelet"],"bootstrap":false,"priority":2},{"name":"kube-proxy","path":"C:\\k\\kube-proxy.exe
                --windows-service --proxy-mode=kernelspace --feature-gates=WinOverlay=true --hostname-override=NODE_NAME
                --kubeconfig=C:\\k\\kubeconfig --cluster-cidr=NODE_SUBNET --log-dir=C:\\var\\log\\kube-proxy
                --logtostderr=false --network-name=OVNKubernetesHybridOverlayNetwork --source-vip=ENDPOINT_IP
                --enable-dsr=false --v=2","nodeVariablesInCommand":[{"name":"NODE_NAME","nodeObjectJsonPath":"{.metadata.name}"},{"name":"NODE_SUBNET","nodeObjectJsonPath":"{.metadata.annotations.k8s\\.ovn\\.org/hybrid-overlay-node-subnet}"}],"powershellPreScripts":[{"variableName":"ENDPOINT_IP","path":"C:\\Temp\\network-conf.ps1"}],"dependencies":["hybrid-overlay-node"],"bootstrap":false,"priority":3}]'
            immutable: true
            kind: ConfigMap
            metadata:
              creationTimestamp: "2023-06-09T11:51:44Z"
              name: windows-services-7.1.0-d4ffecd
              namespace: openshift-windows-machine-config-operator
              resourceVersion: "37141"
              uid: 5cc96bc9-72e9-402b-9d5b-e844cb9e0b5b
            
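            The difference is quicker to see by extracting only the service names from each ConfigMap. A sketch run from PowerShell against the two ConfigMap names above:

            # Print the service names recorded in each windows-services ConfigMap;
            # only the 7.1.0 entry contains "containerd".
            $ns = 'openshift-windows-machine-config-operator'
            foreach ($cm in 'windows-services-7.0.1-bc9473b', 'windows-services-7.1.0-d4ffecd') {
                $services = (oc get cm $cm -n $ns -o jsonpath='{.data.services}') -join ' '
                "{0}: {1}" -f $cm, ((($services | ConvertFrom-Json).name) -join ', ')
            }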

            I could also confirm that the upgrade from 7.0.1 to 8.0.1 affects platform:none as well:

            [jfrancoa@localhost byoh-auto]$ oc get nodes
            NAME             STATUS                        ROLES                  AGE     VERSION
            byoh-winc-0      NotReady,SchedulingDisabled   worker                 87m     v1.25.0-2653+a34b9e9499e6c3
            byoh-winc-1      Ready                         worker                 90m     v1.25.0-2653+a34b9e9499e6c3
            ip-10-0-51-165   Ready                         control-plane,master   4h31m   v1.26.5+0001a21
            ip-10-0-56-78    Ready                         worker                 4h19m   v1.26.5+0001a21
            ip-10-0-58-100   Ready                         control-plane,master   4h32m   v1.26.5+0001a21
            ip-10-0-62-227   Ready                         worker                 4h17m   v1.26.5+0001a21
            ip-10-0-65-98    Ready                         worker                 4h19m   v1.26.5+0001a21
            ip-10-0-72-218   Ready                         control-plane,master   4h31m   v1.26.5+0001a21
            [jfrancoa@localhost byoh-auto]$ oc get cm -n openshift-windows-machine-config-operator 
            NAME                                   DATA   AGE
            kube-root-ca.crt                       1      4h5m
            openshift-service-ca.crt               1      4h5m
            windows-instances                      2      96m
            windows-machine-config-operator-lock   0      5m51s
            windows-services-7.0.1-bc9473b         2      4h4m
            windows-services-8.0.1-01a3618         2      5m47s
            
            {"level":"info","ts":"2023-06-09T15:43:41Z","logger":"controllers.configmap","msg":"instance requires upgrade","node":"by
            oh-winc-0","version":"7.0.1-bc9473b","expected version":"8.0.1-01a3618"}
            {"level":"info","ts":"2023-06-09T15:43:51Z","logger":"wc 10.0.53.98","msg":"deconfiguring"}
            {"level":"info","ts":"2023-06-09T15:43:55Z","logger":"wc 10.0.53.98","msg":"removing HNS networks"}
            {"level":"info","ts":"2023-06-09T15:44:29Z","logger":"wc 10.0.53.98","msg":"removing directories"}
            {"level":"error","ts":"2023-06-09T15:44:31Z","logger":"wc 10.0.53.98","msg":"error running","cmd":"powershell.exe -NonInt
            eractive -ExecutionPolicy Bypass \"if(Test-Path C:\\var\\log) {Remove-Item -Recurse -Force C:\\var\\log}\"","out":"Remove
            -Item : Cannot remove item C:\\var\\log\\containerd\\containerd.log: The process cannot access the file \r\n'containerd.l
            og' because it is being used by another process.\r\nAt line:1 char:27\r\n+ if(Test-Path C:\\var\\log) {Remove-Item -Recur
            se -Force C:\\var\\log}\r\n+                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\r\n    + CategoryInfo      
                : WriteError: (containerd.log:FileInfo) [Remove-Item], IOException\r\n    + FullyQualifiedErrorId : RemoveFileSystemI
            temIOError,Microsoft.PowerShell.Commands.RemoveItemCommand\r\nRemove-Item : Cannot remove item C:\\var\\log\\containerd: 
            


            Jose Luis Franco Arza added a comment - edited

            Ah! Found it:

            2023-06-09T10:09:20Z    DEBUG   wc 10.0.128.6   file already exists on VM with expected content {"file": "C:\\k\\\\wicd-kubeconfig"}
            2023-06-09T10:09:21Z    DEBUG   wc 10.0.128.6   run {"cmd": "powershell.exe -NonInteractive -ExecutionPolicy Bypass \"C:\\k\\windows-instance-config-daemon.exe cleanup --kubeconfig C:\\k\\wicd-kubeconfig --namespace openshift-windows-machine-config-operator\"", "out": "I0609 10:09:21.039925    5936 cleanup.go:81] removing services tied to version: 7.0.1-bc9473b\nI0609 10:09:21.040638    5936 cleanup.go:119] removed services: [\"kube-proxy\" \"hybrid-overlay-node\" \"windows_exporter\" \"kubelet\"]\n"}
            2023-06-09T10:09:21Z    INFO    wc 10.0.128.6   removing HNS networks
            

            No sign of containerd when removing the services.

            This happens because containerd isn't present in the windows-services-7.0.1-bc9473b ConfigMap:

            apiVersion: v1
            data:
              files: '[]'
              services: '[{"name":"kubelet","path":"C:\\k\\kubelet.exe --config=C:\\k\\kubelet.conf
                --bootstrap-kubeconfig=C:\\k\\bootstrap-kubeconfig --kubeconfig=C:\\k\\kubeconfig
                --cert-dir=c:\\var\\lib\\kubelet\\pki\\ --windows-service --logtostderr=false
                --log-file=C:\\var\\log\\kubelet\\kubelet.log --register-with-taints=os=Windows:NoSchedule
                --node-labels=node.openshift.io/os_id=Windows --container-runtime=remote --container-runtime-endpoint=npipe://./pipe/containerd-containerd
                --resolv-conf= --v=4 --cloud-provider=gce --cloud-config=C:\\k\\cloud.conf --hostname-override=HOSTNAME_OVERRIDE","powershellPreScripts":[{"variableName":"HOSTNAME_OVERRIDE","path":"C:\\Temp\\gcp-get-hostname.ps1"}],"dependencies":["containerd"],"bootstrap":true,"priority":0},{"name":"windows_exporter","path":"C:\\k\\windows_exporter.exe
                --collectors.enabled cpu,cs,logical_disk,net,os,service,system,textfile,container,memory,cpu_info","bootstrap":false,"priority":1},{"name":"hybrid-overlay-node","path":"C:\\k\\hybrid-overlay-node.exe
                --node NODE_NAME --k8s-kubeconfig C:\\k\\kubeconfig --windows-service --logfile
                C:\\var\\log\\hybrid-overlay\\hybrid-overlay.log --loglevel 5","nodeVariablesInCommand":[{"name":"NODE_NAME","nodeObjectJsonPath":"{.metadata.name}"}],"dependencies":["kubelet"],"bootstrap":false,"priority":1},{"name":"kube-proxy","path":"C:\\k\\kube-proxy.exe
                --windows-service --proxy-mode=kernelspace --feature-gates=WinOverlay=true --hostname-override=NODE_NAME
                --kubeconfig=C:\\k\\kubeconfig --cluster-cidr=NODE_SUBNET --log-dir=C:\\var\\log\\kube-proxy\\
                --logtostderr=false --network-name=OVNKubernetesHybridOverlayNetwork --source-vip=ENDPOINT_IP
                --enable-dsr=false --v=4","nodeVariablesInCommand":[{"name":"NODE_NAME","nodeObjectJsonPath":"{.metadata.name}"},{"name":"NODE_SUBNET","nodeObjectJsonPath":"{.metadata.annotations.k8s\\.ovn\\.org/hybrid-overlay-node-subnet}"}],"powershellPreScripts":[{"variableName":"ENDPOINT_IP","path":"C:\\Temp\\network-conf.ps1"}],"dependencies":["hybrid-overlay-node"],"bootstrap":false,"priority":2}]'
            immutable: true
            kind: ConfigMap
            metadata:
              creationTimestamp: "2023-06-09T09:06:58Z"
              name: windows-services-7.0.1-bc9473b
              namespace: openshift-windows-machine-config-operator
              resourceVersion: "62176"
              uid: 3d7f532a-4d67-4e45-9804-5248ee3e7942
            
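            A node-side check makes the gap visible (a sketch; service names taken from the cleanup log above):

            # containerd is still installed and running even though WICD's cleanup
            # list, built from the 7.0.1 ConfigMap, omits it.
            Get-Service -Name containerd, kubelet, hybrid-overlay-node, kube-proxy, windows_exporter -ErrorAction SilentlyContinue |
                Format-Table Name, Status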


            Jose Luis Franco Arza added a comment -

            It looks like the problem isn't related to the Terraform BYOH nodes after all. I managed to reproduce it twice on a GCP IPI cluster using 4.13.2 and WMCO 8.0.1 with MachineSet BYOH nodes. The failure occurred on both MachineSet BYOH nodes:

            $ oc get machineset -n openshift-machine-api
            NAME                                   DESIRED   CURRENT   READY   AVAILABLE   AGE
            jfrancoa-0906-b877b-windows-byoh-a     2         2         1       1           110m
            jfrancoa-0906-b877b-windows-worker-a   2         2         2       2           131m
            jfrancoa-0906-b877b-worker-a           1         1         1       1           166m
            jfrancoa-0906-b877b-worker-b           1         1         1       1           166m
            jfrancoa-0906-b877b-worker-c           0         0                             166m
            jfrancoa-0906-b877b-worker-f           0         0                             166m
            
            $ oc get machine -n openshift-machine-api
            NAME                                         PHASE     TYPE            REGION        ZONE            AGE
            jfrancoa-0906-b877b-master-0                 Running   n2-standard-4   us-central1   us-central1-a   166m
            jfrancoa-0906-b877b-master-1                 Running   n2-standard-4   us-central1   us-central1-b   166m
            jfrancoa-0906-b877b-master-2                 Running   n2-standard-4   us-central1   us-central1-c   166m
            jfrancoa-0906-b877b-windows-byoh-a-8zl2k     Running   n1-standard-4   us-central1   us-central1-a   110m
            jfrancoa-0906-b877b-windows-byoh-a-j8d94     Running   n1-standard-4   us-central1   us-central1-a   110m
            jfrancoa-0906-b877b-windows-worker-a-9hwb7   Running   n1-standard-4   us-central1   us-central1-a   18m
            jfrancoa-0906-b877b-windows-worker-a-prxll   Running   n1-standard-4   us-central1   us-central1-a   8m56s
            jfrancoa-0906-b877b-worker-a-hdxd4           Running   n2-standard-4   us-central1   us-central1-a   163m
            jfrancoa-0906-b877b-worker-b-bcvfq           Running   n2-standard-4   us-central1   us-central1-b   163m
            
            $ oc get cm windows-instances -n openshift-windows-machine-config-operator -o yaml
            apiVersion: v1
            data:
              10.0.128.6: username=Administrator
              10.0.128.7: username=Administrator
            kind: ConfigMap
            
            

            Right after WMCO 8.0.1 was installed, one of the BYOH nodes went into NotReady,SchedulingDisabled and the log about failing to delete containerd.log appeared:

            [jfrancoa@localhost byoh-auto]$ oc edit catalogsource wmco -n openshift-marketplace                                                                                                                                                                                    
            catalogsource.operators.coreos.com/wmco edited     
            
            [jfrancoa@localhost byoh-auto]$ oc get nodes                      
            NAME                                                         STATUS                        ROLES                  AGE    VERSION
            jfrancoa-0906-b877b-master-0.c.openshift-qe.internal         Ready                         control-plane,master   149m   v1.26.5+0001a21
            jfrancoa-0906-b877b-master-1.c.openshift-qe.internal         Ready                         control-plane,master   150m   v1.26.5+0001a21
            jfrancoa-0906-b877b-master-2.c.openshift-qe.internal         Ready                         control-plane,master   150m   v1.26.5+0001a21
            jfrancoa-0906-b877b-windows-byoh-a-8zl2k                     Ready                         worker                 87m    v1.25.0-2653+a34b9e9499e6c3
            jfrancoa-0906-b877b-windows-byoh-a-j8d94                     NotReady,SchedulingDisabled   worker                 85m    v1.25.0-2653+a34b9e9499e6c3
            jfrancoa-0906-b877b-windows-worker-a-rjhgf                   Ready                         worker                 107m   v1.25.0-2653+a34b9e9499e6c3
            jfrancoa-0906-b877b-worker-a-hdxd4.c.openshift-qe.internal   Ready                         worker                 140m   v1.26.5+0001a21
            jfrancoa-0906-b877b-worker-b-bcvfq.c.openshift-qe.internal   Ready                         worker                 140m   v1.26.5+0001a21
            
            

            After a few minutes, the second BYOH node entered the same state:

            [jfrancoa@localhost byoh-auto]$ oc get nodes                      
            NAME                                                         STATUS                        ROLES                  AGE    VERSION
            jfrancoa-0906-b877b-master-0.c.openshift-qe.internal         Ready                         control-plane,master   154m   v1.26.5+0001a21
            jfrancoa-0906-b877b-master-1.c.openshift-qe.internal         Ready                         control-plane,master   155m   v1.26.5+0001a21
            jfrancoa-0906-b877b-master-2.c.openshift-qe.internal         Ready                         control-plane,master   155m   v1.26.5+0001a21
            jfrancoa-0906-b877b-windows-byoh-a-8zl2k                     NotReady,SchedulingDisabled   worker                 92m    v1.25.0-2653+a34b9e9499e6c3
            jfrancoa-0906-b877b-windows-byoh-a-j8d94                     NotReady,SchedulingDisabled   worker                 90m    v1.25.0-2653+a34b9e9499e6c3
            jfrancoa-0906-b877b-windows-worker-a-rjhgf                   Ready                         worker                 112m   v1.25.0-2653+a34b9e9499e6c3
            jfrancoa-0906-b877b-worker-a-hdxd4.c.openshift-qe.internal   Ready                         worker                 145m   v1.26.5+0001a21
            jfrancoa-0906-b877b-worker-b-bcvfq.c.openshift-qe.internal   Ready                         worker                 145m   v1.26.5+0001a21
            

            Then I jumped into the first failed node, jfrancoa-0906-b877b-windows-byoh-a-j8d94 (10.0.128.7), and stopped the containerd service:

            PS C:\var\log> sc.exe stop containerd                            
                                                                             
            SERVICE_NAME: containerd                                         
                    TYPE               : 10  WIN32_OWN_PROCESS                                                                                 
                    STATE              : 3  STOP_PENDING                                                                                                                                                                                                                           
                                            (NOT_STOPPABLE, NOT_PAUSABLE, IGNORES_SHUTDOWN)                                                                                                                                                                                        
                    WIN32_EXIT_CODE    : 0  (0x0)                            
                    SERVICE_EXIT_CODE  : 0  (0x0)                      
                    CHECKPOINT         : 0x0                           
                    WAIT_HINT          : 0x0                           
            
            
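            For reference, a PowerShell equivalent of the sc.exe call above (a sketch; unlike "sc.exe stop", which returns while the service is still STOP_PENDING, Stop-Service blocks until it reaches Stopped):

            # Same effect as "sc.exe stop containerd", but waits for completion.
            Stop-Service -Name containerd -Force
            (Get-Service -Name containerd).Status   # should report Stopped
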

            After doing that, the node could progress and the upgrade succeeded for that BYOH node:

            [jfrancoa@localhost byoh-auto]$ oc get nodes
            NAME                                                         STATUS                        ROLES                  AGE     VERSION
            jfrancoa-0906-b877b-master-0.c.openshift-qe.internal         Ready                         control-plane,master   165m    v1.26.5+0001a21
            jfrancoa-0906-b877b-master-1.c.openshift-qe.internal         Ready                         control-plane,master   166m    v1.26.5+0001a21
            jfrancoa-0906-b877b-master-2.c.openshift-qe.internal         Ready                         control-plane,master   166m    v1.26.5+0001a21
            jfrancoa-0906-b877b-windows-byoh-a-8zl2k                     NotReady,SchedulingDisabled   worker                 103m    v1.25.0-2653+a34b9e9499e6c3
            jfrancoa-0906-b877b-windows-byoh-a-j8d94                     Ready                         worker                 100m    v1.26.3+b404935
            jfrancoa-0906-b877b-windows-worker-a-9hwb7                   Ready                         worker                 9m46s   v1.26.3+b404935
            jfrancoa-0906-b877b-windows-worker-a-prxll                   Ready                         worker                 99s     v1.26.3+b404935
            jfrancoa-0906-b877b-worker-a-hdxd4.c.openshift-qe.internal   Ready                         worker                 156m    v1.26.5+0001a21
            jfrancoa-0906-b877b-worker-b-bcvfq.c.openshift-qe.internal   Ready                         worker                 156m    v1.26.5+0001a21
            

            However, on the other node (jfrancoa-0906-b877b-windows-byoh-a-8zl2k), where I didn't stop containerd, the error keeps occurring and the upgrade hasn't gone through yet. As you can see, both windows-services ConfigMaps are still present:

            $ oc get cm -n openshift-windows-machine-config-operator 
            NAME                                   DATA   AGE
            kube-root-ca.crt                       1      139m
            openshift-service-ca.crt               1      139m
            windows-instances                      2      115m
            windows-machine-config-operator-lock   0      20m
            windows-services-7.0.1-bc9473b         2      77m
            windows-services-8.0.1-01a3618         2      20m
            

            I enabled debugLogging right after the CSV changed during the upgrade and captured the WMCO debug logs. Attaching them:
            wmco.log


              Assignee: Sebastian Soto
              Reporter: Aharon Rasouli
              Votes: 0
              Watchers: 9