Bug
Resolution: Not a Bug
Normal
4.21
Important
-
None
-
None
-
None
-
None
-
None
-
None
-
None
-
None
-
None
-
None
-
None
-
None
Description of problem:
WMCO 10.21.0 fails to upgrade BYOH (Bring Your Own Host) Windows nodes smoothly during an OCP 4.20 to 4.21 cluster upgrade. The upgrade encounters multiple errors, including file permission errors, concurrent upgrade limit errors, and SFTP transfer failures. As a result, the BYOH node upgrade takes 30-40 minutes instead of the expected 5-10 minutes, which causes test timeouts and a poor user experience. The issue is specific to BYOH nodes managed via ConfigMap; Machine-managed Windows nodes upgrade normally without these errors.
Version-Release number of selected component (if applicable):
- WMCO: 10.21.0
- OCP: 4.21.0-0.nightly-2026-01-22-192129
- Platform: Azure IPI
- Windows OS: Windows Server 2022 Datacenter (10.0.20348.4648)
- Kubernetes: v1.33.3 → v1.34.2
How reproducible:
100% - Reproduced in 3 independent environments:
- Environment 367213 (weinliu-5847) - 2026-01-26
- Environment 367328 (weinliu-5858) - 2026-01-27
- Environment 367448 (weinliu-5860) - 2026-01-28
Steps to Reproduce:
1. Deploy OCP 4.20 cluster on Azure IPI with WMCO 10.20.0, Windows 2022
2. Create 2 BYOH Windows nodes using Terraform and register via ConfigMap (windows-instances)
3. Verify both BYOH nodes are Ready
4. Upgrade OCP cluster from 4.20 to 4.21 (triggers WMCO upgrade from 10.20.0 to 10.21.0)
5. Monitor BYOH node upgrade process and WMCO events
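Steps 3 and 5 both come down to checking the Ready condition on the Windows nodes. A minimal sketch of that check, assuming `oc` is on PATH and logged in to the cluster; the helper names and the sample node `byoh-winc-1` are hypothetical (only `byoh-winc-0` appears in this report):

```python
import json
import subprocess


def not_ready_nodes(node_list_json: str) -> list[str]:
    """Return names of nodes whose Ready condition is not "True"."""
    not_ready = []
    for node in json.loads(node_list_json).get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            not_ready.append(node["metadata"]["name"])
    return not_ready


def fetch_windows_nodes() -> str:
    """Fetch the Windows node list as JSON (assumes a logged-in `oc`)."""
    result = subprocess.run(
        ["oc", "get", "nodes", "-l", "kubernetes.io/os=windows", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Hand-written sample in the shape of `oc get nodes -o json`, for illustration only.
SAMPLE = json.dumps({
    "items": [
        {
            "metadata": {"name": "byoh-winc-0"},
            "status": {"conditions": [{"type": "Ready", "status": "True"}]},
        },
        {
            "metadata": {"name": "byoh-winc-1"},  # hypothetical node name
            "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]},
        },
    ]
})

if __name__ == "__main__":
    print(not_ready_nodes(SAMPLE))
```

Against a live cluster, `not_ready_nodes(fetch_windows_nodes())` could be polled until it returns an empty list or the test timeout expires.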
Actual results:
1. Multiple InstanceSetupFailure errors occur during BYOH node upgrade:

   a) File Permission Error:
      error configuring host with address 10.0.128.9: unable to cleanup the Windows instance: error copying /payload/windows-instance-config-daemon.exe.tar.gz to C:\k: decompression failed with output windows-instance-config-daemon.exe: Can't unlink already-existing object: Permission denied tar.exe: Error exit delayed from previous errors

   b) Concurrent Upgrade Limit Error:
      error configuring host with address 10.0.128.9: Cannot mark node byoh-winc-0 as upgrading. Current number of upgrading nodes is (1). Max number of upgrading nodes is (0)

   c) SFTP Transfer Failure:
      error configuring host with address 10.0.128.10: unable to cleanup the Windows instance: error copying /payload/windows-instance-config-daemon.exe to C:\k: unable to transfer /payload/windows-instance-config-daemon.exe to remote dir C:\k: error initializing C:\k\windows-instance-config-daemon.exe file on Windows VM: sftp: "Failure" (SSH_FX_FAILURE)

   d) Timeout Error:
      error configuring host with address 10.0.128.9: timeout waiting for windowsmachineconfig.openshift.io/reboot-required to be cleared: timed out waiting for the condition

2. Timeline (Environment 367448):
   - 04:39: Test starts, cluster already at 4.21
   - 04:48: WMCO 10.21.0 upgrade begins
   - 04:48-05:00: Multiple InstanceSetupFailure errors
   - 05:00: Machine nodes upgraded (12 minutes - normal)
   - 05:00-05:20: Waiting for BYOH nodes to be Ready
   - 05:20: Test timeout after 20 minutes waiting for BYOH nodes
   - 05:20-05:22: Recovery process restores WMCO 10.20.0
   - Total test duration: 43m34s (FAILED)

3. Final Status:
   - BYOH nodes eventually become Ready but exceed 20-minute timeout
   - Test fails with: "Windows nodes are not ready after waiting up to 20m0s minutes"
   - Machine-managed nodes upgrade successfully without issues
   - Multiple retry attempts required for BYOH nodes
Expected results:
1. BYOH nodes should upgrade smoothly without permission, concurrency, or SFTP errors
2. BYOH node upgrade should complete within 5-10 minutes (similar to Machine nodes)
3. No file locking or permission issues when replacing windows-instance-config-daemon.exe
4. Concurrent upgrade limit (maxUnavailable) should be correctly calculated (not 0)
5. Test should pass within the 20-minute timeout window
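Item 3 implies an ordering: the running daemon has to be stopped before its executable is replaced, and a failed copy should be retried. The sketch below only illustrates that ordering. It is not WMCO code (WMCO is written in Go), and `stop_service` and `copy_file` are hypothetical callables standing in for whatever remote operations the operator performs:

```python
import time
from typing import Callable


def replace_with_retry(
    stop_service: Callable[[], None],
    copy_file: Callable[[], None],
    attempts: int = 3,
    delay_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Stop the service, then copy the file; retry both on OSError.

    Returns the attempt number that succeeded. Raises RuntimeError if
    every attempt fails.
    """
    last_error: OSError | None = None
    for attempt in range(1, attempts + 1):
        try:
            # Stop first on every attempt: the service may have been
            # restarted between attempts and be holding the file again.
            stop_service()
            copy_file()
            return attempt
        except OSError as err:  # PermissionError is a subclass of OSError
            last_error = err
            if attempt < attempts:
                sleep(delay_seconds)
    raise RuntimeError(f"file replacement failed after {attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the function testable without real delays; a caller would normally leave the default and choose a non-zero `delay_seconds`.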
Additional info:
Impact:
- Affected Component: BYOH node management in WMCO 10.21.0
- Affected Users: All users upgrading BYOH Windows nodes to WMCO 10.21.0
- Workaround: None known - requires WMCO code fix
- Machine Nodes: NOT affected - only BYOH nodes impacted

Test Evidence:
Test case: [sig-windows] Windows_Containers Author:rrasouli-NonPreRelease-Longduration-Critical-43832-[upgrade]-Seamless upgrade with BYOH Windows instances [Serial][Disruptive]
Failure logs available from 3 environments with consistent reproduction.

Suggested Fix:
In WMCO's BYOH node configuration logic:
1. Stop windows-instance-config-daemon.exe service before attempting file replacement
2. Fix maxUnavailable calculation to allow proper concurrent upgrades
3. Add retry logic with service stop for file copy operations
4. Increase timeout or optimize upgrade process to complete within a reasonable timeframe

LOG:
Timeline:
I0128 04:50:30.134420 machineset windows has 1 ready replicas, waiting for 2
I0128 04:51:30.127159 Waiting for machineset windows to have 2 ready replicas, currently has 1
I0128 04:52:30.122476 Waiting for machineset windows to have 2 ready replicas, currently has 1
I0128 04:53:30.099873 Waiting for machineset windows to have 2 ready replicas, currently has 1
I0128 04:54:30.108636 Waiting for machineset windows to have 2 ready replicas, currently has 1
I0128 04:55:30.102029 Waiting for machineset windows to have 2 ready replicas, currently has 1
I0128 04:56:30.135380 Waiting for machineset windows to have 2 ready replicas, currently has 1
I0128 04:57:30.132676 Waiting for machineset windows to have 2 ready replicas, currently has 1
I0128 04:58:30.123220 Waiting for machineset windows to have 2 ready replicas, currently has 1
I0128 04:59:30.144057 Waiting for machineset windows to have 2 ready replicas, currently has 1
I0128 05:00:30.165962 Waiting for machineset windows to have 2 ready replicas, currently has 2
I0128 05:00:30.166052 machineset windows has 2 ready replicas
I0128 05:00:30.166076 Waiting for 2 Windows nodes to be Ready after upgrade
I0128 05:20:30.825082 Upgrade failed - attempting recovery...

LOG and Must-gather: https://drive.google.com/drive/folders/1VLr-N6B5vjTNveNjz5OLImSUpUcfVK0g
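The two waits in the log can be measured directly from the timestamps. A small sketch, assuming each entry is on its own line in the `I<MMDD> <HH:MM:SS.ffffff> <message>` form shown above; the log omits the year, so 2026 is assumed from the environment dates in this report, and the function names are illustrative:

```python
import re
from datetime import datetime, timedelta

LINE = re.compile(r"^I(\d{2})(\d{2}) (\d{2}):(\d{2}):(\d{2})\.(\d{6}) (.*)$")


def parse(line: str, year: int = 2026) -> tuple[datetime, str]:
    """Split a log line into (timestamp, message). The year is an assumption."""
    match = LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    month, day, hour, minute, second, micro = (int(g) for g in match.groups()[:6])
    return datetime(year, month, day, hour, minute, second, micro), match.group(7)


def elapsed_between(lines: list[str], start_text: str, end_text: str) -> timedelta:
    """Time from the first line containing start_text to the next line containing end_text."""
    start = None
    for line in lines:
        stamp, message = parse(line)
        if start is None:
            if start_text in message:
                start = stamp
        elif end_text in message:
            return stamp - start
    raise ValueError("start or end marker not found")


# Selected lines copied from the log above.
LOG = [
    "I0128 04:50:30.134420 machineset windows has 1 ready replicas, waiting for 2",
    "I0128 05:00:30.166052 machineset windows has 2 ready replicas",
    "I0128 05:00:30.166076 Waiting for 2 Windows nodes to be Ready after upgrade",
    "I0128 05:20:30.825082 Upgrade failed - attempting recovery...",
]

if __name__ == "__main__":
    print(elapsed_between(LOG, "has 1 ready replicas", "has 2 ready replicas"))
    print(elapsed_between(LOG, "Waiting for 2 Windows nodes", "Upgrade failed"))
```

On these lines the machineset wait comes out at just over 10 minutes from the first logged entry, and the BYOH wait at just over 20 minutes, which matches the 20-minute test timeout.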