Bug
Resolution: Cannot Reproduce
Critical
4.20
Quality / Stability / Reliability
False
3
WINC - Sprint 276
1
Description of problem:
When adding a Bring-Your-Own-Host (BYOH) Windows node to an existing Azure cluster with MachineSet-managed Windows nodes, the Windows Machine Config Operator (WMCO) enters a continuous reconciliation loop. It fails to reconcile existing healthy Windows nodes, logging the error "unable to decrypt username annotation... invalid passphrase supplied". This appears to be caused by the cloud-private-key secret becoming unavailable, which prevents the operator from decrypting node data. Consequently, the new BYOH node is never configured, existing nodes may become unstable, and the windows-instances ConfigMap is removed.
Version-Release number of selected component (if applicable):
Cloud Provider: Azure
WMCO Version: 10.20.0-838f32f
BYOH Windows Version: Windows Server 2019
Platform Type: OpenShift Container Platform with existing Windows nodes managed by MachineSets
How reproducible:
100%
Steps to Reproduce:
1. Have a running OCP cluster on Azure with at least one healthy MachineSet-based Windows node.
2. Use the byoh.sh script and the accompanying Terraform files (main.tf, variables.tf, windows-vm-bootstrap.tf) from https://gitlab.cee.redhat.com/winc/byoh-auto to provision a new BYOH Windows Server 2019 virtual machine.
3. The script successfully applies the Terraform configuration and creates the windows-instances ConfigMap to trigger WMCO.
4. Observe the WMCO pod logs and the status of the Windows nodes (oc get nodes -l kubernetes.io/os=windows).
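For step 4, node readiness can be summarized instead of read by eye. The following is a minimal sketch, assuming a POSIX shell with awk; the function name count_ready_windows_nodes is hypothetical and not part of byoh.sh. It reads the output of oc get nodes with --no-headers on stdin and relies on STATUS being the second column.

```shell
# Hypothetical helper: summarize readiness of Windows nodes.
# Input (stdin): output of `oc get nodes ... --no-headers`,
# where column 1 is NAME and column 2 is STATUS.
count_ready_windows_nodes() {
  # Count "Ready" and "Ready,SchedulingDisabled" as ready; "NotReady" does not match.
  awk '$2 ~ /^Ready(,|$)/ { ready++ } END { printf "%d/%d Ready\n", ready, NR }'
}
```

Usage: oc get nodes -l kubernetes.io/os=windows --no-headers | count_ready_windows_nodes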
Actual results:
- WMCO begins logging continuous reconciliation errors for the existing, previously healthy Windows nodes, stating "unable to decrypt username annotation for node <node-name>: unable to decrypt message using given key: invalid passphrase supplied".
- The WMCO pod restarts.
- The log shows errors indicating that the cloud-private-key secret, which is required for decryption, cannot be found: "unable to create signer from private key secret: Secret \"cloud-private-key\" not found".
- The operator fails to configure the new BYOH node, with logs showing connection timeouts and file transfer failures such as "connection lost" and "sftp: \"Failure\" (SSH_FX_FAILURE)".
- The windows-instances ConfigMap, which triggers the BYOH process, is deleted.
- The original Windows nodes become unstable, with only one remaining in a Ready state.
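The error signatures listed above can be tallied from the operator log to see which failure dominates. The following is a minimal sketch, assuming a POSIX shell with grep; the function name classify_wmco_errors and the labels are hypothetical, and the patterns are substrings of the messages quoted in this report rather than an exhaustive list of WMCO errors.

```shell
# Hypothetical helper: count log lines matching each error signature
# quoted in this report. Reads log text on stdin and prints one
# "<label> <count>" line per signature, in the order listed below.
classify_wmco_errors() {
  local log label pattern count
  log="$(cat)"
  while IFS='|' read -r label pattern; do
    # grep -c prints 0 and exits non-zero when nothing matches; ignore that status.
    count="$(printf '%s\n' "$log" | grep -c -E -- "$pattern" || true)"
    printf '%s %s\n' "$label" "$count"
  done <<'EOF'
decrypt-failure|invalid passphrase supplied
secret-missing|cloud-private-key.*not found
sftp-failure|SSH_FX_FAILURE
connection-lost|connection lost
EOF
}
```

Usage, assuming the operator runs as a Deployment named windows-machine-config-operator: oc logs deployment/windows-machine-config-operator -n openshift-windows-machine-config-operator | classify_wmco_errors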
Expected results:
- The new BYOH Windows node should be successfully configured by WMCO and join the cluster in a Ready state.
- Existing Windows nodes should remain stable and in the Ready state without interruption.
- WMCO should not enter an error loop, and the cloud-private-key should remain accessible.
Additional info:
The root cause appears to be the loss or invalidation of the `cloud-private-key` secret within the `openshift-windows-machine-config-operator` namespace. The creation of the windows-instances ConfigMap correctly triggers reconciliation. However, WMCO is then unable to decrypt the annotations on existing nodes because it can't access its private key. This triggers a cascading failure where the operator cannot manage any of its nodes, new or old, and gets stuck in an error loop. The file transfer failures (sftp: "Failure") to the new node suggest a secondary issue, possibly with permissions or the state of the SSH server on the new VM after bootstrapping, but the primary blocker is the decryption failure on the operator side.
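To check the suspected root cause directly, the presence of the secret and the ConfigMap can be verified in the operator namespace. The following is a minimal sketch, assuming a POSIX shell and a logged-in oc client; the function name check_wmco_resources and the OC override variable are hypothetical and exist only so the CLI can be swapped for a stub. It reports presence only and does not validate the key contents.

```shell
# Hypothetical helper: report whether the resources named in this report
# exist in the WMCO namespace. Returns non-zero if any is missing.
# OC is an assumed override for the CLI binary; it defaults to "oc".
check_wmco_resources() {
  local ns="openshift-windows-machine-config-operator"
  local resource status=0
  for resource in secret/cloud-private-key configmap/windows-instances; do
    if "${OC:-oc}" get "$resource" -n "$ns" >/dev/null 2>&1; then
      echo "present: $resource"
    else
      echo "MISSING: $resource"
      status=1
    fi
  done
  return "$status"
}
```

Running check_wmco_resources before and after creating the windows-instances ConfigMap would show whether the secret disappears at the moment reconciliation is triggered.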