OpenShift Bugs / OCPBUGS-60996

WMCO enters a reconciliation loop, failing to decrypt existing node annotations after adding a BYOH instance


    • Bug
    • Resolution: Cannot Reproduce
    • Critical
    • 4.20
    • Windows Containers
    • Quality / Stability / Reliability
    • False
    • 3
    • WINC - Sprint 276
    • 1

      Description of problem:

          When adding a Bring-Your-Own-Host (BYOH) Windows node to an existing Azure cluster with MachineSet-managed Windows nodes, the Windows Machine Config Operator (WMCO) enters a continuous reconciliation loop. It fails to reconcile existing healthy Windows nodes, logging the error "unable to decrypt username annotation... invalid passphrase supplied". This appears to be caused by the cloud-private-key secret becoming unavailable, which prevents the operator from decrypting node data. Consequently, the new BYOH node is never configured, existing nodes may become unstable, and the windows-instances ConfigMap is removed.

      Version-Release number of selected component (if applicable):

          Cloud Provider: Azure
          WMCO Version: 10.20.0-838f32f
          BYOH Windows Version: Windows Server 2019
          Platform Type: OpenShift Container Platform with existing Windows nodes managed by MachineSets.

      How reproducible:

          100%

      Steps to Reproduce:

          1. Have a running OCP cluster on Azure with at least one healthy MachineSet-based Windows node.
          2. Use the byoh.sh script and accompanying Terraform files (main.tf, variables.tf, windows-vm-bootstrap.tf) from https://gitlab.cee.redhat.com/winc/byoh-auto to provision a new BYOH Windows Server 2019 virtual machine.
          3. The script successfully applies the Terraform configuration and creates the windows-instances ConfigMap to trigger WMCO.
          4. Observe the WMCO pod logs and the status of the Windows nodes (oc get nodes -l kubernetes.io/os=windows).     
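The node check in step 4 can be scripted. The following is a minimal sketch, not part of the byoh-auto tooling: it assumes the output of `oc get nodes -l kubernetes.io/os=windows -o json` has been captured as a string, and the helper name `windows_node_readiness` is hypothetical.

```python
import json


def windows_node_readiness(node_list_json: str) -> dict[str, bool]:
    """Map each node name to True if its Ready condition has status "True".

    Expects the JSON NodeList printed by
    `oc get nodes -l kubernetes.io/os=windows -o json`.
    """
    readiness = {}
    for node in json.loads(node_list_json).get("items", []):
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        # A node is Ready only if a condition of type "Ready" reports "True".
        readiness[name] = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
    return readiness
```

Running this before and after creating the windows-instances ConfigMap gives a simple way to see which existing nodes leave the Ready state.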

      Actual results:

      - WMCO begins logging continuous reconciliation errors for the existing, previously healthy Windows nodes, stating "unable to decrypt username annotation for node <node-name>: unable to decrypt message using given key: invalid passphrase supplied".
      - WMCO restarts.
      - The log shows errors indicating the cloud-private-key secret cannot be found, which is necessary for decryption: "unable to create signer from private key secret: Secret \"cloud-private-key\" not found"
      - The operator fails to configure the new BYOH node, with logs showing connection timeouts and file transfer failures like "connection lost" and "sftp: \"Failure\" (SSH_FX_FAILURE)"
      - The windows-instances ConfigMap, which triggers the BYOH process, is deleted.
      - The original Windows nodes become unstable, with only one remaining in a Ready state.
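To see which of the failures above dominates a captured WMCO log, the lines can be bucketed by the error substrings quoted in this report. This is a rough sketch only: the category labels and the `triage` helper are hypothetical, and the actual log format may wrap or escape these messages differently.

```python
from collections import Counter

# Substrings taken from the errors quoted in this report.
# The category labels on the left are hypothetical.
SIGNATURES = {
    "decrypt_failure": "unable to decrypt username annotation",
    "private_key_secret": "unable to create signer from private key secret",
    "connection_lost": "connection lost",
    "sftp_failure": "SSH_FX_FAILURE",
}


def triage(log_lines):
    """Count how many log lines contain each known error substring."""
    counts = Counter()
    for line in log_lines:
        for label, needle in SIGNATURES.items():
            if needle in line:
                counts[label] += 1
    return dict(counts)
```

If `private_key_secret` errors appear before the first `decrypt_failure`, that would be consistent with the secret disappearing first; the ordering still has to be confirmed from the timestamps in the real log.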

      Expected results:

          - The new BYOH Windows node should be successfully configured by WMCO and join the cluster in a Ready state.
          - Existing Windows nodes should remain stable and in the Ready state without interruption.
          - WMCO should not enter an error loop, and the cloud-private-key should remain accessible.

      Additional info:

          The root cause appears to be the loss or invalidation of the `cloud-private-key` secret within the `openshift-windows-machine-config-operator` namespace. The creation of the windows-instances ConfigMap correctly triggers reconciliation. However, WMCO is then unable to decrypt the annotations on existing nodes because it cannot access its private key. This triggers a cascading failure where the operator cannot manage any of its nodes, new or old, and gets stuck in an error loop.

          The file transfer failures (sftp: "Failure") to the new node suggest a secondary issue, possibly with permissions or the state of the SSH server on the new VM after bootstrapping, but the primary blocker is the decryption failure on the operator side.
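One way to test the hypothesis above is to check whether the secret and the ConfigMap still exist while the loop is running. The sketch below wraps `oc get` for that purpose. It assumes `oc` is on the PATH and logged in to the cluster, and that the windows-instances ConfigMap lives in the same namespace as the secret; the helper names are hypothetical.

```python
import subprocess

# Namespace named in this report for the cloud-private-key secret.
NAMESPACE = "openshift-windows-machine-config-operator"


def oc_get_command(kind, name, namespace=NAMESPACE):
    """Build the `oc get` argument list for a single named resource."""
    return ["oc", "get", kind, name, "-n", namespace, "-o", "name"]


def resource_exists(kind, name, namespace=NAMESPACE, run=subprocess.run):
    """Return True if `oc get` exits with status 0 for the resource.

    A non-zero exit is treated as "missing", although it can also mean an
    authentication or connectivity problem; check stderr to tell them apart.
    """
    result = run(oc_get_command(kind, name, namespace),
                 capture_output=True, text=True)
    return result.returncode == 0
```

For example, calling `resource_exists("secret", "cloud-private-key")` and `resource_exists("configmap", "windows-instances")` at intervals during the BYOH run would show when each object disappears.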
      
      

        1. winbyoh-6kvlf-after.yaml
          3 kB
          Jose Valdes
        2. winbyoh-6kvlf-before.yaml
          3 kB
          Jose Valdes

              jvaldes@redhat.com Jose Valdes
              rrasouli Aharon Rasouli
              Aharon Rasouli