OpenShift Bugs / OCPBUGS-5759

Deletion of BYOH Windows node hangs in Ready,SchedulingDisabled

Details

    • Bug
    • Resolution: Done
    • Major
    • None
    • 4.12
    • None
    • Moderate
    • Rejected
    • False

    Description

      Description of problem:

      When a BYOH node is deleted, in a Platform:none cluster as well as in an Azure IPI cluster, the node is reconciled correctly. However, once it is added back to the cluster it stays in Ready,SchedulingDisabled. The WMCO logs show the following error:
      
      {"level":"error","ts":"2022-12-14T16:14:31Z","msg":"Reconciler error","controller":"configmap","controllerGroup":"","controllerKind":"ConfigMap","configMap":{"name":"windows-instances","namespace":"openshift-windows-machine-config-operator"},"namespace":"openshift-windows-machine-config-operator","name":"windows-instances","reconcileID":"d66a3142-d52c-43f5-8a42-214ce9c88417","error":"error configuring host with address 10.0.55.21: configuring node network failed: error waiting for k8s.ovn.org/hybrid-overlay-node-subnet node annotation for byoh-2019: timeout waiting for k8s.ovn.org/hybrid-overlay-node-subnet node annotation: timed out waiting for the condition"
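
To pull entries like this one out of the operator logs, a filter along the following lines may help. It is only a minimal sketch: the function name `filter_overlay_errors` is made up for illustration, and the deployment name in the usage comment is an assumption that this report does not confirm.

```shell
# Hypothetical helper: keep only error-level log lines that mention the
# hybrid-overlay subnet annotation. Reads JSON log lines on stdin.
filter_overlay_errors() {
  grep -F '"level":"error"' | grep -F 'k8s.ovn.org/hybrid-overlay-node-subnet'
}

# Usage (the deployment name is an assumption; adjust it to your cluster):
#   oc logs -n openshift-windows-machine-config-operator \
#     deployment/windows-machine-config-operator | filter_overlay_errors
```

Plain fixed-string matching is used so the sketch needs nothing beyond `grep`; a JSON-aware tool would be more robust if one is available.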
      
      The node's annotations confirm that k8s.ovn.org/hybrid-overlay-node-subnet is indeed missing:
      
      $ oc get nodes byoh-2019 -o=jsonpath="{.metadata.annotations}"
      {"volumes.kubernetes.io/controller-managed-attach-detach":"true","windowsmachineconfig.openshift.io/desired-version":"7.0.0-16f486a","windowsmachineconfig.openshift.io/pub-key-hash":"1df2c166b1c401180523270e9cf6bc2cd2724b9279ea65668a3b95298525a0f5","windowsmachineconfig.openshift.io/username":"wx4EBwMICL6qT+4RY8tgbx4hiRmQdHlwUsHgVGCTVY7S5gG/G5gb/Wzv0JBLhNP9\u003cwmcoMarker\u003ejlmI5ExHPYFrd2Fw6Lxe/6PKEE5/vYAhZ2n1Z2nBIoa1xN1/HEaXhqR2CuXNe7Ez\u003cwmcoMarker\u003eg2Hg+gA=\u003cwmcoMarker\u003e=ubWA"}
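
The same check can be scripted. The sketch below assumes the annotations are printed as a JSON object, as in the output above; `has_annotation` is a hypothetical helper name, and the match is a naive fixed-string search for the quoted key rather than real JSON parsing.

```shell
# Hypothetical helper: succeed if the annotations JSON contains the given key.
# $1: annotations as a JSON object string, $2: annotation key to look for
has_annotation() {
  printf '%s' "$1" | grep -qF "\"$2\":"
}

# Usage, reusing the command from this report:
#   annotations=$(oc get nodes byoh-2019 -o=jsonpath="{.metadata.annotations}")
#   has_annotation "$annotations" k8s.ovn.org/hybrid-overlay-node-subnet \
#     || echo "hybrid-overlay subnet annotation is missing"
```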
      
      Tested on Azure IPI and Platform:None; the issue was reproduced in both cases.
      
      

      Version-Release number of selected component (if applicable):

      $ oc get cm -n openshift-windows-machine-config-operator 
      NAME                                   DATA   AGE
      kube-root-ca.crt                       1      10h
      openshift-service-ca.crt               1      10h
      windows-instances                      2      9h
      windows-machine-config-operator-lock   0      6h24m
      windows-services-7.0.0-16f486a         2      6h23m
      $ oc get clusterversion
      NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.0-rc.4   True        False         6h48m   Cluster version is 4.12.0-rc.4
      

      How reproducible:

      
      

      Steps to Reproduce:

      1. Deploy an OCP 4.11 cluster with WMCO 6.0.0
      2. Add one or two BYOH nodes to the cluster
      3. Upgrade the cluster to OCP 4.12, then upgrade WMCO to 7.0.0
      4. Remove one of the byoh nodes using: oc delete node <byoh-node-id>
      5. Wait for reconciliation to bring the node back
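
Steps 4 and 5 can be scripted roughly as follows. This is a sketch under assumptions, not part of the original reproduction: the helper names `node_status` and `wait_for_ready`, the default timeout, and the polling interval are all illustrative. It relies on the STATUS value being the second column of `oc get node --no-headers`.

```shell
# Print the STATUS column for a node, as reported by `oc get node`.
# Prints nothing if the node does not exist (yet).
node_status() {
  oc get node "$1" --no-headers 2>/dev/null | awk '{print $2}'
}

# Poll until the node reports exactly "Ready", or give up after a timeout.
# $1: node name, $2: timeout in seconds (default 600, an arbitrary choice),
# $3: poll interval in seconds (default 10, also arbitrary)
wait_for_ready() {
  node=$1
  timeout=${2:-600}
  interval=${3:-10}
  elapsed=0
  status=
  while [ "$elapsed" -lt "$timeout" ]; do
    status=$(node_status "$node")
    [ "$status" = "Ready" ] && return 0
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "node $node still in state '${status:-absent}' after ${timeout}s" >&2
  return 1
}

# Usage:
#   oc delete node <byoh-node-id> && wait_for_ready <byoh-node-id>
```

With the behavior described in this report, the wait would be expected to time out with the node still in Ready,SchedulingDisabled.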
      

      Actual results:

      The deleted node is re-added but stays in Ready,SchedulingDisabled, and its workloads are left in the Pending state.
      

      Expected results:

      The node is properly added back to the cluster and stays in the Ready state.
      

      Additional info:

      
      

            People

              jtanenba@redhat.com Jacob Tanenbaum
              rhn-engineering-jfrancoa Jose Luis Franco Arza (Inactive)
              Aharon Rasouli Aharon Rasouli