OCPBUGS-73755

HostedCluster installation gets stuck with nodes in “Removing from cluster” state when multiple HostedClusters share the same agentNamespace


      Description of problem:

      While installing an Agent-based HostedCluster using HyperShift, only 2 out of 4 worker nodes successfully join the cluster. The remaining 2 nodes are visible in oc get nodes but are stuck in “Removing from cluster” state during installation. 
      
      Description
      The customer is deploying multiple agent-based HostedClusters via ArgoCD using a common build manifest. During installation of one HostedCluster, only 2 of the 4 NodePool replicas complete provisioning and receive their intended roles and labels. The other 2 nodes get stuck in “Removing from cluster” state during installation, even though they appear in oc get nodes.
      This results in:
      • HostedCluster installation incomplete
      • NodePool replica count not achieved
      • Node role labels inconsistent across nodes
      
      
      The build manifest places multiple HostedClusters and Agent inventory resources in the same namespace (clusters) and uses:
      • platform.agent.agentNamespace: clusters for all HostedClusters
      • InfraEnv, BareMetalHost, and NMStateConfig resources in the same clusters namespace
      
      This appears to allow cross-cluster agent adoption / reconciliation conflicts, leading to stuck node removal during provisioning.
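
The shared-namespace condition described above can be checked before a manifest set is applied. The following is a minimal sketch, assuming the manifests have already been parsed into Python dictionaries; the function name and the cluster names hc-a and hc-b are hypothetical, and only the fields relevant to this report are shown. It flags the configuration; it does not confirm that the shared namespace is the root cause.

```python
from collections import defaultdict


def find_shared_agent_namespaces(manifests):
    """Map each agentNamespace used by more than one HostedCluster to those cluster names."""
    by_namespace = defaultdict(list)
    for doc in manifests:
        if doc.get("kind") != "HostedCluster":
            continue
        spec = doc.get("spec") or {}
        platform = spec.get("platform") or {}
        agent = platform.get("agent") or {}
        agent_ns = agent.get("agentNamespace")
        if agent_ns:
            by_namespace[agent_ns].append(doc["metadata"]["name"])
    # Keep only namespaces that are referenced by two or more HostedClusters.
    return {ns: sorted(names) for ns, names in by_namespace.items() if len(names) > 1}


# Hypothetical, partial manifests mirroring the layout described in this report.
example = [
    {"kind": "HostedCluster",
     "metadata": {"name": "hc-a", "namespace": "clusters"},
     "spec": {"platform": {"type": "Agent", "agent": {"agentNamespace": "clusters"}}}},
    {"kind": "HostedCluster",
     "metadata": {"name": "hc-b", "namespace": "clusters"},
     "spec": {"platform": {"type": "Agent", "agent": {"agentNamespace": "clusters"}}}},
    {"kind": "NodePool",
     "metadata": {"name": "hc-a", "namespace": "clusters"},
     "spec": {"clusterName": "hc-a", "replicas": 4}},
]

shared = find_shared_agent_namespaces(example)
print(shared)
```

An empty result would mean that every HostedCluster in the set points at its own agent namespace.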
      
      Customer Impact
      • Hosted cluster installation fails or remains stuck
      • Nodes oscillate or remain stuck in the removing state
      • Prevents scaling or installing additional hosted clusters reliably
      • Requires manual remediation or reinstall

      Version-Release number of selected component (if applicable):

      4.18    

      How reproducible:

      100%    

      Steps to Reproduce:

         1. Create a namespace clusters
         2. Deploy multiple HostedCluster resources into clusters
         3. Deploy multiple NodePool resources into clusters (replicas=4)
         4. Set the following on all HostedClusters:
             platform:  
               agent:
                 agentNamespace: clusters
      
         5. Create all inventory resources (InfraEnv, BareMetalHost, NMStateConfig) in the same namespace clusters
         6. Start HostedCluster installation for one hosted cluster
         7. Observe that only 2 of the 4 nodes finish joining and labeling; the remaining nodes get stuck in “Removing from cluster”
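
Steps 2 to 4 can be sketched as generated manifest fragments. This is an illustrative sketch only: the function name and the cluster names hc-a and hc-b are hypothetical, the apiVersion hypershift.openshift.io/v1beta1 is an assumption about the installed HyperShift API, and the fragments omit fields a real HostedCluster or NodePool requires (release image, pull secret, and so on), so they are not applicable as-is.

```python
import json


def repro_fragments(cluster_names, namespace="clusters", replicas=4):
    """Build partial HostedCluster and NodePool manifests that mirror steps 2 to 4."""
    docs = []
    for name in cluster_names:
        # Every HostedCluster points at the same agentNamespace (step 4).
        docs.append({
            "apiVersion": "hypershift.openshift.io/v1beta1",  # assumed API version
            "kind": "HostedCluster",
            "metadata": {"name": name, "namespace": namespace},
            "spec": {"platform": {"type": "Agent",
                                  "agent": {"agentNamespace": namespace}}},
        })
        # One NodePool per HostedCluster with the replica count from step 3.
        docs.append({
            "apiVersion": "hypershift.openshift.io/v1beta1",  # assumed API version
            "kind": "NodePool",
            "metadata": {"name": name, "namespace": namespace},
            "spec": {"clusterName": name,
                     "replicas": replicas,
                     "platform": {"type": "Agent"}},
        })
    return docs


if __name__ == "__main__":
    # Print the fragments as JSON for inspection.
    print(json.dumps(repro_fragments(["hc-a", "hc-b"]), indent=2))
```

The inventory resources from step 5 (InfraEnv, BareMetalHost, NMStateConfig) are left out because their contents depend on the hardware being used.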

      Actual results:

      Some of the nodes fail to join the cluster.

      Expected results:

      All nodes should be added to the cluster.

      Additional info:

          

              cchun@redhat.com Crystal Chun
              rhn-support-chdeshpa Chinmay Deshpande
              Vladislav Kolodny Vladislav Kolodny