Red Hat Advanced Cluster Management
ACM-24687

Integration MC failing to provision ROSA HCP workers (EC2 looping, pipelines blocked)


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Component: Fleet Manager
    • Category: Quality / Stability / Reliability

      Summary

      Management Cluster hs-mc-n1j3kghkg in the integration environment is failing to provision ROSA HCP clusters: worker nodes are stuck in a NodeProvisioning / EC2 instance-looping state. This blocks integration pipelines and affects multiple developers.

       

      Description

      During ROSA HCP provisioning in integration, the Management Cluster hs-mc-n1j3kghkg shows clusters progressing to “ready” control plane state but worker nodes never become healthy. EC2 instances appear to loop and fail to stabilize, preventing workers from joining.

      Symptoms observed:

      • Hosted clusters are created but remain in the Partial state with messages such as Waiting for Kube APIServer deployment to become available.
      • Worker node events show repeated deletion/removal because the corresponding EC2 instances do not exist in AWS.
      • Multiple pods in hosted clusters are in CrashLoopBackOff or Init:CrashLoopBackOff states (e.g., config-policy-controller, audit-webhook, cloud-network-config-controller, network-node-identity, openshift-route-controller-manager).

       

      Impact

      • This MC (hs-mc-n1j3kghkg, region: us-west-2) is used as the base for all integration pipelines (pre-merge, post-merge, and local dev).
      • Current degradation blocks developers and CI/CD flows; blast radius is high when integration is unhealthy.

       

      Original Data / Evidence

      HostedClusters Status (partial extract)

      ocm-avulaj-...   avulaj-le-test   Partial   False   Waiting for Kube APIServer deployment
      ocm-int-2l874qo5 ocpe2e-1pgj1ana  Completed False   Waiting for Kube APIServer deployment
      ocm-int-2lg10ea  ocpe2e-keryxv5s  Partial   False   Waiting for Kube APIServer deployment 
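      For triage, the stuck clusters can be filtered out of a `oc get hostedclusters -A` listing with a one-liner. A minimal sketch, run here against the sample rows above (truncated names left as-is); on the real MC the first step would instead be `oc get hostedclusters -A --no-headers > /tmp/hc.txt`:

```shell
# Save the sample hosted-cluster listing (stand-in for live `oc` output).
cat > /tmp/hc.txt <<'EOF'
ocm-avulaj-...   avulaj-le-test   Partial   False   Waiting for Kube APIServer deployment
ocm-int-2l874qo5 ocpe2e-1pgj1ana  Completed False   Waiting for Kube APIServer deployment
ocm-int-2lg10ea  ocpe2e-keryxv5s  Partial   False   Waiting for Kube APIServer deployment
EOF

# Third column is the progress phase; keep only clusters stuck in Partial.
awk '$3 == "Partial"' /tmp/hc.txt
```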

      Node Events (from MC workers)

      Normal DeletingNode  node/ip-10-0-115-180.us-west-2.compute.internal
      Deleting node ... because it does not exist in the cloud provider
      Normal RemovingNode  Node ... event: Removing Node ... from Controller 
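      These events are consistent with the cloud node lifecycle controller deleting Node objects whose backing cloud instance no longer exists. One way to cross-check a node against AWS is to pull the EC2 instance ID out of its providerID; a hedged sketch (the providerID value below is a placeholder, not taken from the affected MC):

```shell
# A Node's spec.providerID encodes the backing instance as aws:///<az>/<instance-id>.
# Placeholder value for illustration only:
provider_id='aws:///us-west-2a/i-0123456789abcdef0'

# Strip everything up to the last slash to get the instance ID.
instance_id=${provider_id##*/}
echo "$instance_id"

# With cluster and AWS access, the real check would be:
#   oc get node <name> -o jsonpath='{.spec.providerID}'
#   aws ec2 describe-instance-status --instance-ids "$instance_id" --include-all-instances
```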

      CrashLoopBackOff Pods (sample)

      config-policy-controller-*    CrashLoopBackOff
      audit-webhook-*               Init:CrashLoopBackOff
      cloud-network-config-*        CrashLoopBackOff
      network-node-identity-*       CrashLoopBackOff
      openshift-route-controller-*  CrashLoopBackOff 
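      A list like the one above can be produced by filtering `oc get pods -A --no-headers` for crashlooping pods. Sketch below runs against stand-in rows (namespace and pod suffixes are placeholders, not from the affected MC):

```shell
# Stand-in for `oc get pods -A --no-headers` output from a hosted control
# plane namespace.
cat > /tmp/pods.txt <<'EOF'
hcp-ns  config-policy-controller-abc12    0/1  CrashLoopBackOff       12  40m
hcp-ns  audit-webhook-abc12               0/1  Init:CrashLoopBackOff  12  40m
hcp-ns  cloud-network-config-abc12        0/1  CrashLoopBackOff       10  38m
hcp-ns  kube-apiserver-abc12              1/1  Running                 0  40m
EOF

# Keep only crashlooping pods; the pattern covers both the plain and
# Init: container variants.
grep -E 'CrashLoopBackOff' /tmp/pods.txt
```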

      Condition Sample (workers stuck)

      AllMachinesReady: True
      AllNodesHealthy: False
      Reason: NodeProvisioning
      Message: Machine ... NodeProvisioning 
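      The failing signal in a condition dump like this is any condition whose status is False. A small sketch that pulls that out of the sample above saved as key/value lines (on the MC the source would be something like `oc get machine <name> -o yaml`):

```shell
# The condition sample from this report, as key/value lines.
cat > /tmp/cond.txt <<'EOF'
AllMachinesReady: True
AllNodesHealthy: False
Reason: NodeProvisioning
EOF

# Print the names of conditions whose status is False.
awk -F': ' '$2 == "False" {print $1}' /tmp/cond.txt
```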

       

      Next Possible Steps / Requests

      1. Provide a remediation path: repair the existing MC or create a replacement in us-west-2 to unblock integration pipelines.
      2. Confirm whether the issue originates from Management Cluster health or from data-plane HCP worker provisioning.
      3. Identify the root cause of the EC2 instance loops and worker node failures.

              Assignee: Unassigned
              Reporter: Chunxi Luo (chuluo@redhat.com)
              Eveline Cai
              Votes: 0
              Watchers: 3