Type: Bug
Resolution: Unresolved
Priority: Critical
Category: Quality / Stability / Reliability
Summary
Management Cluster hs-mc-n1j3kghkg in the integration environment is failing to provision ROSA HCP clusters: worker nodes are stuck in NodeProvisioning while their EC2 instances loop and never stabilize. This blocks integration pipelines and affects multiple developers.
Description
During ROSA HCP provisioning in integration, the Management Cluster hs-mc-n1j3kghkg shows hosted clusters whose control planes reach the "ready" state, but worker nodes never become healthy. EC2 instances appear to loop and fail to stabilize, preventing workers from joining.
Symptoms observed (typical inspection commands are sketched after this list):
- Hosted clusters are created but remain in Partial state with messages such as "Waiting for Kube APIServer deployment to become available".
- Worker node events show repeated deletion/removal because the corresponding EC2 instances do not exist in AWS.
- Multiple pods in hosted clusters are in CrashLoopBackOff or Init:CrashLoopBackOff states (e.g., config-policy-controller, audit-webhook, cloud-network-config-controller, network-node-identity, openshift-route-controller-manager).
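For context, a minimal sketch of how this state can be inspected on the MC, assuming cluster-admin access via oc; anything in angle brackets is a placeholder, not a value taken from this ticket:

# List hosted clusters with their progress/availability and status message
oc get hostedclusters -A

# List node pools to compare desired vs. current replica counts
oc get nodepools -A

# Describe a specific node pool to see its conditions and recent events
oc describe nodepool <nodepool-name> -n <namespace>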
Impact
- This MC (hs-mc-n1j3kghkg, region: us-west-2) is used as the base for all integration pipelines (pre-merge, post-merge, and local dev).
- The current degradation blocks developers and CI/CD flows; the blast radius is high whenever integration is unhealthy.
Original Data / Evidence
HostedClusters Status (partial extract)
ocm-avulaj-...     avulaj-le-test    Partial    False   Waiting for Kube APIServer deployment
ocm-int-2l874qo5   ocpe2e-1pgj1ana   Completed  False   Waiting for Kube APIServer deployment
ocm-int-2lg10ea    ocpe2e-keryxv5s   Partial    False   Waiting for Kube APIServer deployment
Node Events (from MC workers)
Normal   DeletingNode   node/ip-10-0-115-180.us-west-2.compute.internal   Deleting node ... because it does not exist in the cloud provider
Normal   RemovingNode   Node ...   event: Removing Node ... from Controller
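A hedged sketch of how node lifecycle events like these can be collected from the MC; the field selector and sort flag are standard oc options rather than commands confirmed in this ticket:

# Node-related events (DeletingNode / RemovingNode), newest last
oc get events -A --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp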
CrashLoopBackOff Pods (sample)
config-policy-controller-*       CrashLoopBackOff
audit-webhook-*                  Init:CrashLoopBackOff
cloud-network-config-*           CrashLoopBackOff
network-node-identity-*          CrashLoopBackOff
openshift-route-controller-*     CrashLoopBackOff
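A sketch for pulling more detail from the crashing pods; the namespace and pod name placeholders are illustrative, not taken from this ticket:

# Find pods stuck in (Init:)CrashLoopBackOff across all namespaces
oc get pods -A | grep -E 'CrashLoopBackOff|Init:CrashLoopBackOff'

# Inspect one of them: restart reason plus logs from the previously crashed container
oc describe pod <pod-name> -n <hosted-control-plane-namespace>
oc logs <pod-name> -n <hosted-control-plane-namespace> --previous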
Condition Sample (workers stuck)
AllMachinesReady:  True
AllNodesHealthy:   False
Reason:            NodeProvisioning
Message:           Machine ... NodeProvisioning
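A sketch of where such a condition sample is typically read from, assuming the placeholders are filled in with the affected node pool:

# Full NodePool status, including the AllMachinesReady / AllNodesHealthy conditions
oc get nodepool <nodepool-name> -n <namespace> -o yaml

# Cluster API machines backing the node pools, to spot the ones stuck provisioning
oc get machines.cluster.x-k8s.io -A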
Possible next steps / requests:
- Provide a remediation path: repair the existing MC or create a replacement in us-west-2 to unblock integration pipelines.
- Confirm whether the issue originates in Management Cluster health or in data plane HCP worker provisioning.
- Identify the root cause of the EC2 instance loops and worker node failures (an AWS-side check is sketched after this list).
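One way to cross-check the EC2 side directly in AWS; the tag filter is an assumption about how worker instances are tagged for this hosted cluster, so substitute the real infra ID or tags before running:

# List instances tagged for the hosted cluster in us-west-2 together with their current state
aws ec2 describe-instances --region us-west-2 \
  --filters "Name=tag-key,Values=kubernetes.io/cluster/<infra-id>" \
  --query 'Reservations[].Instances[].{Id:InstanceId,State:State.Name,Launched:LaunchTime}' \
  --output table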
Relates to:
- OCPSTRAT-1615 [Observability] Enhanced Debuggability for HyperShift Cluster NodePool Failures (status: New)