Type: Bug
Resolution: Unresolved
Priority: Critical
Category: Quality / Stability / Reliability
Summary
Management Cluster hs-mc-n1j3kghkg in the integration environment is failing to provision ROSA HCP clusters: worker nodes are stuck in NodeProvisioning while their EC2 instances loop and never stabilize. This blocks integration pipelines and affects multiple developers.
Description
During ROSA HCP provisioning in integration, the Management Cluster hs-mc-n1j3kghkg shows hosted clusters whose control planes reach the "ready" state, but worker nodes never become healthy. EC2 instances appear to loop and fail to stabilize, preventing workers from joining.
Symptoms observed (typical inspection commands are sketched after this list):
- Hosted clusters are created but remain in Partial state with messages such as "Waiting for Kube APIServer deployment to become available".
- Worker node events show repeated deletion/removal because the corresponding EC2 instances do not exist in AWS.
- Multiple pods in hosted clusters are in CrashLoopBackOff or Init:CrashLoopBackOff states (e.g., config-policy-controller, audit-webhook, cloud-network-config-controller, network-node-identity, openshift-route-controller-manager).
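For context, a minimal sketch of how this state can be inspected on the MC, assuming cluster-admin access via oc; anything in angle brackets is a placeholder, not a value taken from this ticket:

# List hosted clusters with their progress/availability and status message
oc get hostedclusters -A

# List node pools to compare desired vs. current replica counts
oc get nodepools -A

# Describe a specific node pool to see its conditions and recent events
oc describe nodepool <nodepool-name> -n <namespace>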
Impact
- This MC (hs-mc-n1j3kghkg, region: us-west-2) is used as the base for all integration pipelines (pre-merge, post-merge, and local dev).
- The current degradation blocks developers and CI/CD flows; the blast radius is high whenever integration is unhealthy.
Original Data / Evidence
HostedClusters Status (partial extract)
ocm-avulaj-...     avulaj-le-test    Partial    False   Waiting for Kube APIServer deployment
ocm-int-2l874qo5   ocpe2e-1pgj1ana   Completed  False   Waiting for Kube APIServer deployment
ocm-int-2lg10ea    ocpe2e-keryxv5s   Partial    False   Waiting for Kube APIServer deployment
Node Events (from MC workers)
Normal   DeletingNode   node/ip-10-0-115-180.us-west-2.compute.internal   Deleting node ... because it does not exist in the cloud provider
Normal   RemovingNode   Node ...   event: Removing Node ... from Controller
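A hedged sketch of how node lifecycle events like these can be collected from the MC; the field selector and sort flag are standard oc options rather than commands confirmed in this ticket:

# Node-related events (DeletingNode / RemovingNode), newest last
oc get events -A --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp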
CrashLoopBackOff Pods (sample)
config-policy-controller-*       CrashLoopBackOff
audit-webhook-*                  Init:CrashLoopBackOff
cloud-network-config-*           CrashLoopBackOff
network-node-identity-*          CrashLoopBackOff
openshift-route-controller-*     CrashLoopBackOff
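A sketch for pulling more detail from the crashing pods; the namespace and pod name placeholders are illustrative, not taken from this ticket:

# Find pods stuck in (Init:)CrashLoopBackOff across all namespaces
oc get pods -A | grep -E 'CrashLoopBackOff|Init:CrashLoopBackOff'

# Inspect one of them: restart reason plus logs from the previously crashed container
oc describe pod <pod-name> -n <hosted-control-plane-namespace>
oc logs <pod-name> -n <hosted-control-plane-namespace> --previous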
Condition Sample (workers stuck)
AllMachinesReady:  True
AllNodesHealthy:   False
Reason:            NodeProvisioning
Message:           Machine ... NodeProvisioning
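A sketch of where such a condition sample is typically read from, assuming the placeholders are filled in with the affected node pool:

# Full NodePool status, including the AllMachinesReady / AllNodesHealthy conditions
oc get nodepool <nodepool-name> -n <namespace> -o yaml

# Cluster API machines backing the node pools, to spot the ones stuck provisioning
oc get machines.cluster.x-k8s.io -A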
Possible next steps / requests:
- Provide a remediation path: repair the existing MC or create a replacement in us-west-2 to unblock integration pipelines.
- Confirm whether the issue originates in Management Cluster health or in data plane HCP worker provisioning.
- Identify the root cause of the EC2 instance loops and worker node failures (an AWS-side check is sketched after this list).
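One way to cross-check the EC2 side directly in AWS; the tag filter is an assumption about how worker instances are tagged for this hosted cluster, so substitute the real infra ID or tags before running:

# List instances tagged for the hosted cluster in us-west-2 together with their current state
aws ec2 describe-instances --region us-west-2 \
  --filters "Name=tag-key,Values=kubernetes.io/cluster/<infra-id>" \
  --query 'Reservations[].Instances[].{Id:InstanceId,State:State.Name,Launched:LaunchTime}' \
  --output table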
Relates to:
- OCPSTRAT-1615 [Observability] Enhanced Debuggability for HyperShift Cluster NodePool Failures (status: New)