-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
4.20
-
None
-
False
Problem:
When creating a NodePool on AWS with instance type p5.4xlarge in us-west-2a, node provisioning fails.
The underlying failure is captured correctly in the AWSMachine status, but the full EC2 error is not surfaced/propagated to the NodePool conditions.
The NodePool only shows a generic failure ("Ignition not reached" / provisioning failed), which makes debugging difficult.
Observed AWSMachine error:
- apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSMachine
metadata:
annotations:
cluster.x-k8s.io/cloned-from-groupkind: AWSMachineTemplate.infrastructure.cluster.x-k8s.io
cluster.x-k8s.io/cloned-from-name: aaraj-hcp-1-workers-f49d826a
hypershift.openshift.io/nodePool: ocm-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39/aaraj-hcp-1-workers
creationTimestamp: "2025-11-26T03:36:40Z"
finalizers:
- awsmachine.infrastructure.cluster.x-k8s.io
generation: 1
labels:
2mq2bl4aellodv-aaeba791-aaraj-hcp-1-workers: 2mq2bl4aellodv-aaeba791-aaraj-hcp-1-workers
cluster.x-k8s.io/cluster-name: 2mq2bl4aellodvf3ekpf5h5vs7qfji39
cluster.x-k8s.io/deployment-name: aaraj-hcp-1-workers
cluster.x-k8s.io/set-name: aaraj-hcp-1-workers-ksssm
machine-template-hash: 824003499-ksssm
name: aaraj-hcp-1-workers-ksssm-vxd8d
namespace: ocm-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39-aaraj-hcp-1
ownerReferences:
- apiVersion: cluster.x-k8s.io/v1beta1
blockOwnerDeletion: true
controller: true
kind: Machine
name: aaraj-hcp-1-workers-ksssm-vxd8d
uid: 973e6ecc-d461-4579-9418-d737e80dc8c6
resourceVersion: "4544048933"
uid: 8f455d0d-3245-45a1-9a5b-b46781a9f1d4
spec:
additionalSecurityGroups:
- id: sg-0ed97efa5cb09674d
additionalTags:
api.openshift.com/environment: staging
api.openshift.com/id: 2mq2bl4aellodvf3ekpf5h5vs7qfji39
api.openshift.com/legal-entity-id: 1jlfDskrR39egznAq3T18Ul0Xxv
api.openshift.com/name: aaraj-hcp-1
api.openshift.com/nodepool-hypershift: aaraj-hcp-1-workers
api.openshift.com/nodepool-ocm: workers
kubernetes.io/cluster/2mq2bl4aellodvf3ekpf5h5vs7qfji39: owned
red-hat-clustertype: rosa
red-hat-managed: "true"
ami:
id: ami-0ec67ca44c1b28d5f
cloudInit:
insecureSkipSecretsManager: true
secureSecretsBackend: secrets-manager
iamInstanceProfile: rosa-service-managed-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39-aaraj-hcp-1-worker
instanceMetadataOptions:
httpEndpoint: enabled
httpPutResponseHopLimit: 2
httpTokens: optional
instanceMetadataTags: disabled
instanceType: p5.4xlarge
rootVolume:
encrypted: true
size: 300
type: gp3
subnet:
id: subnet-058477d82da808dff
uncompressedUserData: true
status:
conditions:
- lastTransitionTime: "2025-11-26T03:36:46Z"
message: 0 of 2 completed
reason: InstanceProvisionFailed
severity: Error
status: "False"
type: Ready
- lastTransitionTime: "2025-11-26T14:25:34Z"
message: 'failed to create AWSMachine instance: failed to run instance: operation
error EC2: RunInstances, exceeded maximum number of attempts, 3, https response
error StatusCode: 500, RequestID: 3ae55879-fcb0-470f-ab06-095b8a594496, api
error InsufficientInstanceCapacity: We currently do not have sufficient p5.4xlarge
capacity in the Availability Zone you requested (us-west-2a). Our system will
be working on provisioning additional capacity. You can currently get p5.4xlarge
capacity by not specifying an Availability Zone in your request or choosing
us-west-2b, us-west-2c, us-west-2d.'
reason: InstanceProvisionFailed
severity: Error
status: "False"
type: InstanceReady
- lastTransitionTime: "2025-11-26T13:08:30Z"
reason: NotPaused
status: "False"
type: Paused
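For reference, the detailed EC2 error can currently only be retrieved from the AWSMachine itself, e.g. by querying the InstanceReady condition shown above (resource name and namespace taken from the YAML):
oc get awsmachine -n ocm-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39-aaraj-hcp-1 aaraj-hcp-1-workers-ksssm-vxd8d -o jsonpath='{.status.conditions[?(@.type=="InstanceReady")].message}'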
Observed NodePool YAML:
oc get np -n ocm-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39 aaraj-hcp-1-workers -o yaml
apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
annotations:
hypershift.openshift.io/ec2-instance-metadata-http-tokens: optional
creationTimestamp: "2025-11-26T03:29:51Z"
finalizers:
- hypershift.openshift.io/finalizer
generation: 13
labels:
api.openshift.com/environment: staging
api.openshift.com/id: 2mq2bl4aellodvf3ekpf5h5vs7qfji39
api.openshift.com/legal-entity-id: 1jlfDskrR39egznAq3T18Ul0Xxv
api.openshift.com/name: aaraj-hcp-1
name: aaraj-hcp-1-workers
namespace: ocm-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39
ownerReferences:
- apiVersion: work.open-cluster-management.io/v1
kind: AppliedManifestWork
name: 8d70e512b62c74fa37d498fa32faf37a8b674db0b7166722e966128dcc1c9bf8-2mq2bl4aellodvf3ekpf5h5vs7qfji39-workers
uid: b71ea58b-6f04-4230-a625-0a135b43dbb6
- apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
name: aaraj-hcp-1
uid: 29948764-b5ac-4161-ad02-9cc220f11ceb
resourceVersion: "4543573097"
uid: 9475e274-3dec-431d-bcab-433d05271236
spec:
arch: amd64
clusterName: aaraj-hcp-1
management:
autoRepair: true
replace:
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
strategy: RollingUpdate
upgradeType: Replace
nodeVolumeDetachTimeout: 5m0s
pausedUntil: "false"
platform:
aws:
instanceProfile: rosa-service-managed-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39-aaraj-hcp-1-worker
instanceType: p5.4xlarge
resourceTags:
- key: api.openshift.com/environment
value: staging
- key: api.openshift.com/id
value: 2mq2bl4aellodvf3ekpf5h5vs7qfji39
- key: api.openshift.com/legal-entity-id
value: 1jlfDskrR39egznAq3T18Ul0Xxv
- key: api.openshift.com/name
value: aaraj-hcp-1
- key: api.openshift.com/nodepool-hypershift
value: aaraj-hcp-1-workers
- key: api.openshift.com/nodepool-ocm
value: workers
- key: red-hat-clustertype
value: rosa
- key: red-hat-managed
value: "true"
rootVolume:
encrypted: true
size: 300
type: gp3
subnet:
id: subnet-058477d82da808dff
type: AWS
release:
image: quay.io/openshift-release-dev/ocp-release@sha256:40cdd399e6243207a2d5bb6f2eaf8b57d01495c937c882eb6431a86f18439b70
replicas: 2
status:
conditions:
- lastTransitionTime: "2025-11-26T03:30:01Z"
observedGeneration: 13
reason: AsExpected
status: "False"
type: AutoscalingEnabled
- lastTransitionTime: "2025-11-26T03:30:01Z"
observedGeneration: 13
reason: AsExpected
status: "True"
type: UpdateManagementEnabled
- lastTransitionTime: "2025-11-26T03:30:02Z"
message: 'Using release image: quay.io/openshift-release-dev/ocp-release@sha256:40cdd399e6243207a2d5bb6f2eaf8b57d01495c937c882eb6431a86f18439b70'
observedGeneration: 13
reason: AsExpected
status: "True"
type: ValidReleaseImage
- lastTransitionTime: "2025-11-26T03:31:39Z"
observedGeneration: 13
reason: AsExpected
status: "True"
type: ValidArchPlatform
- lastTransitionTime: "2025-11-26T13:07:15Z"
message: Reconciliation active on resource
observedGeneration: 13
reason: ReconciliationActive
status: "True"
type: ReconciliationActive
- lastTransitionTime: "2025-11-26T03:30:14Z"
message: 'Updating version in progress. Target version: 4.20.2'
observedGeneration: 13
reason: AsExpected
status: "True"
type: UpdatingVersion
- lastTransitionTime: "2025-11-26T03:30:22Z"
message: |
2 of 2 machines are not ready
Machine aaraj-hcp-1-workers-ksssm-vxd8d: InstanceProvisionFailed
Machine aaraj-hcp-1-workers-ksssm-75lq2: InstanceProvisionFailed
observedGeneration: 13
reason: InstanceProvisionFailed
status: "False"
type: AllMachinesReady
- lastTransitionTime: "2025-11-26T03:30:22Z"
message: |
Machine aaraj-hcp-1-workers-ksssm-vxd8d: WaitingForNodeRef
Machine aaraj-hcp-1-workers-ksssm-75lq2: WaitingForNodeRef
observedGeneration: 13
reason: WaitingForNodeRef
status: "False"
type: AllNodesHealthy
- lastTransitionTime: "2025-11-26T03:30:22Z"
message: All is well
observedGeneration: 13
reason: AsExpected
status: "True"
type: ValidPlatformConfig
- lastTransitionTime: "2025-11-26T03:34:43Z"
observedGeneration: 13
reason: AsExpected
status: "True"
type: ValidMachineConfig
- lastTransitionTime: "2025-11-26T03:38:08Z"
message: Payload generated successfully
observedGeneration: 13
reason: AsExpected
status: "True"
type: ValidGeneratedPayload
- lastTransitionTime: "2025-11-26T03:35:54Z"
message: 'Updating config in progress. Target config: e21b31c0'
observedGeneration: 13
reason: AsExpected
status: "True"
type: UpdatingConfig
- lastTransitionTime: "2025-11-26T03:36:02Z"
observedGeneration: 13
reason: ignitionNotReached
status: "False"
type: ReachedIgnitionEndpoint
- lastTransitionTime: "2025-11-26T03:36:03Z"
message: Bootstrap AMI is "ami-0ec67ca44c1b28d5f"
observedGeneration: 13
reason: AsExpected
status: "True"
type: ValidPlatformImage
- lastTransitionTime: "2025-11-26T03:36:35Z"
message: NodePool has a default security group
observedGeneration: 13
reason: AsExpected
status: "True"
type: AWSSecurityGroupAvailable
- lastTransitionTime: "2025-11-26T03:36:07Z"
observedGeneration: 13
reason: AsExpected
status: "True"
type: ValidTuningConfig
- lastTransitionTime: "2025-11-26T03:36:40Z"
message: 'platform machine template update in progress. Target template: aaraj-hcp-1-workers-f49d826a'
observedGeneration: 13
reason: AsExpected
status: "True"
type: UpdatingPlatformMachineTemplate
- lastTransitionTime: "2025-11-26T03:37:12Z"
message: Minimum availability requires 2 replicas, current 0 available
observedGeneration: 13
reason: WaitingForAvailableMachines
status: "False"
type: Ready
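For comparison, the equivalent query against the NodePool's AllMachinesReady condition (captured above) returns only the generic per-machine reason, with none of the EC2 capacity details:
oc get np -n ocm-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39 aaraj-hcp-1-workers -o jsonpath='{.status.conditions[?(@.type=="AllMachinesReady")].message}'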
Expected behavior: The full EC2 error—including the InsufficientInstanceCapacity message and AZ recommendations—should be propagated (or summarized) in the NodePool conditions/events. Users should be able to determine capacity issues directly from NodePool status without needing to manually inspect the AWSMachine resources.
Actual behavior: The NodePool only reports a generic provisioning failure ("Ignition not reached") with no details. The capacity error is visible only on the underlying AWSMachine CR.
Impact: Debugging is difficult for customers and QE. The NodePool appears stuck without actionable information. Users cannot identify AWS capacity issues unless they inspect low-level resources.
Environment: Staging
Cluster type: Public, HCP
AZ: us-west-2a
Instance type: p5.4xlarge
Command used to create the cluster:
rosa create cluster -c aaraj-hcp --hosted-cp --sts --subnet-ids subnet-0077e54c1ad03828d,subnet-058477d82da808dff --role-arn arn:aws:iam::765374464689:role/aaraj-HCP-ROSA-Installer-Role --worker-iam-role-arn arn:aws:iam::765374464689:role/aaraj-HCP-ROSA-Worker-Role --support-role-arn arn:aws:iam::765374464689:role/aaraj-HCP-ROSA-Support-Role --region us-west-2 --mode auto -y --billing-account 361660083367 --version 4.20.2 --oidc-config-id 2lc041cno3npp4uriags9gfk900lu2he --compute-machine-type p5.4xlarge
Acceptance Criteria
- NodePool surfaces underlying AWS EC2 RunInstances errors
  - The full error (or a meaningful summarized version) from the AWSMachine, including InsufficientInstanceCapacity, is propagated to the NodePool status or events.
- NodePool conditions include actionable error details
  - NodePool.conditions[ProvisioningFailed] (or equivalent) must show a clear message indicating an AWS capacity shortage or similar EC2 errors.
- Users can identify AWS capacity issues without inspecting AWSMachine CRs
  - It must not be necessary to manually inspect AWSMachine resources to see the EC2 error.
- QE can reproduce the behavior and verify the error is visible in the NodePool status (see the example query after this list)
  - When forcing an instance type with no capacity in the requested AZ, the detailed error must appear directly on the NodePool.
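A possible verification step once this is fixed (a sketch; the exact NodePool condition or event that will carry the propagated error is up to the implementation): reproduce the failure with the rosa command above, then confirm the detailed EC2 message appears directly on the NodePool, e.g.:
oc get np -n ocm-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39 aaraj-hcp-1-workers -o jsonpath='{.status.conditions[?(@.type=="AllMachinesReady")].message}' | grep InsufficientInstanceCapacity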
Depends on:
- OCPSTRAT-1615 [Observability] Enhanced Debuggability for HyperShift Cluster NodePool Failures (In Progress)