OpenShift Bugs / OCPBUGS-66103

NodePool provisioning fails on AWS due to missing instance capacity, but full EC2 error is not propagated to NodePool status


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: 4.20
    • Component/s: HyperShift / ROSA

      Problem:
      When creating a NodePool on AWS with instance type p5.4xlarge in us-west-2a, node provisioning fails.
      The underlying failure is captured correctly in the AWSMachine status, but the full EC2 error is not surfaced/propagated to the NodePool conditions.
      The NodePool shows only a generic failure (“Ignition not reached” / provisioning failed), which makes debugging difficult.

      Observed AWSMachine error:

      - apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSMachine
        metadata:
          annotations:
            cluster.x-k8s.io/cloned-from-groupkind: AWSMachineTemplate.infrastructure.cluster.x-k8s.io
            cluster.x-k8s.io/cloned-from-name: aaraj-hcp-1-workers-f49d826a
            hypershift.openshift.io/nodePool: ocm-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39/aaraj-hcp-1-workers
          creationTimestamp: "2025-11-26T03:36:40Z"
          finalizers:
          - awsmachine.infrastructure.cluster.x-k8s.io
          generation: 1
          labels:
            2mq2bl4aellodv-aaeba791-aaraj-hcp-1-workers: 2mq2bl4aellodv-aaeba791-aaraj-hcp-1-workers
            cluster.x-k8s.io/cluster-name: 2mq2bl4aellodvf3ekpf5h5vs7qfji39
            cluster.x-k8s.io/deployment-name: aaraj-hcp-1-workers
            cluster.x-k8s.io/set-name: aaraj-hcp-1-workers-ksssm
            machine-template-hash: 824003499-ksssm
          name: aaraj-hcp-1-workers-ksssm-vxd8d
          namespace: ocm-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39-aaraj-hcp-1
          ownerReferences:
          - apiVersion: cluster.x-k8s.io/v1beta1
            blockOwnerDeletion: true
            controller: true
            kind: Machine
            name: aaraj-hcp-1-workers-ksssm-vxd8d
            uid: 973e6ecc-d461-4579-9418-d737e80dc8c6
          resourceVersion: "4544048933"
          uid: 8f455d0d-3245-45a1-9a5b-b46781a9f1d4
        spec:
          additionalSecurityGroups:
          - id: sg-0ed97efa5cb09674d
          additionalTags:
            api.openshift.com/environment: staging
            api.openshift.com/id: 2mq2bl4aellodvf3ekpf5h5vs7qfji39
            api.openshift.com/legal-entity-id: 1jlfDskrR39egznAq3T18Ul0Xxv
            api.openshift.com/name: aaraj-hcp-1
            api.openshift.com/nodepool-hypershift: aaraj-hcp-1-workers
            api.openshift.com/nodepool-ocm: workers
            kubernetes.io/cluster/2mq2bl4aellodvf3ekpf5h5vs7qfji39: owned
            red-hat-clustertype: rosa
            red-hat-managed: "true"
          ami:
            id: ami-0ec67ca44c1b28d5f
          cloudInit:
            insecureSkipSecretsManager: true
            secureSecretsBackend: secrets-manager
          iamInstanceProfile: rosa-service-managed-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39-aaraj-hcp-1-worker
          instanceMetadataOptions:
            httpEndpoint: enabled
            httpPutResponseHopLimit: 2
            httpTokens: optional
            instanceMetadataTags: disabled
          instanceType: p5.4xlarge
          rootVolume:
            encrypted: true
            size: 300
            type: gp3
          subnet:
            id: subnet-058477d82da808dff
          uncompressedUserData: true
        status:
          conditions:
          - lastTransitionTime: "2025-11-26T03:36:46Z"
            message: 0 of 2 completed
            reason: InstanceProvisionFailed
            severity: Error
            status: "False"
            type: Ready
          - lastTransitionTime: "2025-11-26T14:25:34Z"
            message: 'failed to create AWSMachine instance: failed to run instance: operation
              error EC2: RunInstances, exceeded maximum number of attempts, 3, https response
              error StatusCode: 500, RequestID: 3ae55879-fcb0-470f-ab06-095b8a594496, api
              error InsufficientInstanceCapacity: We currently do not have sufficient p5.4xlarge
              capacity in the Availability Zone you requested (us-west-2a). Our system will
              be working on provisioning additional capacity. You can currently get p5.4xlarge
              capacity by not specifying an Availability Zone in your request or choosing
              us-west-2b, us-west-2c, us-west-2d.'
            reason: InstanceProvisionFailed
            severity: Error
            status: "False"
            type: InstanceReady
          - lastTransitionTime: "2025-11-26T13:08:30Z"
            reason: NotPaused
            status: "False"
            type: Paused 
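
      Today the only way to see the full EC2 error is to read the AWSMachine conditions directly on the management cluster. A minimal sketch of that manual workaround, assuming the hosted control plane namespace from this report:

      # Print the InstanceReady condition message for every AWSMachine in the HCP namespace.
      oc get awsmachines.infrastructure.cluster.x-k8s.io \
        -n ocm-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39-aaraj-hcp-1 \
        -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.conditions[?(@.type=="InstanceReady")].message}{"\n"}{end}'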

      Observed NodePool YAML:

      oc get np -n ocm-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39 aaraj-hcp-1-workers -o yaml
      apiVersion: hypershift.openshift.io/v1beta1
      kind: NodePool
      metadata:
        annotations:
          hypershift.openshift.io/ec2-instance-metadata-http-tokens: optional
        creationTimestamp: "2025-11-26T03:29:51Z"
        finalizers:
        - hypershift.openshift.io/finalizer
        generation: 13
        labels:
          api.openshift.com/environment: staging
          api.openshift.com/id: 2mq2bl4aellodvf3ekpf5h5vs7qfji39
          api.openshift.com/legal-entity-id: 1jlfDskrR39egznAq3T18Ul0Xxv
          api.openshift.com/name: aaraj-hcp-1
        name: aaraj-hcp-1-workers
        namespace: ocm-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39
        ownerReferences:
        - apiVersion: work.open-cluster-management.io/v1
          kind: AppliedManifestWork
          name: 8d70e512b62c74fa37d498fa32faf37a8b674db0b7166722e966128dcc1c9bf8-2mq2bl4aellodvf3ekpf5h5vs7qfji39-workers
          uid: b71ea58b-6f04-4230-a625-0a135b43dbb6
        - apiVersion: hypershift.openshift.io/v1beta1
          kind: HostedCluster
          name: aaraj-hcp-1
          uid: 29948764-b5ac-4161-ad02-9cc220f11ceb
        resourceVersion: "4543573097"
        uid: 9475e274-3dec-431d-bcab-433d05271236
      spec:
        arch: amd64
        clusterName: aaraj-hcp-1
        management:
          autoRepair: true
          replace:
            rollingUpdate:
              maxSurge: 1
              maxUnavailable: 0
            strategy: RollingUpdate
          upgradeType: Replace
        nodeVolumeDetachTimeout: 5m0s
        pausedUntil: "false"
        platform:
          aws:
            instanceProfile: rosa-service-managed-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39-aaraj-hcp-1-worker
            instanceType: p5.4xlarge
            resourceTags:
            - key: api.openshift.com/environment
              value: staging
            - key: api.openshift.com/id
              value: 2mq2bl4aellodvf3ekpf5h5vs7qfji39
            - key: api.openshift.com/legal-entity-id
              value: 1jlfDskrR39egznAq3T18Ul0Xxv
            - key: api.openshift.com/name
              value: aaraj-hcp-1
            - key: api.openshift.com/nodepool-hypershift
              value: aaraj-hcp-1-workers
            - key: api.openshift.com/nodepool-ocm
              value: workers
            - key: red-hat-clustertype
              value: rosa
            - key: red-hat-managed
              value: "true"
            rootVolume:
              encrypted: true
              size: 300
              type: gp3
            subnet:
              id: subnet-058477d82da808dff
          type: AWS
        release:
          image: quay.io/openshift-release-dev/ocp-release@sha256:40cdd399e6243207a2d5bb6f2eaf8b57d01495c937c882eb6431a86f18439b70
        replicas: 2
      status:
        conditions:
        - lastTransitionTime: "2025-11-26T03:30:01Z"
          observedGeneration: 13
          reason: AsExpected
          status: "False"
          type: AutoscalingEnabled
        - lastTransitionTime: "2025-11-26T03:30:01Z"
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: UpdateManagementEnabled
        - lastTransitionTime: "2025-11-26T03:30:02Z"
          message: 'Using release image: quay.io/openshift-release-dev/ocp-release@sha256:40cdd399e6243207a2d5bb6f2eaf8b57d01495c937c882eb6431a86f18439b70'
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: ValidReleaseImage
        - lastTransitionTime: "2025-11-26T03:31:39Z"
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: ValidArchPlatform
        - lastTransitionTime: "2025-11-26T13:07:15Z"
          message: Reconciliation active on resource
          observedGeneration: 13
          reason: ReconciliationActive
          status: "True"
          type: ReconciliationActive
        - lastTransitionTime: "2025-11-26T03:30:14Z"
          message: 'Updating version in progress. Target version: 4.20.2'
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: UpdatingVersion
        - lastTransitionTime: "2025-11-26T03:30:22Z"
          message: |
            2 of 2 machines are not ready
            Machine aaraj-hcp-1-workers-ksssm-vxd8d: InstanceProvisionFailed
            Machine aaraj-hcp-1-workers-ksssm-75lq2: InstanceProvisionFailed
          observedGeneration: 13
          reason: InstanceProvisionFailed
          status: "False"
          type: AllMachinesReady
        - lastTransitionTime: "2025-11-26T03:30:22Z"
          message: |
            Machine aaraj-hcp-1-workers-ksssm-vxd8d: WaitingForNodeRef
            Machine aaraj-hcp-1-workers-ksssm-75lq2: WaitingForNodeRef
          observedGeneration: 13
          reason: WaitingForNodeRef
          status: "False"
          type: AllNodesHealthy
        - lastTransitionTime: "2025-11-26T03:30:22Z"
          message: All is well
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: ValidPlatformConfig
        - lastTransitionTime: "2025-11-26T03:34:43Z"
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: ValidMachineConfig
        - lastTransitionTime: "2025-11-26T03:38:08Z"
          message: Payload generated successfully
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: ValidGeneratedPayload
        - lastTransitionTime: "2025-11-26T03:35:54Z"
          message: 'Updating config in progress. Target config: e21b31c0'
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: UpdatingConfig
        - lastTransitionTime: "2025-11-26T03:36:02Z"
          observedGeneration: 13
          reason: ignitionNotReached
          status: "False"
          type: ReachedIgnitionEndpoint
        - lastTransitionTime: "2025-11-26T03:36:03Z"
          message: Bootstrap AMI is "ami-0ec67ca44c1b28d5f"
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: ValidPlatformImage
        - lastTransitionTime: "2025-11-26T03:36:35Z"
          message: NodePool has a default security group
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: AWSSecurityGroupAvailable
        - lastTransitionTime: "2025-11-26T03:36:07Z"
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: ValidTuningConfig
        - lastTransitionTime: "2025-11-26T03:36:40Z"
          message: 'platform machine template update in progress. Target template: aaraj-hcp-1-workers-f49d826a'
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: UpdatingPlatformMachineTemplate
        - lastTransitionTime: "2025-11-26T03:37:12Z"
          message: Minimum availability requires 2 replicas, current 0 available
          observedGeneration: 13
          reason: WaitingForAvailableMachines
          status: "False"
          type: Ready 

      Expected behavior: The full EC2 error—including the InsufficientInstanceCapacity message and AZ recommendations—should be propagated (or summarized) in the NodePool conditions/events. Users should be able to determine capacity issues directly from NodePool status without needing to manually inspect the AWSMachine resources.

      Actual behavior: The NodePool reports only a generic provisioning failure (“Ignition not reached”) with no details. The capacity error is visible only on the underlying AWSMachine CR.
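
      For comparison, reading the corresponding condition from the NodePool (a sketch, reusing the names from this report) returns only the per-machine reason, not the EC2 detail:

      # The AllMachinesReady message lists "InstanceProvisionFailed" per machine,
      # but does not include the InsufficientInstanceCapacity text shown on the AWSMachine.
      oc get np -n ocm-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39 aaraj-hcp-1-workers \
        -o jsonpath='{.status.conditions[?(@.type=="AllMachinesReady")].message}'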

      Impact: Debugging is difficult for customers and QE. NodePools appear stuck without actionable information. Users cannot identify AWS capacity issues unless they inspect low-level resources.

      Environment: Staging

      Cluster type: Public, HCP

      AZ: us-west-2a

      Instance type: p5.4xlarge

      The command used to create the cluster:

       rosa create cluster -c aaraj-hcp --hosted-cp --sts --subnet-ids subnet-0077e54c1ad03828d,subnet-058477d82da808dff  --role-arn arn:aws:iam::765374464689:role/aaraj-HCP-ROSA-Installer-Role --worker-iam-role-arn arn:aws:iam::765374464689:role/aaraj-HCP-ROSA-Worker-Role --support-role-arn arn:aws:iam::765374464689:role/aaraj-HCP-ROSA-Support-Role --region us-west-2 --mode auto  -y --billing-account 361660083367 --version 4.20.2 --oidc-config-id 2lc041cno3npp4uriags9gfk900lu2he --compute-machine-type p5.4xlarge

      Acceptance Criteria

      1. NodePool surfaces underlying AWS EC2 RunInstances errors
        • The full error (or a meaningful summarized version) from the AWSMachine, including InsufficientInstanceCapacity, is propagated to the NodePool status or events.
      2. NodePool conditions include actionable error details
        • NodePool.conditions[ProvisioningFailed] (or equivalent) must show a clear message indicating an AWS capacity shortage or a similar EC2 error.
      3. Users can identify AWS capacity issues without inspecting AWSMachine CRs
        • It must not be necessary to manually inspect AWSMachine resources to see the EC2 error.
      4. QE can reproduce the behavior and verify the error is visible in NodePool status
        • When forcing an instance type with no capacity in the requested AZ, the detailed error must appear directly on the NodePool (see the verification sketch below).

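      A possible verification sketch for criterion 4, assuming the detail is surfaced on an existing or new NodePool condition (the exact condition type is still to be decided, so a plain keyword search is used here):

      # After forcing a capacity failure (e.g. p5.4xlarge pinned to us-west-2a), the EC2
      # InsufficientInstanceCapacity text should be readable from the NodePool itself,
      # without inspecting AWSMachine CRs. The grep target assumes the error text is
      # propagated verbatim or summarized with the same keyword.
      oc get np -n ocm-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39 aaraj-hcp-1-workers -o yaml \
        | grep -i "InsufficientInstanceCapacity"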