OpenShift Bugs / OCPBUGS-66103

NodePool provisioning fails on AWS due to missing instance capacity, but full EC2 error is not propagated to NodePool status


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: 4.20
    • Component/s: HyperShift / ROSA

      Problem:
      When creating a NodePool on AWS with instance type p5.4xlarge in us-west-2a, node provisioning fails.
      The underlying failure is captured correctly in the AWSMachine status, but the full EC2 error is not surfaced/propagated to the NodePool conditions.
      The NodePool shows only a generic failure (“Ignition not reached” / provisioning failed), which makes debugging difficult.

      Observed AWSMachine error:

      - apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSMachine
        metadata:
          annotations:
            cluster.x-k8s.io/cloned-from-groupkind: AWSMachineTemplate.infrastructure.cluster.x-k8s.io
            cluster.x-k8s.io/cloned-from-name: aaraj-hcp-1-workers-f49d826a
            hypershift.openshift.io/nodePool: ocm-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39/aaraj-hcp-1-workers
          creationTimestamp: "2025-11-26T03:36:40Z"
          finalizers:
          - awsmachine.infrastructure.cluster.x-k8s.io
          generation: 1
          labels:
            2mq2bl4aellodv-aaeba791-aaraj-hcp-1-workers: 2mq2bl4aellodv-aaeba791-aaraj-hcp-1-workers
            cluster.x-k8s.io/cluster-name: 2mq2bl4aellodvf3ekpf5h5vs7qfji39
            cluster.x-k8s.io/deployment-name: aaraj-hcp-1-workers
            cluster.x-k8s.io/set-name: aaraj-hcp-1-workers-ksssm
            machine-template-hash: 824003499-ksssm
          name: aaraj-hcp-1-workers-ksssm-vxd8d
          namespace: ocm-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39-aaraj-hcp-1
          ownerReferences:
          - apiVersion: cluster.x-k8s.io/v1beta1
            blockOwnerDeletion: true
            controller: true
            kind: Machine
            name: aaraj-hcp-1-workers-ksssm-vxd8d
            uid: 973e6ecc-d461-4579-9418-d737e80dc8c6
          resourceVersion: "4544048933"
          uid: 8f455d0d-3245-45a1-9a5b-b46781a9f1d4
        spec:
          additionalSecurityGroups:
          - id: sg-0ed97efa5cb09674d
          additionalTags:
            api.openshift.com/environment: staging
            api.openshift.com/id: 2mq2bl4aellodvf3ekpf5h5vs7qfji39
            api.openshift.com/legal-entity-id: 1jlfDskrR39egznAq3T18Ul0Xxv
            api.openshift.com/name: aaraj-hcp-1
            api.openshift.com/nodepool-hypershift: aaraj-hcp-1-workers
            api.openshift.com/nodepool-ocm: workers
            kubernetes.io/cluster/2mq2bl4aellodvf3ekpf5h5vs7qfji39: owned
            red-hat-clustertype: rosa
            red-hat-managed: "true"
          ami:
            id: ami-0ec67ca44c1b28d5f
          cloudInit:
            insecureSkipSecretsManager: true
            secureSecretsBackend: secrets-manager
          iamInstanceProfile: rosa-service-managed-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39-aaraj-hcp-1-worker
          instanceMetadataOptions:
            httpEndpoint: enabled
            httpPutResponseHopLimit: 2
            httpTokens: optional
            instanceMetadataTags: disabled
          instanceType: p5.4xlarge
          rootVolume:
            encrypted: true
            size: 300
            type: gp3
          subnet:
            id: subnet-058477d82da808dff
          uncompressedUserData: true
        status:
          conditions:
          - lastTransitionTime: "2025-11-26T03:36:46Z"
            message: 0 of 2 completed
            reason: InstanceProvisionFailed
            severity: Error
            status: "False"
            type: Ready
          - lastTransitionTime: "2025-11-26T14:25:34Z"
            message: 'failed to create AWSMachine instance: failed to run instance: operation
              error EC2: RunInstances, exceeded maximum number of attempts, 3, https response
              error StatusCode: 500, RequestID: 3ae55879-fcb0-470f-ab06-095b8a594496, api
              error InsufficientInstanceCapacity: We currently do not have sufficient p5.4xlarge
              capacity in the Availability Zone you requested (us-west-2a). Our system will
              be working on provisioning additional capacity. You can currently get p5.4xlarge
              capacity by not specifying an Availability Zone in your request or choosing
              us-west-2b, us-west-2c, us-west-2d.'
            reason: InstanceProvisionFailed
            severity: Error
            status: "False"
            type: InstanceReady
          - lastTransitionTime: "2025-11-26T13:08:30Z"
            reason: NotPaused
            status: "False"
            type: Paused 
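
      Today the only way to see the full EC2 error is to read the AWSMachine conditions directly on the management cluster. A minimal sketch of that manual workaround, assuming the hosted control plane namespace from this report:

      # Print the InstanceReady condition message for every AWSMachine in the HCP namespace.
      oc get awsmachines.infrastructure.cluster.x-k8s.io \
        -n ocm-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39-aaraj-hcp-1 \
        -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.conditions[?(@.type=="InstanceReady")].message}{"\n"}{end}'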

      Observed NodePool YAML:

      oc get np -n ocm-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39 aaraj-hcp-1-workers -o yaml
      apiVersion: hypershift.openshift.io/v1beta1
      kind: NodePool
      metadata:
        annotations:
          hypershift.openshift.io/ec2-instance-metadata-http-tokens: optional
        creationTimestamp: "2025-11-26T03:29:51Z"
        finalizers:
        - hypershift.openshift.io/finalizer
        generation: 13
        labels:
          api.openshift.com/environment: staging
          api.openshift.com/id: 2mq2bl4aellodvf3ekpf5h5vs7qfji39
          api.openshift.com/legal-entity-id: 1jlfDskrR39egznAq3T18Ul0Xxv
          api.openshift.com/name: aaraj-hcp-1
        name: aaraj-hcp-1-workers
        namespace: ocm-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39
        ownerReferences:
        - apiVersion: work.open-cluster-management.io/v1
          kind: AppliedManifestWork
          name: 8d70e512b62c74fa37d498fa32faf37a8b674db0b7166722e966128dcc1c9bf8-2mq2bl4aellodvf3ekpf5h5vs7qfji39-workers
          uid: b71ea58b-6f04-4230-a625-0a135b43dbb6
        - apiVersion: hypershift.openshift.io/v1beta1
          kind: HostedCluster
          name: aaraj-hcp-1
          uid: 29948764-b5ac-4161-ad02-9cc220f11ceb
        resourceVersion: "4543573097"
        uid: 9475e274-3dec-431d-bcab-433d05271236
      spec:
        arch: amd64
        clusterName: aaraj-hcp-1
        management:
          autoRepair: true
          replace:
            rollingUpdate:
              maxSurge: 1
              maxUnavailable: 0
            strategy: RollingUpdate
          upgradeType: Replace
        nodeVolumeDetachTimeout: 5m0s
        pausedUntil: "false"
        platform:
          aws:
            instanceProfile: rosa-service-managed-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39-aaraj-hcp-1-worker
            instanceType: p5.4xlarge
            resourceTags:
            - key: api.openshift.com/environment
              value: staging
            - key: api.openshift.com/id
              value: 2mq2bl4aellodvf3ekpf5h5vs7qfji39
            - key: api.openshift.com/legal-entity-id
              value: 1jlfDskrR39egznAq3T18Ul0Xxv
            - key: api.openshift.com/name
              value: aaraj-hcp-1
            - key: api.openshift.com/nodepool-hypershift
              value: aaraj-hcp-1-workers
            - key: api.openshift.com/nodepool-ocm
              value: workers
            - key: red-hat-clustertype
              value: rosa
            - key: red-hat-managed
              value: "true"
            rootVolume:
              encrypted: true
              size: 300
              type: gp3
            subnet:
              id: subnet-058477d82da808dff
          type: AWS
        release:
          image: quay.io/openshift-release-dev/ocp-release@sha256:40cdd399e6243207a2d5bb6f2eaf8b57d01495c937c882eb6431a86f18439b70
        replicas: 2
      status:
        conditions:
        - lastTransitionTime: "2025-11-26T03:30:01Z"
          observedGeneration: 13
          reason: AsExpected
          status: "False"
          type: AutoscalingEnabled
        - lastTransitionTime: "2025-11-26T03:30:01Z"
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: UpdateManagementEnabled
        - lastTransitionTime: "2025-11-26T03:30:02Z"
          message: 'Using release image: quay.io/openshift-release-dev/ocp-release@sha256:40cdd399e6243207a2d5bb6f2eaf8b57d01495c937c882eb6431a86f18439b70'
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: ValidReleaseImage
        - lastTransitionTime: "2025-11-26T03:31:39Z"
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: ValidArchPlatform
        - lastTransitionTime: "2025-11-26T13:07:15Z"
          message: Reconciliation active on resource
          observedGeneration: 13
          reason: ReconciliationActive
          status: "True"
          type: ReconciliationActive
        - lastTransitionTime: "2025-11-26T03:30:14Z"
          message: 'Updating version in progress. Target version: 4.20.2'
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: UpdatingVersion
        - lastTransitionTime: "2025-11-26T03:30:22Z"
          message: |
            2 of 2 machines are not ready
            Machine aaraj-hcp-1-workers-ksssm-vxd8d: InstanceProvisionFailed
            Machine aaraj-hcp-1-workers-ksssm-75lq2: InstanceProvisionFailed
          observedGeneration: 13
          reason: InstanceProvisionFailed
          status: "False"
          type: AllMachinesReady
        - lastTransitionTime: "2025-11-26T03:30:22Z"
          message: |
            Machine aaraj-hcp-1-workers-ksssm-vxd8d: WaitingForNodeRef
            Machine aaraj-hcp-1-workers-ksssm-75lq2: WaitingForNodeRef
          observedGeneration: 13
          reason: WaitingForNodeRef
          status: "False"
          type: AllNodesHealthy
        - lastTransitionTime: "2025-11-26T03:30:22Z"
          message: All is well
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: ValidPlatformConfig
        - lastTransitionTime: "2025-11-26T03:34:43Z"
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: ValidMachineConfig
        - lastTransitionTime: "2025-11-26T03:38:08Z"
          message: Payload generated successfully
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: ValidGeneratedPayload
        - lastTransitionTime: "2025-11-26T03:35:54Z"
          message: 'Updating config in progress. Target config: e21b31c0'
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: UpdatingConfig
        - lastTransitionTime: "2025-11-26T03:36:02Z"
          observedGeneration: 13
          reason: ignitionNotReached
          status: "False"
          type: ReachedIgnitionEndpoint
        - lastTransitionTime: "2025-11-26T03:36:03Z"
          message: Bootstrap AMI is "ami-0ec67ca44c1b28d5f"
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: ValidPlatformImage
        - lastTransitionTime: "2025-11-26T03:36:35Z"
          message: NodePool has a default security group
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: AWSSecurityGroupAvailable
        - lastTransitionTime: "2025-11-26T03:36:07Z"
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: ValidTuningConfig
        - lastTransitionTime: "2025-11-26T03:36:40Z"
          message: 'platform machine template update in progress. Target template: aaraj-hcp-1-workers-f49d826a'
          observedGeneration: 13
          reason: AsExpected
          status: "True"
          type: UpdatingPlatformMachineTemplate
        - lastTransitionTime: "2025-11-26T03:37:12Z"
          message: Minimum availability requires 2 replicas, current 0 available
          observedGeneration: 13
          reason: WaitingForAvailableMachines
          status: "False"
          type: Ready 

      Expected behavior: The full EC2 error—including the InsufficientInstanceCapacity message and AZ recommendations—should be propagated (or summarized) in the NodePool conditions/events. Users should be able to determine capacity issues directly from NodePool status without needing to manually inspect the AWSMachine resources.

      Actual behavior: The NodePool reports only a generic provisioning failure (“Ignition not reached”) with no details. The capacity error is visible only on the underlying AWSMachine CR.
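
      For comparison, reading the corresponding condition from the NodePool (a sketch, reusing the names from this report) returns only the per-machine reason, not the EC2 detail:

      # The AllMachinesReady message lists "InstanceProvisionFailed" per machine,
      # but does not include the InsufficientInstanceCapacity text shown on the AWSMachine.
      oc get np -n ocm-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39 aaraj-hcp-1-workers \
        -o jsonpath='{.status.conditions[?(@.type=="AllMachinesReady")].message}'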

      Impact: Debugging is difficult for customers and QE. NodePools appear stuck without actionable information. Users cannot identify AWS capacity issues unless they inspect low-level resources.

      Environment: Staging

      Cluster type: Public, HCP

      AZ: us-west-2a

      Instance type: p5.4xlarge

      The command used to create the cluster:

       rosa create cluster -c aaraj-hcp --hosted-cp --sts --subnet-ids subnet-0077e54c1ad03828d,subnet-058477d82da808dff  --role-arn arn:aws:iam::765374464689:role/aaraj-HCP-ROSA-Installer-Role --worker-iam-role-arn arn:aws:iam::765374464689:role/aaraj-HCP-ROSA-Worker-Role --support-role-arn arn:aws:iam::765374464689:role/aaraj-HCP-ROSA-Support-Role --region us-west-2 --mode auto  -y --billing-account 361660083367 --version 4.20.2 --oidc-config-id 2lc041cno3npp4uriags9gfk900lu2he --compute-machine-type p5.4xlarge

      Acceptance Criteria

      1. NodePool surfaces underlying AWS EC2 RunInstances errors
        • The full error (or a meaningful summarized version) from the AWSMachine, including InsufficientInstanceCapacity, is propagated to the NodePool status or events.
      2. NodePool conditions include actionable error details
        • NodePool.conditions[ProvisioningFailed] (or equivalent) must show a clear message indicating an AWS capacity shortage or a similar EC2 error.
      3. Users can identify AWS capacity issues without inspecting AWSMachine CRs
        • It must not be necessary to manually inspect AWSMachine resources to see the EC2 error.
      4. QE can reproduce the behavior and verify the error is visible in NodePool status
        • When forcing an instance type with no capacity in the requested AZ, the detailed error must appear directly on the NodePool (see the verification sketch below).

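      A possible verification sketch for criterion 4, assuming the detail is surfaced on an existing or new NodePool condition (the exact condition type is still to be decided, so a plain keyword search is used here):

      # After forcing a capacity failure (e.g. p5.4xlarge pinned to us-west-2a), the EC2
      # InsufficientInstanceCapacity text should be readable from the NodePool itself,
      # without inspecting AWSMachine CRs. The grep target assumes the error text is
      # propagated verbatim or summarized with the same keyword.
      oc get np -n ocm-staging-2mq2bl4aellodvf3ekpf5h5vs7qfji39 aaraj-hcp-1-workers -o yaml \
        | grep -i "InsufficientInstanceCapacity"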