OCPBUGS-42241 (OpenShift Bugs)

[CAPI Azure] Failed to provision machines when setting controlPlane instance type to Standard_M8-4ms


    • Moderate
    • None
    • Installer Sprint 260, Installer Sprint 261, Installer Sprint 262
    • 3
    • False
    • * When installing a cluster on {azure-full}, specifying the `Standard_M8-4ms` instance type results in an error because that instance type reports its memory as a decimal value instead of an integer. (link:https://issues.redhat.com/browse/OCPBUGS-42241[*OCPBUGS-42241*])

      (Please check with QE whether this instance type is also not allowed for compute nodes.)
    • Known Issue
    • Done

      Description of problem:

      When creating a cluster with the instance type Standard_M8-4ms, the installer fails to provision machines.
      
      install-config:
      ================
      controlPlane:
        architecture: amd64
        hyperthreading: Enabled
        name: master
        platform:
          azure:
            type: Standard_M8-4ms
      
      Create cluster:
      =====================
      $ ./openshift-install create cluster --dir ipi3
      INFO Waiting up to 15m0s (until 2:31AM UTC) for machines [jimainstance01-h45wv-bootstrap jimainstance01-h45wv-master-0 jimainstance01-h45wv-master-1 jimainstance01-h45wv-master-2] to provision... 
      ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: control-plane machines were not provisioned within 15m0s: client rate limiter Wait returned an error: context deadline exceeded 
      INFO Shutting down local Cluster API controllers... 
      INFO Stopped controller: Cluster API              
      WARNING process cluster-api-provider-azure exited with error: signal: killed 
      INFO Stopped controller: azure infrastructure provider 
      INFO Stopped controller: azureaso infrastructure provider 
      INFO Shutting down local Cluster API control plane... 
      INFO Local Cluster API system has completed operation
      
      In openshift-install.log, creation of every machine failed with the error below:
      =================
      time="2024-09-20T02:17:07Z" level=debug msg="I0920 02:17:07.757980 1747698 recorder.go:104] \"failed to reconcile AzureMachine: failed to reconcile AzureMachine service virtualmachine: failed to get desired parameters for resource jimainstance01-h45wv-rg/jimainstance01-h45wv-bootstrap (service: virtualmachine): reconcile error that cannot be recovered occurred: failed to validate the memory capability: failed to parse string '218.75' as int64: strconv.ParseInt: parsing \\\"218.75\\\": invalid syntax. Object will not be requeued\" logger=\"events\" type=\"Warning\" object={\"kind\":\"AzureMachine\",\"namespace\":\"openshift-cluster-api-guests\",\"name\":\"jimainstance01-h45wv-bootstrap\",\"uid\":\"d67a2010-f489-44b4-9be9-88d7b136a45b\",\"apiVersion\":\"infrastructure.cluster.x-k8s.io/v1beta1\",\"resourceVersion\":\"1530\"} reason=\"ReconcileError\""
      ...
      time="2024-09-20T02:17:12Z" level=debug msg="Checking that machine jimainstance01-h45wv-bootstrap has provisioned..."
      time="2024-09-20T02:17:12Z" level=debug msg="Machine jimainstance01-h45wv-bootstrap has not yet provisioned: Failed"
      time="2024-09-20T02:17:12Z" level=debug msg="Checking that machine jimainstance01-h45wv-master-0 has provisioned..."
      time="2024-09-20T02:17:12Z" level=debug msg="Machine jimainstance01-h45wv-master-0 has not yet provisioned: Failed"
      time="2024-09-20T02:17:12Z" level=debug msg="Checking that machine jimainstance01-h45wv-master-1 has provisioned..."
      time="2024-09-20T02:17:12Z" level=debug msg="Machine jimainstance01-h45wv-master-1 has not yet provisioned: Failed"
      time="2024-09-20T02:17:12Z" level=debug msg="Checking that machine jimainstance01-h45wv-master-2 has provisioned..."
      time="2024-09-20T02:17:12Z" level=debug msg="Machine jimainstance01-h45wv-master-2 has not yet provisioned: Failed"
      ... 
      
      The same error also appears in .clusterapi_output/Machine-openshift-cluster-api-guests-jimainstance01-h45wv-bootstrap.yaml
      ===================
      $ yq-go r Machine-openshift-cluster-api-guests-jimainstance01-h45wv-bootstrap.yaml 'status'
      noderef: null
      nodeinfo: null
      lastupdated: "2024-09-20T02:17:07Z"
      failurereason: CreateError
      failuremessage: 'Failure detected from referenced resource infrastructure.cluster.x-k8s.io/v1beta1,
        Kind=AzureMachine with name "jimainstance01-h45wv-bootstrap": failed to reconcile
        AzureMachine service virtualmachine: failed to get desired parameters for resource
        jimainstance01-h45wv-rg/jimainstance01-h45wv-bootstrap (service: virtualmachine):
        reconcile error that cannot be recovered occurred: failed to validate the memory
        capability: failed to parse string ''218.75'' as int64: strconv.ParseInt: parsing
        "218.75": invalid syntax. Object will not be requeued'
      addresses: []
      phase: Failed
      certificatesexpirydate: null
      bootstrapready: false
      infrastructureready: false
      observedgeneration: 1
      conditions:
      - type: Ready
        status: "False"
        severity: Error
        lasttransitiontime: "2024-09-20T02:17:07Z"
        reason: Failed
        message: 0 of 2 completed
      - type: InfrastructureReady
        status: "False"
        severity: Error
        lasttransitiontime: "2024-09-20T02:17:07Z"
        reason: Failed
        message: 'virtualmachine failed to create or update. err: failed to get desired
          parameters for resource jimainstance01-h45wv-rg/jimainstance01-h45wv-bootstrap
          (service: virtualmachine): reconcile error that cannot be recovered occurred:
          failed to validate the memory capability: failed to parse string ''218.75'' as
          int64: strconv.ParseInt: parsing "218.75": invalid syntax. Object will not be
          requeued'
      - type: NodeHealthy
        status: "False"
        severity: Info
        lasttransitiontime: "2024-09-20T02:16:27Z"
        reason: WaitingForNodeRef
        message: ""
      
      
      From the error above, it appears that the provider cannot parse the memory capability of instance type Standard_M8-4ms, because the value is a decimal (218.75), not an integer.
      
      $ az vm list-skus --size Standard_M8-4ms  --location southcentralus | jq -r '.[].capabilities[] | select(.name=="MemoryGB")'
      {
        "name": "MemoryGB",
        "value": "218.75"
      }

      Version-Release number of selected component (if applicable):

      4.17.0-0.nightly-2024-09-16-082730

       

      How reproducible:

       Always

      Steps to Reproduce:

          1. Set the controlPlane type to Standard_M8-4ms in install-config.
          2. Create the cluster.
          

      Actual results:

          Installation failed

      Expected results:

          Installation succeeded

      Additional info:

          

              sdasu@redhat.com Sandhya Dasu
              jinyunma Jinyun Ma
              Jinyun Ma Jinyun Ma