  1. OpenShift Bugs
  2. OCPBUGS-30809

Failed spot VM machinesets in non-zonal Azure regions


    • No
    • CLOUD Sprint 250, CLOUD Sprint 251, CLOUD Sprint 252
    • 3
    • False

      This is a clone of issue OCPBUGS-29906, which is a clone of OCPBUGS-29152, which is a clone of OCPBUGS-29007, which is a clone of OCPBUGS-25940. The following is the description of the original issue:

      Description of problem:

      Machinesets that define providerSpec.value.spotVMOptions fail to create new spot VMs in Azure regions without Availability Zones.
      
      The azure-controller logs the error: Azure Spot Virtual Machine is not supported in Availability Set.
      
      In non-zonal regions, a new availabilitySet is created for each machineset, but this only works for regular (non-spot) VMs. Spot VMs and availabilitySets are incompatible, as the Microsoft documentation for this error states: "You need to choose to either use an Azure Spot Virtual Machine or use a VM in an availability set, you can't choose both."
      Source: https://learn.microsoft.com/en-us/azure/virtual-machines/error-codes-spot
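
      The incompatibility implies a simple rule: a machine that requests a spot VM must not be placed in an availability set. The Python sketch below illustrates that rule only. It is not the Azure machine controller's code, which this report does not show; the function name `should_use_availability_set` and the `region_has_zones` flag are hypothetical, and treating an empty `spotVMOptions` mapping as a spot request is an assumption. The only name taken from this report is providerSpec.value.spotVMOptions.

```python
def should_use_availability_set(provider_spec_value: dict, region_has_zones: bool) -> bool:
    """Decide whether a VM built from this provider spec may join an availability set.

    Hypothetical helper that mirrors the rule quoted from the Microsoft docs;
    it does not reproduce the real controller logic.
    """
    # Per the report, availability sets are only created in non-zonal regions.
    if region_has_zones:
        return False
    # Azure rejects spot VMs inside an availability set, so skip the set for them.
    # Assumption: a present, non-null spotVMOptions (even an empty mapping)
    # means the machine requests a spot VM.
    if provider_spec_value.get("spotVMOptions") is not None:
        return False
    return True


if __name__ == "__main__":
    # Spot machine in a non-zonal region: no availability set.
    print(should_use_availability_set({"spotVMOptions": {}}, region_has_zones=False))
    # Regular machine in a non-zonal region: availability set is fine.
    print(should_use_availability_set({}, region_has_zones=False))
```

      With a check of this kind, the spot machineset in this report would be created without an availabilitySet, while regular machinesets in the same region would keep theirs.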

      Version-Release number of selected component (if applicable):

          n/a

      How reproducible:

          Always

      Steps to Reproduce:

      1. In a cluster in an Azure region without Availability Zones (southindia in this example), follow the instructions to create a machineset that provisions spot VMs:
        https://docs.openshift.com/container-platform/4.12/machine_management/creating_machinesets/creating-machineset-azure.html#machineset-creating-non-guaranteed-instance_creating-machineset-azure
      
      2. The new machines enter the Failed phase:
      $ oc get machines -A
      NAMESPACE               NAME                                            PHASE     TYPE              REGION       ZONE   AGE
      openshift-machine-api   mabad-test-l5x58-worker-southindia-spot-c4qr5   Failed                                          7m17s
      openshift-machine-api   mabad-test-l5x58-worker-southindia-spot-dtzsn   Failed                                          7m17s
      openshift-machine-api   mabad-test-l5x58-worker-southindia-spot-tzrhw   Failed                                          7m28s
      
      
      3. Events on the failed machines show that creating spot VMs in an availabilitySet is rejected:
      Events:
        Type     Reason             Age                 From                           Message
        ----     ------             ----                ----                           -------
        Warning  FailedCreate       28s                 azure-controller               InvalidConfiguration: failed to reconcile machine "mabad-test-l5x58-worker-southindia-spot-dx78z": failed to create vm mabad-test-l5x58-worker-southindia-spot-dx78z: failure sending request for machine mabad-test-l5x58-worker-southindia-spot-dx78z: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="OperationNotAllowed" Message="Azure Spot Virtual Machine is not supported in Availability Set. For more information, see http://aka.ms/AzureSpot/errormessages."    
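
      To confirm that a cluster is hitting this specific bug, the two checks above can be scripted. The following Python sketch works on text already captured from `oc get machines -A` and from a machine's event message; it does not call `oc` itself. The helper names `failed_machines` and `is_spot_availability_set_failure` are hypothetical, and the parser assumes the NAMESPACE, NAME and PHASE columns are populated, as they are in the output shown in step 2.

```python
# Error text reported by Azure, copied from the event in step 3.
SPOT_AVSET_ERROR = "Azure Spot Virtual Machine is not supported in Availability Set"

# Two rows copied from the `oc get machines -A` output in step 2.
SAMPLE = """\
NAMESPACE               NAME                                            PHASE     TYPE              REGION       ZONE   AGE
openshift-machine-api   mabad-test-l5x58-worker-southindia-spot-c4qr5   Failed                                          7m17s
openshift-machine-api   mabad-test-l5x58-worker-southindia-spot-dtzsn   Failed                                          7m17s
"""


def failed_machines(oc_output: str) -> list:
    """Return the names of machines whose PHASE column reads Failed."""
    names = []
    for line in oc_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Assumed layout: NAMESPACE, NAME, PHASE come first; TYPE, REGION and
        # ZONE may be blank for machines that never got a VM.
        if len(fields) >= 3 and fields[2] == "Failed":
            names.append(fields[1])
    return names


def is_spot_availability_set_failure(event_message: str) -> bool:
    """Return True if the event message carries the spot/availability-set error."""
    return SPOT_AVSET_ERROR in event_message


if __name__ == "__main__":
    for name in failed_machines(SAMPLE):
        print(name)
```

      A substring match on the error text is deliberately loose, because the surrounding wrapper text in the event may differ between releases.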

      Actual results:

           Machines stay in the Failed state and no nodes are created.

      Expected results:

           Machines are created and the new spot VM nodes are added to the cluster.

      Additional info:

          This problem was identified from a customer alert in an ARO cluster. ICM for reference (requires a b- MSFT account): https://portal.microsofticm.com/imp/v3/incidents/incident/455463992/summary

              rmanak@redhat.com Radek Manak
              openshift-crt-jira-prow OpenShift Prow Bot
              Zhaohua Sun Zhaohua Sun