OCPBUGS-65708

CentralUSEUAP worker machine creation fails with error on platformUpdateDomainCount

      Cause: The Azure Machine API provider used the default platformUpdateDomainCount of 5 even in regions that are restricted to a single fault domain.

      Consequence: Machine creation fails for all node types in the affected region (CentralUSEUAP) because Azure supports only 1 update domain when the fault domain count is 1.

      Fix: The logic was updated to explicitly set platformUpdateDomainCount to 1 whenever platformFaultDomainCount is determined to be 1.

      Result: Availability Sets are created with valid parameter combinations, allowing machines to provision successfully in Azure regions with a single fault domain.
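
      The fix described above can be sketched roughly as follows. This is a minimal illustration, not the provider's actual code: the helper name updateDomainCountFor and the constant defaultUpdateDomainCount are hypothetical, and only the default of 5 and the "1 when the fault domain count is 1" rule come from this issue.

```go
package main

import "fmt"

// defaultUpdateDomainCount mirrors the default of 5 mentioned in this issue.
// The constant name is hypothetical.
const defaultUpdateDomainCount int32 = 5

// updateDomainCountFor is a hypothetical helper that picks the
// platformUpdateDomainCount for an Availability Set. When the fault domain
// count is 1, it returns 1; otherwise it falls back to the default.
func updateDomainCountFor(faultDomainCount int32) int32 {
	if faultDomainCount == 1 {
		return 1
	}
	return defaultUpdateDomainCount
}

func main() {
	// Print the chosen update domain count for a few fault domain counts.
	for _, fd := range []int32{1, 2, 3} {
		fmt.Printf("platformFaultDomainCount=%d -> platformUpdateDomainCount=%d\n",
			fd, updateDomainCountFor(fd))
	}
}
```

      Deriving the update domain count from the fault domain count keeps the two values consistent in one place, instead of relying on a fixed default that is only valid in some regions.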

      This is a clone of issue OCPBUGS-63729. The following is the description of the original issue:

      Description of problem:

          In Azure, in the CentralUSEUAP region, when creating an OCP cluster (this applies to ARO too), worker Machines fail to be created. Judging by the error message, the underlying Availability Set creation fails with:
      AvailabilitySet "<somethingsomething>" with platformFaultDomainCount = 1 can only support platformUpdateDomainCount = 1

      This error echoes some of what is described in https://issues.redhat.com/browse/OCPBUGS-45663. The way I understand the MAPI code
      https://github.com/openshift/machine-api-provider-azure/blob/5a6516188d4ec33734e1a069da2acc7a469657dc/pkg/cloud/azure/services/availabilitysets/availabilitysets.go#L48

      is that, to fix OCPBUGS-45663, platformFaultDomainCount is now computed dynamically and resolves to 1 for that special region. But platformUpdateDomainCount is hardcoded to 5, which appears to be incompatible with platformFaultDomainCount set to 1 (Azure seems to expect platformUpdateDomainCount to be 1 in that case).
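
      To make the mismatch concrete, here is a small sketch of a client-side check that models only the single constraint quoted in the error message above. The function validateDomainCounts is hypothetical; it is not part of the MAPI provider or the Azure SDK, and Azure may enforce other rules that are not modeled here.

```go
package main

import "fmt"

// validateDomainCounts is a hypothetical check that mimics the one rule
// reported in this issue: an Availability Set with platformFaultDomainCount = 1
// can only use platformUpdateDomainCount = 1.
func validateDomainCounts(name string, faultDomains, updateDomains int32) error {
	if faultDomains == 1 && updateDomains != 1 {
		return fmt.Errorf(
			"AvailabilitySet %q with platformFaultDomainCount = 1 can only support platformUpdateDomainCount = 1 (got %d)",
			name, updateDomains)
	}
	return nil
}

func main() {
	// Combination described in this issue: computed fault domain count of 1
	// together with the hardcoded update domain count of 5.
	if err := validateDomainCounts("example-as", 1, 5); err != nil {
		fmt.Println("rejected:", err)
	}

	// Combination that satisfies the constraint.
	if err := validateDomainCounts("example-as", 1, 1); err == nil {
		fmt.Println("accepted: fault domains = 1, update domains = 1")
	}
}
```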

      Version-Release number of selected component (if applicable):

          Observed on 4.16, 4.17, and 4.18

      How reproducible:

      Always (systematic)

      Steps to Reproduce:

          1. Create an OCP cluster on Azure (or an ARO cluster) in CentralUSEUAP, using any version that contains the fix for https://issues.redhat.com/browse/OCPBUGS-45663.
          2. Observe that worker Machine creation fails.

      Actual results:

          MAPI does not create the underlying worker VM, and an error is reported: AvailabilitySet "<somethingsomething>" with platformFaultDomainCount = 1 can only support platformUpdateDomainCount = 1

      Expected results:

      Worker VMs are created and the Machines reach the Running state.

      Additional info:

      I am not certain whether this is something that changed recently on the Azure side or whether the incompatibility between platformFaultDomainCount and platformUpdateDomainCount has always been there.

              rmanak@redhat.com Radek Manak
              gvanderp@redhat.com Ghislain VANDERPOTTE
              Zhaohua Sun Zhaohua Sun