OCPBUGS-63729

CentralUSEUAP worker machine creation fails with error on platformUpdateDomainCount

    • Quality / Stability / Reliability
    • False
    • None
    • Important
    • Yes
    • None
    • None
    • Done
    • Bug Fix
      Before this update, the Azure Machine API provider incorrectly attempted to use a default `platformUpdateDomainCount` value of `5`, even in specific regions, such as CentralUSEUAP, that are restricted to a single fault domain. As a consequence, machine creation failed for all node types in these affected regions because Azure supports only one update domain when the fault domain count is set to `1`. With this release, the logic is updated to explicitly set the `platformUpdateDomainCount` value to `1` whenever a single fault domain is detected. As a result, Machine Availability Sets are created with valid parameter combinations, which allows nodes to successfully provision in Azure regions that use a single fault domain. (link:https://issues.redhat.com/browse/OCPBUGS-63729[OCPBUGS-63729])

      Description of problem:

          In Azure, in CentralUSEUAP, when creating an OCP cluster (this applies to ARO too), worker machines fail to be created. According to the error message, the underlying Availability Set creation fails with the error
      AvailabilitySet "<somethingsomething>" with platformFaultDomainCount = 1 can only support platformUpdateDomainCount = 1
      
      This error echoes some of the things in https://issues.redhat.com/browse/OCPBUGS-45663. The way I understand the MAPI code 
      https://github.com/openshift/machine-api-provider-azure/blob/5a6516188d4ec33734e1a069da2acc7a469657dc/pkg/cloud/azure/services/availabilitysets/availabilitysets.go#L48
      
      is that, to fix OCPBUGS-45663, platformFaultDomainCount is now computed dynamically and resolves to 1 for that special region. But platformUpdateDomainCount is hardcoded to 5, which appears to be incompatible with platformFaultDomainCount set to 1 (Azure seems to expect platformUpdateDomainCount to be 1 in that case). 
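
      To illustrate the mismatch described above, here is a minimal Go sketch. It is not the provider's actual code: `updateDomainCountFor` and `validateDomainCounts` are hypothetical names, and the validation models only the single constraint quoted in the Azure error message, nothing more.

```go
package main

import "fmt"

// defaultUpdateDomainCount mirrors the hardcoded value of 5 described above.
const defaultUpdateDomainCount int32 = 5

// updateDomainCountFor is a hypothetical helper (an assumption, not the
// provider's real function). It returns 1 when only a single fault domain
// is available and falls back to the default otherwise.
func updateDomainCountFor(faultDomainCount int32) int32 {
	if faultDomainCount == 1 {
		return 1
	}
	return defaultUpdateDomainCount
}

// validateDomainCounts models only the constraint from the error message:
// a fault domain count of 1 requires an update domain count of 1.
func validateDomainCounts(faultDomainCount, updateDomainCount int32) error {
	if faultDomainCount == 1 && updateDomainCount != 1 {
		return fmt.Errorf(
			"platformFaultDomainCount = 1 can only support platformUpdateDomainCount = 1, got %d",
			updateDomainCount)
	}
	return nil
}

func main() {
	// The combination reported in this issue: 1 fault domain, 5 update domains.
	fmt.Println("hardcoded:", validateDomainCounts(1, defaultUpdateDomainCount))

	// The same check with the update domain count derived from the fault domain count.
	for _, fd := range []int32{1, 2, 3} {
		ud := updateDomainCountFor(fd)
		fmt.Printf("faultDomains=%d updateDomains=%d err=%v\n",
			fd, ud, validateDomainCounts(fd, ud))
	}
}
```

      Deriving the update domain count from the fault domain count keeps the two values consistent in one place, instead of relying on a default that only holds for regions with more than one fault domain.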

      Version-Release number of selected component (if applicable):

          Observed on 4.16, 4.17, 4.18

      How reproducible:

      systematic    

      Steps to Reproduce:

          1. Create an OCP cluster on Azure (or an ARO cluster) in CentralUSEUAP, with any version that contains the fix for https://issues.redhat.com/browse/OCPBUGS-45663.
          2. Worker Machine creation fails.
          

      Actual results:

          MAPI does not create the underlying worker VM. The error reads: AvailabilitySet "<somethingsomething>" with platformFaultDomainCount = 1 can only support platformUpdateDomainCount = 1

      Expected results:

      Worker VMs are created and the machines reach the Running phase.

      Additional info:

      I am not certain whether this is something that changed recently on the Azure side or whether the incompatibility between those two parameters has always been there.

              rmanak@redhat.com Radek Manak
              gvanderp@redhat.com Ghislain VANDERPOTTE (Inactive)
              Zhaohua Sun Zhaohua Sun
              Christophe LACOMBE