OCPBUGS-63729

CentralUSEUAP worker machine creation fails with error on platformUpdateDomainCount

    • Quality / Stability / Reliability
    • False
    • None
    • Important
    • Yes
    • None
    • None
    • In Progress
    • Bug Fix
    • Before this update, the Azure Machine API provider incorrectly attempted to use a default `platformUpdateDomainCount` of 5, even in specific regions, such as CentralUSEUAP, that are restricted to a single fault domain. This caused machine creation to fail for all node types in these affected regions because Azure supports only one update domain when the fault domain count is set to one. With this release, the logic has been updated to explicitly set the `platformUpdateDomainCount` to 1 whenever a single fault domain is detected. As a result, Machine Availability Sets are now created with valid parameter combinations, allowing nodes to successfully provision in Azure regions that utilize a single fault domain. (link:https://issues.redhat.com/browse/OCPBUGS-63729[OCPBUGS-63729])

      Description of problem:

          In the Azure region CentralUSEUAP, when creating an OCP cluster (this also applies to ARO), worker machines fail to be created. Judging by the error message, the underlying Availability Set creation fails with the following error:
      AvailabilitySet "<somethingsomething>" with platformFaultDomainCount = 1 can only support platformUpdateDomainCount = 1
      
      This error is related to https://issues.redhat.com/browse/OCPBUGS-45663. My understanding of the MAPI code 
      https://github.com/openshift/machine-api-provider-azure/blob/5a6516188d4ec33734e1a069da2acc7a469657dc/pkg/cloud/azure/services/availabilitysets/availabilitysets.go#L48
      
      is that, to fix OCPBUGS-45663, platformFaultDomainCount is now computed dynamically and resolves to 1 for that special region, while platformUpdateDomainCount is still hardcoded to 5. That value appears to be incompatible with platformFaultDomainCount set to 1: Azure seems to accept only platformUpdateDomainCount = 1 in that case.
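
      The adjustment this points to can be sketched as follows. This is a minimal illustration only, not the provider's actual code: the helper name `updateDomainCountFor` and the constant name `defaultUpdateDomainCount` are hypothetical, and the default of 5 is taken from the description above.

```go
package main

import "fmt"

// defaultUpdateDomainCount mirrors the hardcoded value of 5 described in this report.
// The constant name is an assumption made for this sketch.
const defaultUpdateDomainCount int32 = 5

// updateDomainCountFor is a hypothetical helper (not the provider's real function).
// It returns 1 when the region exposes a single fault domain, because the reported
// Azure error states that platformFaultDomainCount = 1 only supports
// platformUpdateDomainCount = 1. Otherwise it keeps the default.
func updateDomainCountFor(faultDomainCount int32) int32 {
	if faultDomainCount == 1 {
		return 1
	}
	return defaultUpdateDomainCount
}

func main() {
	// Print the update domain count chosen for a few fault domain counts.
	for _, fd := range []int32{1, 2, 3} {
		fmt.Printf("platformFaultDomainCount=%d -> platformUpdateDomainCount=%d\n",
			fd, updateDomainCountFor(fd))
	}
}
```

      Deriving the update domain count from the already computed fault domain count keeps the two parameters consistent without adding any region-specific special case.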

      Version-Release number of selected component (if applicable):

          Observed on 4.16, 4.17, and 4.18

      How reproducible:

      Always (systematic)

      Steps to Reproduce:

          1. In CentralUSEUAP, create an OCP cluster on Azure (or an ARO cluster) with any version that contains the fix for https://issues.redhat.com/browse/OCPBUGS-45663.
          2. Worker machine creation fails.

      Actual results:

          MAPI does not create the underlying worker VM. The reported error is: AvailabilitySet "<somethingsomething>" with platformFaultDomainCount = 1 can only support platformUpdateDomainCount = 1
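
      As an illustration of the constraint behind this error, here is a hedged sketch of a client-side pre-flight check. The function `checkDomainCounts` is hypothetical and is not part of MAPI or the Azure SDK. It only encodes the rule stated in the error message above and says nothing about how Azure validates the request internally.

```go
package main

import "fmt"

// checkDomainCounts is a hypothetical pre-flight check written for this sketch.
// It rejects the combination that the reported error describes: a single fault
// domain together with more than one update domain.
func checkDomainCounts(name string, faultDomains, updateDomains int32) error {
	if faultDomains == 1 && updateDomains != 1 {
		return fmt.Errorf(
			"AvailabilitySet %q with platformFaultDomainCount = 1 can only support platformUpdateDomainCount = 1 (got %d)",
			name, updateDomains)
	}
	return nil
}

func main() {
	// The failing combination from this report: 1 fault domain, 5 update domains.
	fmt.Println(checkDomainCounts("example-as", 1, 5))
	// The combination that the error message says is supported.
	fmt.Println(checkDomainCounts("example-as", 1, 1))
}
```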

      Expected results:

      Worker VMs are created and the machines reach the Running phase.

      Additional info:

      I am not certain whether this changed recently on the Azure side or whether the incompatibility between platformFaultDomainCount and platformUpdateDomainCount has always been there.

              rmanak@redhat.com Radek Manak
              gvanderp@redhat.com Ghislain VANDERPOTTE (Inactive)
              Christophe LACOMBE
              None
              Zhaohua Sun Zhaohua Sun
              None