[OCPBUGS-45663] [azure] Worker machines get Failed state if region has no availability zones or availability set fault domains - Red Hat Issue Tracker

Type: Bug
Resolution: Unresolved
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.15
Component/s: Cloud Compute / Machine API Providers
Labels:

Severity:
Moderate
Regression:
No
Blocked:
False
Blocked Reason:

None

Release Note Text:

Previously, the availability set fault domain count was hardcoded to 2. This value works in most Azure regions because their fault domain counts are typically at least 2, but it failed in the centraluseuap and eastusstg regions. With this change, the availability set fault domain count is set dynamically for the region instead of being hardcoded to 2.

Release Note Type:
Bug Fix
Release Note Status:
In Progress
Target Version:

4.19.0
Target Backport Versions:

4.15.z, 4.17.z, 4.16.z, 4.18.0

Description of problem:

In Azure, two regions (centraluseuap and eastusstg) don't have availability zones or availability set fault domains. Both are test regions, and one of them is in use by the ARO team.

The Machine API provider appears to hardcode an availability set fault domain count of 2 when creating the machineset: https://github.com/openshift/machine-api-provider-azure/blob/main/pkg/cloud/azure/services/availabilitysets/availabilitysets.go#L32. If the target region does not support a fault domain count of at least 2, the install fails because worker nodes get a Failed status.

This is the error from Azure, reported by the machine API:

`The specified fault domain count 2 must fall in the range 1 to 1.`

Because of this, the regions are not able to support OCP clusters.
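
To illustrate why the hardcoded value is rejected, here is a minimal sketch of the range check implied by the error message above. It is not Azure's or the provider's actual code: `validateFaultDomainCount`, `hardcodedFaultDomainCount`, and the `regionMax` parameter are hypothetical names introduced for illustration only.

```go
package main

import "fmt"

// hardcodedFaultDomainCount mirrors the value of 2 described in this report.
// The constant name is hypothetical.
const hardcodedFaultDomainCount = 2

// validateFaultDomainCount is a hypothetical stand-in for the server-side
// check implied by the Azure error message: the requested count must fall
// within 1..regionMax, where regionMax is the maximum the region supports.
func validateFaultDomainCount(requested, regionMax int) error {
	if requested < 1 || requested > regionMax {
		return fmt.Errorf(
			"specified fault domain count %d must fall in the range 1 to %d",
			requested, regionMax,
		)
	}
	return nil
}
```

Under these assumptions, `validateFaultDomainCount(hardcodedFaultDomainCount, 1)` returns an error for a region whose maximum is 1, while the same call with a maximum of 2 or 3 succeeds.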

Version-Release number of selected component (if applicable):

    Observed in 4.15

How reproducible:

    Very

Steps to Reproduce:

    1. Attempt creation of an OCP cluster in centraluseuap or eastusstg regions
    2. Observe worker machine failures

Actual results:

    Worker machines get a failed state

Expected results:

    Worker machines are able to start. I am guessing that this would happen by setting the availability set fault domain count dynamically rather than hardcoding it to 2, a value that currently works only because the fault domain counts in most Azure regions are at least 2.

Upstream, it looks like this is set dynamically by querying the number of fault domains in a region: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/40f0fabc264388de02a88de7fbe400c21d22e7e2/azure/services/availabilitysets/spec.go#L70
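
A minimal sketch of what dynamic selection could look like, assuming the region's availability set SKU exposes its maximum fault domain count as a string capability. The capability name `MaximumPlatformFaultDomainCount`, the plain `map[string]string` input, and the function `faultDomainCount` are assumptions made for this example; this is neither the upstream implementation nor the fix that was merged.

```go
package main

import (
	"fmt"
	"strconv"
)

// maxFaultDomainCapability is the capability name assumed to carry the
// region's maximum fault domain count for availability sets.
const maxFaultDomainCapability = "MaximumPlatformFaultDomainCount"

// faultDomainCount derives the fault domain count to request from a set of
// SKU capabilities (name -> value), instead of using a hardcoded 2.
// The capabilities map is a hypothetical simplification of whatever a real
// SKU lookup would return.
func faultDomainCount(capabilities map[string]string) (int32, error) {
	raw, ok := capabilities[maxFaultDomainCapability]
	if !ok {
		return 0, fmt.Errorf("capability %q not found", maxFaultDomainCapability)
	}
	count, err := strconv.ParseInt(raw, 10, 32)
	if err != nil {
		return 0, fmt.Errorf("parsing capability %q value %q: %w",
			maxFaultDomainCapability, raw, err)
	}
	if count < 1 {
		return 0, fmt.Errorf("capability %q has invalid value %d",
			maxFaultDomainCapability, count)
	}
	return int32(count), nil
}
```

With this approach, a region that reports a maximum of 1 yields a request for 1 fault domain, which falls within the range stated in the error message, while regions that report 2 or 3 use their own maximum.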

Additional info:

blocks

OCPBUGS-48659 [azure] Worker machines get Failed state if region has no availability zones or availability set fault domains

Closed

is cloned by

OCPBUGS-48659 [azure] Worker machines get Failed state if region has no availability zones or availability set fault domains

Closed

links to

openshift/machine-api-provider-azure#124: OCPBUGS-45663: dynamically setting the amount of fault domains

RHEA-2024:11038 OpenShift Container Platform 4.19.z bug fix update

Assignee: Zhaohua Sun

Reporter: Caden Marchese

QA Contact: Zhaohua Sun

Votes: 0

Watchers: 8

Created: 2024/12/05 2:23 PM

Updated: 2025/04/15 1:50 AM
