Type: Bug
Resolution: Done
Priority: Normal
Fix Version/s: 4.20.0
Affects Version/s: 4.18, 4.19, 4.20
Component/s: Cloud Compute / Machine API Providers
Labels:
None

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
None
Regression:
None

Target Backport Versions:

4.18.z, 4.19.z
Target Version:

4.20.0
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status:
Done
Release Note Type:
Bug Fix
Release Note Text:

* Before this update, scaling large numbers of nodes was slow because scaling required reconciling each machine several times and each machine was reconciled individually. With this release, up to ten machines can be reconciled concurrently. This change improves the processing speed for machines during scaling. (link:https://issues.redhat.com/browse/OCPBUGS-59376[OCPBUGS-59376])

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

    In a recent hackathon with Amadeus, we found that scaling Nodes on GCP (from 0 to 400) was bottlenecked by the sequential processing of reconcile requests in the Machine API provider for GCP.

    Adding the ability to configure the number of concurrent reconciles, and then scaling the nodes with 10 reconciles running in parallel, significantly improved performance.
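The bottleneck described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the provider's code: the real change is in the Go-based Machine API provider for GCP, and `reconcile_all`, `fake_reconcile`, the 10 ms delay, and the machine names are all hypothetical, chosen for illustration. It pushes a list of machines through a bounded worker pool and records the peak number of reconciles in flight, which is 1 when processing is sequential and at most 10 with ten workers.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def reconcile_all(machines, max_concurrent_reconciles, reconcile):
    """Run reconcile(machine) for every machine using a bounded worker pool.

    Returns (results, peak), where peak is the highest number of
    reconciles that were in flight at the same time.
    """
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def tracked(machine):
        nonlocal in_flight, peak
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        try:
            return reconcile(machine)
        finally:
            with lock:
                in_flight -= 1

    # The pool size is the only knob: 1 mimics sequential processing.
    with ThreadPoolExecutor(max_workers=max_concurrent_reconciles) as pool:
        results = list(pool.map(tracked, machines))
    return results, peak


def fake_reconcile(machine):
    # Hypothetical stand-in for a slow cloud API round trip.
    time.sleep(0.01)
    return f"{machine}: reconciled"


if __name__ == "__main__":
    machines = [f"machine-{i}" for i in range(40)]
    for workers in (1, 10):
        start = time.perf_counter()
        _, peak = reconcile_all(machines, workers, fake_reconcile)
        elapsed = time.perf_counter() - start
        print(f"workers={workers} peak={peak} elapsed={elapsed:.2f}s")
```

Because each simulated reconcile spends its time waiting rather than computing, the run with ten workers should finish in roughly a tenth of the wall-clock time of the sequential run. Real reconciles are dominated by calls to the cloud API, which is presumably why the same pattern helps here, but the actual speedup depends on API latency and rate limits.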

Version-Release number of selected component (if applicable):

    4.20 and below

How reproducible:

    100%

Steps to Reproduce:

    1. Create an OCP cluster on GCP
    2. Scale several machinesets to a total of around 400 nodes
    3. Observe that the machines take approximately 20 minutes to join the cluster


The steps below were shared by Zhaohua Sun.

1. Set up a cluster with flexy-install. You can rebuild this job; just update INSTANCE_NAME_PREFIX and LAUNCHER_VARS, adding the following to LAUNCHER_VARS:
    vm_type_masters: 'n2-standard-16'
    vm_type_workers: 'n2-standard-2'

2. Create infra nodes. You can rebuild this job, updating BUILD_NUMBER with your flexy job ID.

3. Once the steps above are done, scale up the machinesets. As described in this bug, I scaled to 400 nodes in three rounds.
   First round:
    oc scale machineset zhsungcp-djlkm-worker-b --replicas 50
    oc scale machineset zhsungcp-djlkm-worker-c --replicas 50
    oc scale machineset zhsungcp-djlkm-worker-d --replicas 50
   Second round:
    oc scale machineset zhsungcp-djlkm-worker-b --replicas 100
    oc scale machineset zhsungcp-djlkm-worker-c --replicas 100
    oc scale machineset zhsungcp-djlkm-worker-d --replicas 100
   Third round:
    oc scale machineset zhsungcp-djlkm-worker-b --replicas 130
    oc scale machineset zhsungcp-djlkm-worker-c --replicas 130
    oc scale machineset zhsungcp-djlkm-worker-d --replicas 140
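
For anyone repeating the three scaling rounds with their own cluster, the commands can be generated rather than typed. The sketch below is a convenience only: the machineset names and replica counts are copied from the steps above, while `ROUNDS`, `scale_commands`, and `total_replicas` are hypothetical helper names. It prints the `oc scale` commands for each round and the total worker replicas requested, without running anything against a cluster.

```python
# Replica targets per round, copied from the reproduction steps.
# Replace the machineset names with the ones in your own cluster.
ROUNDS = [
    {
        "zhsungcp-djlkm-worker-b": 50,
        "zhsungcp-djlkm-worker-c": 50,
        "zhsungcp-djlkm-worker-d": 50,
    },
    {
        "zhsungcp-djlkm-worker-b": 100,
        "zhsungcp-djlkm-worker-c": 100,
        "zhsungcp-djlkm-worker-d": 100,
    },
    {
        "zhsungcp-djlkm-worker-b": 130,
        "zhsungcp-djlkm-worker-c": 130,
        "zhsungcp-djlkm-worker-d": 140,
    },
]


def scale_commands(round_targets):
    """Build one `oc scale` command string per machineset in a round."""
    return [
        f"oc scale machineset {name} --replicas {replicas}"
        for name, replicas in round_targets.items()
    ]


def total_replicas(round_targets):
    """Total replicas requested across all machinesets in a round."""
    return sum(round_targets.values())


if __name__ == "__main__":
    for number, targets in enumerate(ROUNDS, start=1):
        print(f"# round {number}: {total_replicas(targets)} replicas in total")
        for command in scale_commands(targets):
            print(command)
```

The per-round totals come to 150, 300, and 400 replicas, which matches the 400-node target in the description.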

Actual results:

    Nodes take a significant time to join the cluster

Expected results:

    Nodes should join the cluster quickly

Additional info:

is cloned by

OCPBUGS-59386 [release-4.19] GCP scaling is slow when scaling large volumes of nodes

Closed

is depended on by

OCPBUGS-59386 [release-4.19] GCP scaling is slow when scaling large volumes of nodes

Closed

links to

openshift/machine-api-operator#1390: OCPBUGS-59376: Enabled 10 concurrent reconciles on GCP

openshift/machine-api-provider-gcp#124: OCPBUGS-59376: Add max-concurrent-reconciles flag to machine actuator

Assignee: Joel Speed

Reporter: Joel Speed

Need Info From: None

Contributors: None

QA Contact: Meha Bhalodiya

Doc Contact: None

Votes: 0

Watchers: 5

Created: 2025/07/16 9:18 AM

Updated: 2025/10/21 4:42 AM

Resolved: 2025/10/21 4:42 AM
