Uploaded image for project: 'Red Hat OpenStack Services on OpenShift'
  1. Red Hat OpenStack Services on OpenShift
  2. OSPRH-37

Spread allocation candidates between hosts in Placement GET /allocation_candidates query


    • Spread allocation candidates over hosts
    • False
    • False
    • Committed
    • No Docs Impact
    • To Do
    • RHOSSTRAT-222 - DFG:Compute Wishlist
    •  openstack-nova-27.5.2-18.0.20250218144742.el9ost
    • Committed
    • Committed
    • 0% To Do, 0% In Progress, 100% Done
    • Hide
      .Allocation candidates are spread between hosts in Placement GET /allocation_candidates query

      In a deployment with wide and symmetric provider trees, for example, where there are multiple children providers under the same root having inventory from the same resource class, if the allocation candidate request asks for resources from those children resource providers in multiple request groups the number of possible allocation candidates grows rapidly. The Placement service generates these candidates fully before applying the limit parameter provided in the allocation candidate query. The Placement service takes excessive time and memory to generate this amount of allocation candidates and the client might time out.

      To avoid request timeout or out of memory events, a new `[placement]max_allocation_candidates` configuration option is applied during the candidate generation process. By default, the `[placement]max_allocation_candidates` option is set to -1, which means there is no limit, preserving the old behavior. Edit the value of this configuration option in affected deployments based on the memory available for the Placement service and the timeout setting of the clients. A suggested value is `100000`.

      If the number of generated allocation candidates is limited by the `[placement]max_allocation_candidates` configuration option, you can get candidates from a limited set of root providers, for example, Compute nodes, as the Placement service uses a depth-first strategy, generating all candidates from the first root before considering the next one. To avoid this, use the `[placement]allocation_candidates_generation_strategy` configuration option, which has two possible values:

      * `depth-first`: generates all candidates from the first viable root provider before moving to the next. This is the default and triggers the legacy behavior.
      * `breadth-first`: generates candidates from viable roots in a round-robin fashion, creating one candidate from each viable root before creating the second candidate from the first root. This is the new behavior.

      In a deployment where `[placement]max_allocation_candidates` is configured to a positive number, set `[placement]allocation_candidates_generation_strategy` to `breadth-first`.
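      The difference between the two strategies can be sketched with a small round-robin merge. This is an illustration of the generation order only, under assumed per-root candidate lists; it is not the Placement implementation:

```python
from itertools import chain, islice

def depth_first(per_root, limit):
    # Exhaust each root's candidate list before moving to the next root.
    return list(islice(chain.from_iterable(per_root), limit))

def breadth_first(per_root, limit):
    # Round-robin: take one candidate from each viable root per pass.
    iters = [iter(candidates) for candidates in per_root]
    out = []
    while iters and len(out) < limit:
        alive = []
        for it in iters:
            candidate = next(it, None)
            if candidate is None:
                continue  # this root is exhausted
            out.append(candidate)
            alive.append(it)
            if len(out) == limit:
                break
        iters = alive
    return out

# Three roots (compute nodes), three candidates each.
roots = [[f"root{r}-c{i}" for i in range(3)] for r in range(3)]
print(depth_first(roots, 4))   # ['root0-c0', 'root0-c1', 'root0-c2', 'root1-c0']
print(breadth_first(roots, 4)) # ['root0-c0', 'root1-c0', 'root2-c0', 'root0-c1']
```

      With a limit of 4, depth-first returns three candidates from the first host and only one from the second, while breadth-first covers all three hosts before revisiting the first.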
    • Bug Fix
    • Done
    • Release Notes
    • Red Hat OpenStack Services on OpenShift (formerly Red Hat OpenStack Platform)

      Problem

      When a compute host Resource Provider (RP) has multiple similar child RPs, for example multiple PGPU RPs or multiple PCI RPs, the possible number of allocation candidates for a given host increases rapidly. Placement eagerly evaluates all the possible candidates today, leading to timeouts in the /allocation_candidates query.

      The Placement GET /allocation_candidates API returns all possible allocation candidates for a given host. If the request has multiple groups requesting the same resources and traits, and such a group can be fulfilled by multiple providers from the same tree, then Placement generates every possible group-to-RP combination. In the worst case this is an exponential algorithm.

      With 4 RPs, each providing one device, and 3 requested devices, the first group can fit on 4 RPs, the next on 3, and the last on 2, so Placement returns 4*3*2 = 24 candidates. With 8 RPs and 4 devices this is 8*7*6*5 = 1680. With 8 RPs and 8 devices it is 8! = 40320.

      Note that if each RP alone can provide resources for every group, then the candidate generation logic always produces len(RPs)^len(groups) candidates. For 8 RPs and 4 groups that is 4096 candidates; for 8 RPs and 8 groups, ~16M candidates.
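      These growth rates are easy to verify numerically. The helper names below are illustrative, not Placement code:

```python
from math import perm

# Each group consumes a whole device, so the groups must land on distinct
# RPs: ordered choices of g RPs out of n, i.e. the falling factorial n!/(n-g)!.
def distinct_rp_candidates(n_rps, n_groups):
    return perm(n_rps, n_groups)

# Each RP alone can satisfy every group, so any group may pick any RP: n^g.
def shared_rp_candidates(n_rps, n_groups):
    return n_rps ** n_groups

print(distinct_rp_candidates(4, 3))  # 4*3*2 = 24
print(distinct_rp_candidates(8, 4))  # 8*7*6*5 = 1680
print(distinct_rp_candidates(8, 8))  # 8! = 40320
print(shared_rp_candidates(8, 4))    # 8^4 = 4096
print(shared_rp_candidates(8, 8))    # 8^8 = 16777216
```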

      Proposal

      Improve the allocation candidate generation logic to

      1. spread the candidates between hosts instead of generating all candidates for a given host first,
      2. apply the requested candidate number limit during the candidate generation instead of after all the candidates are generated,
      3. long term, possibly add more granular limits to the /allocation_candidates API, for example, to limit the number of candidates returned per root provider (compute node).
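      Point 2 can be illustrated with a lazy generator: the limit is enforced while candidates are produced, so the full combination space is never materialized. This is a hypothetical sketch, not the actual Placement implementation:

```python
from itertools import islice, product

def generate_candidates(rps_per_group):
    # Lazily yield one group -> RP assignment at a time instead of
    # materializing the full cross product up front.
    for combo in product(*rps_per_group):
        if len(set(combo)) == len(combo):  # each RP serves only one group
            yield combo

# 8 RPs, 3 groups: 8*7*6 = 336 possible candidates exist, but with the
# limit applied during generation only the first 5 are ever produced.
rps = [f"rp{i}" for i in range(8)]
limited = list(islice(generate_candidates([rps] * 3), 5))
print(len(limited))  # 5
```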

      Note that there is an upstream functional test proposed that demonstrates the issue.

      Solution proposal

      See the proposal and reasoning in the related spike OSPRH-11950

      Implementation

      See the patches https://review.opendev.org/q/topic:%22bug/2070257%22

      The release note quoted below summarizes the change, from https://review.opendev.org/c/openstack/placement/+/936832/7/releasenotes/notes/bug-2070257-allocation-candidates-generation-limit-and-strategy.yaml-e73796898163fb55.yaml:

      In a deployment with wide and symmetric provider trees, i.e. where there
      are multiple children providers under the same root having inventory from
      the same resource class (e.g. in case of nova's mdev GPU or PCI in
      Placement features) if the allocation candidate request asks for resources
      from those children RPs in multiple request groups the number of possible
      allocation candidates grows rapidly.
      E.g.:

      • 1 root, 8 child RPs with 1 unit of resource each
        a_c requests 6 groups with 1 unit of resource each
        => 8*7*6*5*4*3=20160 possible candidates
      • 1 root, 8 child RPs with 6 unit of resources each
        a_c requests 6 groups with 6 unit of resources each
        => 8^6=262144 possible candidates

      Placement generates these candidates fully before applying the limit
      parameter provided in the allocation candidate query, to be able to do a
      random sampling if ``[placement]randomize_allocation_candidates`` is True.

      Placement takes excessive time and memory to generate this amount of
      allocation candidates, and the client might time out waiting for the
      response, or the Placement API service runs out of memory and crashes.

      To avoid request timeout or out of memory events, a new
      ``[placement]max_allocation_candidates`` config option is implemented. This
      limit is applied during the candidate generation process, not afterwards
      like the request limit, so this new option can be used to bound the
      runtime and memory consumption of the Placement API service.

      The new config option defaults to ``-1``, meaning no limit, to keep the
      legacy behavior. We suggest tuning this config in the affected
      deployments based on the memory available for the Placement service and the
      timeout setting of the clients. A good initial value could be around
      ``100000``.

      If the number of generated allocation candidates is limited by the
      ``[placement]max_allocation_candidates`` config option, then it is possible
      to get candidates from a limited set of root providers (e.g. compute
      nodes), as Placement uses a depth-first strategy, i.e. generating all
      candidates from the first root before considering the next one. To avoid
      this issue a new config option
      ``[placement]allocation_candidates_generation_strategy`` is introduced
      with two possible values:

      • ``depth-first``, generates all candidates from the first viable root
        provider before moving to the next. This is the default and triggers
        the old behavior.
      • ``breadth-first``, generates candidates from viable roots in a
        round-robin fashion, creating one candidate from each viable root
        before creating the second candidate from the first root. This is the
        new behavior.

      In a deployment where ``[placement]max_allocation_candidates`` is
      configured to a positive number, we recommend setting
      ``[placement]allocation_candidates_generation_strategy`` to
      ``breadth-first``.

      Performance results

      https://docs.google.com/spreadsheets/d/179hQJ2e40d7MGc6UNVoACGMai2kXcY-52qEKAGrtyiw/edit?usp=sharing

              rh-ee-bgibizer Balazs Gibizer
              rhos-workloads-compute
