Uploaded image for project: 'Red Hat OpenStack Services on OpenShift'
  1. Red Hat OpenStack Services on OpenShift
  2. OSPRH-37

Spread allocation candidates between hosts in Placement GET /allocation_candidates query


    • Spread allocation candidates over hosts
    • False
    • False
    • Committed
    • No Docs Impact
    • To Do
    • RHOSSTRAT-222 - DFG:Compute Wishlist
    •  openstack-nova-27.5.2-18.0.20250218144742.el9ost
    • Committed
    • Committed
    • 0% To Do, 0% In Progress, 100% Done
    • Hide
      .Allocation candidates are spread between hosts in Placement GET /allocation_candidates query

      In a deployment with wide and symmetric provider trees, for example, where there are multiple children providers under the same root having inventory from the same resource class, if the allocation candidate request asks for resources from those children resource providers in multiple request groups the number of possible allocation candidates grows rapidly. The Placement service generates these candidates fully before applying the limit parameter provided in the allocation candidate query. The Placement service takes excessive time and memory to generate this amount of allocation candidates and the client might time out.

      To avoid request timeout or out of memory events, a new `[placement]max_allocation_candidates` configuration option is applied during the candidate generation process. By default, the `[placement]max_allocation_candidates` option is set to -1, which means there is no limit, preserving the old behavior. Edit the value of this configuration option in affected deployments based on the memory available for the Placement service and the timeout setting of the clients. A suggested value is `100000`.

      If the number of generated allocation candidates is limited by the `[placement]max_allocation_candidates` configuration option, you can get candidates from a limited set of root providers, for example, Compute nodes, as the Placement service uses a depth-first strategy, generating all candidates from the first root before considering the next one. To avoid this, use the `[placement]allocation_candidates_generation_strategy` configuration option, which has two possible values:

      * `depth-first`: generates all candidates from the first viable root provider before moving to the next. This is the default and triggers the legacy behavior.
      * `breadth-first`: generates candidates from viable roots in a round-robin fashion, creating one candidate from each viable root before creating the second candidate from the first root. This is the new behavior.

      In a deployment where `[placement]max_allocation_candidates` is configured to a positive number, set `[placement]allocation_candidates_generation_strategy` to `breadth-first`.
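      The difference between the two strategies can be sketched with a small round-robin merge. This is an illustration of the generation order only, under assumed per-root candidate lists; it is not the Placement implementation:

```python
from itertools import chain, islice

def depth_first(per_root, limit):
    # Exhaust each root's candidate list before moving to the next root.
    return list(islice(chain.from_iterable(per_root), limit))

def breadth_first(per_root, limit):
    # Round-robin: take one candidate from each viable root per pass.
    iters = [iter(candidates) for candidates in per_root]
    out = []
    while iters and len(out) < limit:
        alive = []
        for it in iters:
            candidate = next(it, None)
            if candidate is None:
                continue  # this root is exhausted
            out.append(candidate)
            alive.append(it)
            if len(out) == limit:
                break
        iters = alive
    return out

# Three roots (compute nodes), three candidates each.
roots = [[f"root{r}-c{i}" for i in range(3)] for r in range(3)]
print(depth_first(roots, 4))   # ['root0-c0', 'root0-c1', 'root0-c2', 'root1-c0']
print(breadth_first(roots, 4)) # ['root0-c0', 'root1-c0', 'root2-c0', 'root0-c1']
```

      With a limit of 4, depth-first returns three candidates from the first host and only one from the second, while breadth-first covers all three hosts before revisiting the first.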
    • Bug Fix
    • Done
    • Release Notes
    • Red Hat OpenStack Services on OpenShift (formerly Red Hat OpenStack Platform)

      Problem

      When a compute host Resource Provider (RP) has multiple similar child RPs, for example multiple PGPU RPs or multiple PCI RPs, the possible number of allocation candidates for a given host increases rapidly. Placement eagerly evaluates all the possible candidates today, leading to timeouts in the /allocation_candidates query.

      The Placement GET /allocation_candidates API returns all possible allocation candidates for a given host. If the request has multiple groups requesting the same resources and traits, and such a group can be fulfilled by multiple providers from the same tree, then Placement generates every possible group-to-RP combination. In the worst case this is an exponential algorithm.

      With 4 RPs, each providing one device, and 3 requested devices, the first group can fit on 4 RPs, the next on 3, and the last on 2, so Placement returns 4*3*2 = 24 candidates. With 8 RPs and 4 devices this is 8*7*6*5 = 1680. With 8 RPs and 8 devices it is 8! = 40320.

      Note that if each RP alone can provide resources for every group, then the candidate generation logic always produces len(RPs)^len(groups) candidates. For 8 RPs and 4 groups that is 4096 candidates; for 8 RPs and 8 groups, ~16M candidates.
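      These growth rates are easy to verify numerically. The helper names below are illustrative, not Placement code:

```python
from math import perm

# Each group consumes a whole device, so the groups must land on distinct
# RPs: ordered choices of g RPs out of n, i.e. the falling factorial n!/(n-g)!.
def distinct_rp_candidates(n_rps, n_groups):
    return perm(n_rps, n_groups)

# Each RP alone can satisfy every group, so any group may pick any RP: n^g.
def shared_rp_candidates(n_rps, n_groups):
    return n_rps ** n_groups

print(distinct_rp_candidates(4, 3))  # 4*3*2 = 24
print(distinct_rp_candidates(8, 4))  # 8*7*6*5 = 1680
print(distinct_rp_candidates(8, 8))  # 8! = 40320
print(shared_rp_candidates(8, 4))    # 8^4 = 4096
print(shared_rp_candidates(8, 8))    # 8^8 = 16777216
```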

      Proposal

      Improve the allocation candidate generation logic to

      1. spread the candidates between hosts instead of generating all candidates for a given host first,
      2. apply the requested candidate number limit during the candidate generation instead of after all the candidates are generated,
      3. long term, possibly add more granular limits to the /allocation_candidates API, for example, to limit the number of candidates returned per root provider (compute node).
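      Point 2 can be illustrated with a lazy generator: the limit is enforced while candidates are produced, so the full combination space is never materialized. This is a hypothetical sketch, not the actual Placement implementation:

```python
from itertools import islice, product

def generate_candidates(rps_per_group):
    # Lazily yield one group -> RP assignment at a time instead of
    # materializing the full cross product up front.
    for combo in product(*rps_per_group):
        if len(set(combo)) == len(combo):  # each RP serves only one group
            yield combo

# 8 RPs, 3 groups: 8*7*6 = 336 possible candidates exist, but with the
# limit applied during generation only the first 5 are ever produced.
rps = [f"rp{i}" for i in range(8)]
limited = list(islice(generate_candidates([rps] * 3), 5))
print(len(limited))  # 5
```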

      Note that there is an upstream functional test proposed that demonstrates the issue.

      Solution proposal

      See the proposal and reasoning in the related spike OSPRH-11950

      Implementation

      See the patches https://review.opendev.org/q/topic:%22bug/2070257%22

      The release note quoted below summarizes the change, from https://review.opendev.org/c/openstack/placement/+/936832/7/releasenotes/notes/bug-2070257-allocation-candidates-generation-limit-and-strategy.yaml-e73796898163fb55.yaml:

      In a deployment with wide and symmetric provider trees, i.e. where there
      are multiple children providers under the same root having inventory from
      the same resource class (e.g. in case of nova's mdev GPU or PCI in
      Placement features) if the allocation candidate request asks for resources
      from those children RPs in multiple request groups the number of possible
      allocation candidates grows rapidly.
      E.g.:

      • 1 root, 8 child RPs with 1 unit of resource each
        a_c requests 6 groups with 1 unit of resource each
        => 8*7*6*5*4*3=20160 possible candidates
      • 1 root, 8 child RPs with 6 unit of resources each
        a_c requests 6 groups with 6 unit of resources each
        => 8^6=262144 possible candidates

      Placement generates these candidates fully before applying the limit
      parameter provided in the allocation candidate query, to be able to do a
      random sampling if ``[placement]randomize_allocation_candidates`` is True.

      Placement takes excessive time and memory to generate this amount of
      allocation candidates, and the client might time out waiting for the
      response, or the Placement API service runs out of memory and crashes.

      To avoid request timeout or out of memory events, a new
      ``[placement]max_allocation_candidates`` config option is implemented. This
      limit is applied during the candidate generation process, not afterwards
      like the request limit, so this new option can be used to bound the
      runtime and memory consumption of the Placement API service.

      The new config option defaults to ``-1``, meaning no limit, to keep the
      legacy behavior. We suggest tuning this config in the affected
      deployments based on the memory available for the Placement service and the
      timeout setting of the clients. A good initial value could be around
      ``100000``.

      If the number of generated allocation candidates is limited by the
      ``[placement]max_allocation_candidates`` config option, then it is possible
      to get candidates from a limited set of root providers (e.g. compute
      nodes), as Placement uses a depth-first strategy, i.e. generating all
      candidates from the first root before considering the next one. To avoid
      this issue a new config option
      ``[placement]allocation_candidates_generation_strategy`` is introduced
      with two possible values:

      • ``depth-first``, generates all candidates from the first viable root
        provider before moving to the next. This is the default and triggers
        the old behavior.
      • ``breadth-first``, generates candidates from viable roots in a
        round-robin fashion, creating one candidate from each viable root
        before creating the second candidate from the first root. This is the
        new behavior.

      In a deployment where ``[placement]max_allocation_candidates`` is
      configured to a positive number, we recommend setting
      ``[placement]allocation_candidates_generation_strategy`` to
      ``breadth-first``.

      Performance results

      https://docs.google.com/spreadsheets/d/179hQJ2e40d7MGc6UNVoACGMai2kXcY-52qEKAGrtyiw/edit?usp=sharing

              rh-ee-bgibizer Balazs Gibizer
              rhos-workloads-compute
