Spike
Resolution: Done
Questions to answer during a 1 week timebox:
1. Can this be reproduced with mdevs only?
2. Hash out some workarounds (WAs) and assess whether they are backportable.
What we know already:
- It can be reproduced with PCI in Placement feature and a set of PFs: https://review.opendev.org/c/openstack/nova/+/855885
- It happens in the field. Two independent reports from upstream:
—
Can this be reproduced with mdevs?
Yes
The libvirt driver has a limitation today that a single VM can only allocate mdevs from a single resource provider; see the driver code. Nova actually ignores the additional mdev allocations when generating the domain XML, even though they are allocated in placement. This limitation does not prevent reproducing the issue, as it applies well after the placement allocation candidates are generated and used. But even if the placement issue were fixed, the virt driver limitation would prevent using the result.
The realistic way to reproduce it with mdevs is to use the custom trait support described in https://docs.openstack.org/nova/latest/admin/virtual-gpu.html#optional-provide-custom-traits-for-multiple-gpu-types and to request multiple GPUs with different traits per VM. This way the user is forced to request the GPUs via separate request groups in the flavor to be able to specify different traits per group.
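A sketch of the flavor extra specs that force two separate placement request groups, one per GPU type. The trait names (CUSTOM_NVIDIA_T4, CUSTOM_NVIDIA_A10) are illustrative placeholders, not part of the reproduction; real trait names come from the operator's mdev type configuration.

```python
# Illustrative flavor extra specs: two numbered request groups, each asking
# for one VGPU but with a different required custom trait. The trait names
# below are assumptions for the example only.
extra_specs = {
    "resources1:VGPU": "1",                   # first group: one vGPU ...
    "trait1:CUSTOM_NVIDIA_T4": "required",    # ... of the first GPU type
    "resources2:VGPU": "1",                   # second group: one vGPU ...
    "trait2:CUSTOM_NVIDIA_A10": "required",   # ... of the other GPU type
}

# Because the two groups carry different required traits, placement has to
# combine mdev providers across the host, which is the combinatorial case
# that reproduces the issue.
group_count = len([k for k in extra_specs if k.startswith("resources")])
```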
The reproduction patch is now extended with the mdev test case https://review.opendev.org/c/openstack/nova/+/855885
Exact limitations on different features
In general, these limitations aim to cap the number of allocation candidates generated for a given request from a given compute host, to avoid:
- only returning candidates from a small number of hosts before hitting the default 1000 candidate limit per request
- contrary to our previous assumptions, placement does not order the candidates by compute when applying the limit, so the issue is not simply that more than 1000 candidates on the first compute crowd out every other compute. (Placement uses sets internally, so the candidate iteration order at the limiting stage depends on the hash values of the AllocationRequest objects.) Still, with no defined order in place, it can happen that the response only contains candidates from a single host.
- spending excessive time and memory to generate too many candidates even when most of them are never returned due to the limit
- hitting the HTTP response size limit if the default candidate limit per request is bumped higher than 1000.
- with 1000 candidates the HTTP response has less than 1MB size which is manageable.
- with 100k candidates it is around 75MB, where the size alone can trigger problems, but the time it takes placement to generate that many candidates probably already leads to a timeout on the nova side.
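A quick arithmetic consistency check of the two size figures above: 100k candidates at ~75MB implies roughly 0.77KB of JSON per candidate, which matches the observation that 1000 candidates stay under 1MB.

```python
# Back-of-the-envelope: per-candidate payload size derived from the measured
# ~75MB response for 100k candidates.
per_candidate_kb = 75 * 1024 / 100_000   # ~0.77 KB per candidate

# The same per-candidate size puts a 1000-candidate response well under 1MB.
size_1000_mb = 1000 * per_candidate_kb / 1024
```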
Below I aim to keep the number of candidates below 100 per host, allowing 10 hosts in the response before hitting the default 1000 candidate limit.
Limitation on mdev support
To avoid the issue the deployment should put at least one of the following limitations in place (they are alternatives, so a single one is enough):
- limit the flavors so that a single VM only requests a single mdev device. This is easy to justify today, as the libvirt driver does not support more than one device and ignores the extra devices.
- ensure that a single compute host does not expose different mdev types differentiated via placement traits.
- do not allow requesting more than 2 mdevs per VM from compute hosts that have 8 devices providing mdevs. This limits the allocation candidates to 64 per host, which is probably manageable. (Or at most 4 mdevs per VM with only 3 devices providing mdevs on the host => 81 candidates per host.)
Limitation on flavor based PCI passthrough
To avoid the issue the deployment either:
- do not turn on the PCI in Placement feature
- If PCI in Placement feature is needed and turned on then either:
- limit the number of PCI devices configured in nova device_spec
- limit the number of devices requested per VM via the flavor's PCI alias.
device_type PCI or PF
- with 8 devices per host, maximum 2 devices per VM (56 candidates per host)
- with 5 devices per host, maximum 3 devices per VM (60 candidates per host)
- with 4 devices per host, all 4 devices can be requested in a single VM (24 candidates per host)
device_type VF
This assumes that each PF supports enough VFs for a single request and that group_policy is set to none in the flavor. Otherwise the situation is better, as the extra constraints limit the number of candidates further.
- with 8 PFs per host maximum 2 VFs per VM (64 candidates per host)
- with 4 PFs per host maximum 3 VFs per VM (64 candidates per host)
- with 3 PFs per host maximum 4 VFs per VM (81 candidates per host)
- with 2 PFs per host maximum 6 VFs per VM (64 candidates per host)
- with 1 PF per host, the limit is the number of VFs per PF supported by the PF.
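The per-host candidate counts above (and the 64 and 81 figures in the mdev section) follow simple combinatorics: when a device is consumed whole (device_type PCI or PF), the request groups must land on distinct devices, giving n!/(n-k)! ordered assignments; when each PF can serve several VF groups (group_policy=none, enough VF inventory), each group picks a PF independently, giving n^k. A sketch reproducing the numbers:

```python
from math import perm

def pf_candidates(devices: int, requested: int) -> int:
    # Whole-device (PF/PCI) requests: each request group must map to a
    # distinct device, so the count is the number of ordered selections
    # of `requested` devices out of `devices`.
    return perm(devices, requested)

def vf_candidates(pfs: int, requested: int) -> int:
    # VF requests with group_policy=none: every group can be satisfied by
    # any PF (assuming enough VF inventory), so the groups choose
    # independently.
    return pfs ** requested

# PF/PCI rows from the table above
assert pf_candidates(8, 2) == 56
assert pf_candidates(5, 3) == 60
assert pf_candidates(4, 4) == 24
# VF rows (the same formula covers the mdev cases: 8**2 == 64, 3**4 == 81)
assert vf_candidates(8, 2) == 64
assert vf_candidates(4, 3) == 64
assert vf_candidates(3, 4) == 81
assert vf_candidates(2, 6) == 64
```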
Possible short term workarounds
Do not generate all the candidates before applying the limit
Even though nova sets the limit to 1000 by default today, placement generates all possible candidates and only then applies the limit. The reason for this is that placement supports the randomize_allocation_candidates config option, which promises a random selection from all possible candidates. So applying the limit eagerly can only be done meaningfully if randomize_allocation_candidates is turned off.
Limiting the number of candidates generated would solve the problem of:
- spending excessive time and memory to generate too many candidates even when most of them are never returned due to the limit
but would not solve the problem of only returning candidates from a small number of hosts. The PoC below shows that a reasonable global limit of 100k effectively bounds both the time it takes to generate the response and the size of the HTTP response.
Proposal:
- Add a new config option to placement, max_allocation_candidates, set to an arbitrarily high default value, e.g. 100k. Allow the value -1 to mean unlimited, to get fully back to the legacy behavior.
- Use this config to limit the candidate generation loop independently of the limit in the request.
- Document in the randomize_allocation_candidates config option that it only selects from the max_allocation_candidates number of candidates.
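The proposal above can be sketched as follows. This is a minimal illustration of the intended semantics, not the actual placement code; the function and parameter names are hypothetical.

```python
import itertools
import random

def generate_candidates(candidate_iter, request_limit,
                        max_candidates=100_000, randomize=False):
    # Cut the expensive generation loop off at the configured global cap
    # (the proposed max_allocation_candidates); -1 means unlimited, i.e.
    # the legacy behavior.
    if max_candidates >= 0:
        candidate_iter = itertools.islice(candidate_iter, max_candidates)
    candidates = list(candidate_iter)
    if randomize:
        # With the cap in place, randomize_allocation_candidates only
        # shuffles within the first max_candidates generated, not within
        # all possible candidates - the documentation change proposed above.
        random.shuffle(candidates)
    # The per-request limit from the query string is applied last.
    return candidates[:request_limit]
```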
TODO:
- Find a good default value for it via devstack trial and error based on the memory usage and default HTTP response timeout settings.
- 100k seems a good trade-off that limits both the memory consumption and the runtime of the query while still providing plenty of candidates if needed.
- Decide if we want -1 as the default or have a large but not unlimited default instead.
- Check if a single default value helps avoid both the timeout issue and the HTTP response size issue when the query's limit is high
- 100k global limit seems to be good for both runtime and response size.
- Decide if we want to return HTTP 409 when the limit in the request is bigger than the configured max_allocation_candidates
PoC: https://review.opendev.org/c/openstack/placement/+/936658
Alternative:
- instead of a global max_allocation_candidates we could have max_allocation_candidates_per_root, to avoid the candidate generation loop exhausting the global limit on the first couple of hosts. But that needs deeper code surgery.
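The per-root alternative could work roughly like this (a sketch under the assumption that each candidate is tagged with its root provider; the names are hypothetical):

```python
from collections import defaultdict

def limit_per_root(candidates, max_per_root):
    # Cap the number of candidates kept per root provider, so one large
    # host cannot consume the whole candidate budget before other hosts
    # are considered.
    seen = defaultdict(int)
    for root, candidate in candidates:
        if seen[root] < max_per_root:
            seen[root] += 1
            yield candidate
```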
Distribute candidates more evenly across hosts
In a production cloud nova probably wants to get a few candidates from many different hosts rather than many candidates from a small number of hosts, to maximize the chance of finding a good host based on the extra filtering criteria (like AggregateExtraSpec, ComputeCapability, or ServerGroup filtering) only nova is aware of. Placement's default behavior (with randomize_allocation_candidates off) does not guarantee any kind of spreading of candidates across multiple root providers (compute hosts).
Proposal:
- Add a new balanced candidate generation strategy that takes one candidate from each compute in a round-robin fashion.
- Make the new strategy optional, keeping the legacy compute-depth-first strategy and the related randomize_allocation_candidates behavior available. Probably via a new config option candidate_generation_order with values exhaust_root_first, roots_round_robin, randomized.
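The roots_round_robin idea in a nutshell, assuming candidates are already grouped per root provider (a sketch; names are illustrative, not the PoC code):

```python
from itertools import chain, zip_longest

def roots_round_robin(per_root_candidates, limit):
    # Take one candidate from each root (compute host) in turn, so even a
    # small limit yields candidates spread across many hosts instead of
    # exhausting one host's combinations first.
    _missing = object()
    interleaved = chain.from_iterable(
        zip_longest(*per_root_candidates, fillvalue=_missing))
    return [c for c in interleaved if c is not _missing][:limit]

# With two hosts and limit=4 both hosts contribute, unlike the legacy
# depth-first order, which would return only host1 candidates.
hosts = [["h1-c1", "h1-c2", "h1-c3"], ["h2-c1", "h2-c2"]]
spread = roots_round_robin(hosts, 4)
# -> ['h1-c1', 'h2-c1', 'h1-c2', 'h2-c2']
```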
TODO:
- How does this affect randomize_allocation_candidates usability? Probably we need to restrict the behavior to either randomize or spread, but not both at the same time.
- How does this play with the limit provided by nova and the number of valid roots found by placement?
- What is the memory consumption of the new strategy?
- actually the 100k limit helps with the memory pressure more than how much extra memory the round robin strategy needs. (at least for 2 computes)
I tested the performance characteristics of these two changes (max candidates, and round robin) in devstack with two computes, each providing 7 IGB PFs with 6 VFs each, and a VM that requests 6 VFs.
- On master the allocation candidate request took 220 seconds and the placement-api service's RSS moved from 105MB to 650MB during the query, then went back to 120MB after the query returned
- with the fixes applied (100k maximum candidates, round robin strategy) the allocation candidate request took 47 seconds and the memory moved from 104MB to 335MB and then back to 125MB.
PoC: https://review.opendev.org/c/openstack/nova-specs/+/936407
Long term solutions
- Move the config driven behavior to the API, i.e. add an ordering_strategy (none/randomize/spread) flag to the /allocation_candidates API
- Think about what kind of symmetric candidates placement generates today that does not provide any extra information to nova's scheduler filters. Add a strategy that eliminates / does not generate the symmetric candidates from the result.
is related to: OSPRH-37 Spread allocation candidates between hosts in Placement GET /allocation_candidates query (Closed)