-
Task
-
Resolution: Unresolved
-
Undefined
-
None
-
2026.1 (G)
-
None
-
3
-
False
-
-
False
-
Not Selected
-
rhos-workloads-compute-quasar
-
-
-
Sprint 2 Quasar, Sprint 3 Quasar
-
2
During integration testing with the VMware virt driver we found a possible deadlock situation in nova-compute.
Quoting from the upstream review:
https://review.opendev.org/c/openstack/nova/+/965467/44#message-62f7ef828c38c33f322f72737b4016ace9b9f242
I found a bug while testing with 10 VMs per compute in parallel (and probably reproduced the oslo.vmware problem).
The following scenario leads to a high-level deadlock (see the sketch after this list):
- switch to threading mode; the default executor pool size is 10
- boot 10 VMs in parallel
- compute gets 10 RPC requests for build_and_run_instance
- compute moves those requests to the default executor due to the logic [1]; this fills the default pool
- the build_and_run_instance tasks progress and spawn _allocate_network_async [2] on the same default executor, then a bit later wait for them to finish. But the executor is full with the parent tasks, so we have a deadlock between the 10 parallel build_and_run_instance tasks and the 10 queued _allocate_network_async tasks.
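A minimal standalone sketch of the scenario above (not Nova code; the pool size, function names and the use of a plain concurrent.futures.ThreadPoolExecutor are illustrative assumptions): the parent tasks fill the only executor and then block waiting on child tasks submitted to the same executor, so the children can never be scheduled.

import concurrent.futures

POOL_SIZE = 10  # mirrors the default executor pool size mentioned above

executor = concurrent.futures.ThreadPoolExecutor(max_workers=POOL_SIZE)

def allocate_network_async(vm_id):
    # stand-in for _allocate_network_async
    return "network for VM %d" % vm_id

def build_and_run_instance(vm_id):
    # stand-in for the build_and_run_instance task
    child = executor.submit(allocate_network_async, vm_id)
    # the parent blocks here; with all POOL_SIZE workers occupied by parents,
    # the child futures are never picked up, so this never returns
    return child.result()

# submit 10 parents in parallel; running this hangs forever, reproducing the deadlock
parents = [executor.submit(build_and_run_instance, i) for i in range(POOL_SIZE)]
print([f.result() for f in parents])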
Relevant IRC discussion: https://meetings.opendev.org/irclogs/%23openstack-nova/%23openstack-nova.2026-01-30.log.html#openstack-nova.2026-01-30.log.html#t2026-01-30T15:02:52
I will do the following:
- audit all the spawn calls on the compute side to see how many similar cases we have
- move the build_and_run_instance tasks to a dedicated executor that will enforce the limit on parallel builds that is currently implemented with a semaphore (see the first sketch after this list)
- look into solutions that can catch the case where a task running in an executor tries to submit a new task to the same executor (see the second sketch after this list)
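A rough sketch of the dedicated-executor idea, assuming plain concurrent.futures executors; build_executor, default_executor and max_concurrent_builds are illustrative names, not Nova or futurist API. The pool size of the dedicated executor takes over the role of the semaphore as the limit on parallel builds, and a waiting parent can no longer starve its own child tasks.

import concurrent.futures

max_concurrent_builds = 10  # would come from config; today this is the semaphore limit

# dedicated pool for the build_and_run_instance parents; its size is the build limit
build_executor = concurrent.futures.ThreadPoolExecutor(
    max_workers=max_concurrent_builds, thread_name_prefix="build")

# default pool, now only used for child work such as _allocate_network_async
default_executor = concurrent.futures.ThreadPoolExecutor(
    max_workers=10, thread_name_prefix="default")

def _allocate_network_async(vm_id):
    return "network for VM %d" % vm_id

def build_and_run_instance(vm_id):
    # parent runs on build_executor, child runs on default_executor,
    # so the full parent pool cannot block the child from being scheduled
    child = default_executor.submit(_allocate_network_async, vm_id)
    return child.result()

futures = [build_executor.submit(build_and_run_instance, i) for i in range(10)]
print([f.result() for f in futures])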
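One possible shape for the detection idea, again only a sketch: a submit() wrapper that checks whether the calling thread is one of the executor's own worker threads. GuardedExecutor is a hypothetical name, and the check relies on ThreadPoolExecutor's private _threads attribute; a real solution might log a warning instead of raising.

import concurrent.futures
import threading

class GuardedExecutor(concurrent.futures.ThreadPoolExecutor):
    def submit(self, fn, *args, **kwargs):
        # _threads is a private ThreadPoolExecutor attribute holding its worker
        # threads; if the submitting thread is one of them, the new task can end
        # up queued behind its own parent when the pool is full
        if threading.current_thread() in self._threads:
            raise RuntimeError(
                "task running on this executor submitted work back to the "
                "same executor; this can deadlock when the pool is full")
        return super().submit(fn, *args, **kwargs)

pool = GuardedExecutor(max_workers=2)
# pool.submit(lambda: pool.submit(print, "nested"))  # the inner submit would raise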
- is depended on by
-
OSPRH-19525 Make nova-compute run in native threading mode
-
- Backlog
-