  1. Red Hat OpenStack Services on OpenShift
  2. OSPRH-26091

Fix deadlocks due to child task using the parent's executor


    • Task
    • Resolution: Unresolved
    • Undefined
    • None
    • 2026.1 (G)
    • openstack-nova
    • None
    • 3
    • False
    • False
    • Not Selected
    • rhos-workloads-compute-quasar
    • Sprint 2 Quasar, Sprint 3 Quasar
    • 2

      During integration testing with the VMware virt driver, we found a possible deadlock in nova-compute.

      Quoting from the upstream review:
      https://review.opendev.org/c/openstack/nova/+/965467/44#message-62f7ef828c38c33f322f72737b4016ace9b9f242

      I found a bug while testing with 10 VMs per compute in parallel (and probably reproduced the oslo.vmware problem).

      The following scenario leads to a high level deadlock:

      • switch to threading mode, the default executor pool size is 10
      • boot 10 VMs in parallel
      • compute gets 10 RPC requests for build_and_run_instance
      • compute moves those requests to the default executor due to the logic in [1], which fills the default pool
      • the build_and_run_instance tasks progress and spawn _allocate_network_async [2] on the same default executor, then a bit later wait for those tasks to finish. But the executor is already full of parent tasks, so the child tasks never get a worker. The result is a deadlock between the 10 parallel build_and_run_instance tasks and the 10 parallel _allocate_network_async tasks.

      [1] https://github.com/openstack/nova/blob/59a7093915298973c72b6d1749a6acd27e0045a9/nova/compute/manager.py#L2452-L2460

      [2] https://github.com/openstack/nova/blob/59a7093915298973c72b6d1749a6acd27e0045a9/nova/network/model.py#L580

      Relevant IRC discussion: https://meetings.opendev.org/irclogs/%23openstack-nova/%23openstack-nova.2026-01-30.log.html#openstack-nova.2026-01-30.log.html#t2026-01-30T15:02:52

      I will do the following:

      • audit all the spawn calls on the compute side to see how many similar cases we have
      • move the build_and_run_instance tasks to a dedicated executor that will implement the limit for parallel builds that today is implemented by a semaphore
      • look into solutions that can catch the case when a task running in an executor tries to submit a new task to the same executor.
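
One possible shape for the last two items is sketched below. It is only an illustration under stated assumptions, not the planned nova change: `GuardedExecutor`, `SelfSubmitError`, `build_pool`, and `network_pool` are hypothetical names. The guard marks each worker thread while it runs a task and refuses a `submit()` that comes from a thread already running inside the same executor. The dedicated build pool's `max_workers` doubles as the limit on parallel builds.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SelfSubmitError(RuntimeError):
    """Raised when a task submits work to the executor it is running in."""


class GuardedExecutor(ThreadPoolExecutor):
    """ThreadPoolExecutor that rejects submissions from its own workers."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Per-executor, per-thread flag: set only while a worker thread of
        # this executor is running a task.
        self._guard_local = threading.local()

    def submit(self, fn, /, *args, **kwargs):
        if getattr(self._guard_local, "inside", False):
            raise SelfSubmitError(
                "task tried to submit to the executor it is running in")
        return super().submit(self._guard_run, fn, *args, **kwargs)

    def _guard_run(self, fn, *args, **kwargs):
        self._guard_local.inside = True
        try:
            return fn(*args, **kwargs)
        finally:
            self._guard_local.inside = False


# Hypothetical dedicated pools; max_workers on build_pool caps parallel builds.
build_pool = GuardedExecutor(max_workers=2)
network_pool = GuardedExecutor(max_workers=2)


def allocate_network():
    return "network"


def good_build():
    # The child runs in a different executor, so the parent can safely wait.
    return network_pool.submit(allocate_network).result()


def bad_build():
    # The child targets the parent's own executor: rejected up front.
    return build_pool.submit(allocate_network).result()


print(build_pool.submit(good_build).result())  # network
try:
    build_pool.submit(bad_build).result()
except SelfSubmitError as exc:
    print("caught:", exc)
```

Failing fast at submit time turns a silent hang into an error that shows up in tests, at the cost of also rejecting nested submissions that would have been harmless because the pool had spare workers.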

              rh-ee-bgibizer Balazs Gibizer
              rh-ee-bgibizer Balazs Gibizer
              Ghanshyam Maan, Sylvain Bauza