Red Hat OpenStack Services on OpenShift / OSPRH-13567

[17.1] Un-proxied libvirt calls list(All)Devices() can cause nova-compute to freeze for hours


    • Bug
    • Resolution: Unresolved
    • rhos-17.1.6
    • openstack-nova
    • Moderate

      This is a copy of upstream bug https://bugs.launchpad.net/nova/+bug/2091033


      tl;dr: This bug has the same root cause as https://bugs.launchpad.net/nova/+bug/1840912, where items in lists returned from libvirt are not automatically wrapped in a tpool.Proxy.

      This was discovered during investigation of a downstream bug [1] in which a live migration was dirtying memory faster than it could be transferred, and nova-compute froze for hours, unable to perform any other operation, not even logging.

      The freeze was tracked down to the un-proxied libvirt call listAllDevices(), which could block all other greenthreads. The listAllDevices() call occurs in _get_pci_passthrough_devices() in the libvirt driver, during the update_available_resource() periodic task. A GMR collected while reproducing the issue contained a traceback showing this [2]:

      stderr F /usr/lib/python3.6/site-packages/oslo_service/periodic_task.py:222 in run_periodic_tasks
      stderr F `task(self, context)`
      stderr F
      stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9142 in update_available_resource
      stderr F `startup=startup)`
      stderr F
      stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9056 in _update_available_resource_for_node
      stderr F `startup=startup)`
      stderr F
      stderr F /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:911 in update_available_resource
      stderr F `resources = self.driver.get_available_resource(nodename)`
      stderr F
      stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8369 in get_available_resource
      stderr F `data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()`
      stderr F
      stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in _get_pci_passthrough_devices
      stderr F `in devices.items() if "pci" in dev.listCaps()]`
      stderr F
      stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in <listcomp>
      stderr F `in devices.items() if "pci" in dev.listCaps()]`
      stderr F
      stderr F /usr/lib64/python3.6/site-packages/libvirt.py:6313 in listCaps
      stderr F `ret = libvirtmod.virNodeDeviceListCaps(self._o)`
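
The traceback ends inside a C-extension call (libvirtmod.virNodeDeviceListCaps). To illustrate why a call like that can stall every other cooperative task, here is a minimal stdlib-only analogy. It uses asyncio in place of eventlet greenthreads and time.sleep() in place of the libvirt call; the names blocking_call and count_ticks are hypothetical and none of this is nova code.

```python
import asyncio
import time


def blocking_call(seconds):
    # Stand-in for a C-extension call that never yields to the scheduler.
    time.sleep(seconds)


async def count_ticks(offload):
    """Return how many heartbeat ticks ran while blocking_call was in flight."""
    ticks = 0
    done = False

    async def heartbeat():
        # Cooperative task that should tick roughly every 10 ms.
        nonlocal ticks
        while not done:
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.create_task(heartbeat())
    await asyncio.sleep(0)  # give the heartbeat a chance to start
    before = ticks
    if offload:
        # Run the blocking call in a native thread; the loop keeps scheduling.
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, blocking_call, 0.2)
    else:
        # Run the blocking call in the loop's own thread; nothing else can run.
        blocking_call(0.2)
    during = ticks - before
    done = True
    await task
    return during
```

Calling asyncio.run(count_ticks(offload=False)) reports no heartbeat ticks during the blocking call, while offload=True lets the heartbeat keep running. Eventlet's tpool plays roughly the role of run_in_executor here.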
      

      The listAllDevices() function returned a list of unwrapped virNodeDevice objects and so calling listCaps() on such an unwrapped device could cause a freeze.

      Based on the above, the bug reporter was able to test a patch [3] to wrap listAllDevices() list items in tpool.Proxy and the result showed nova-compute no longer freezing [4] in the aforementioned scenario.

      During the investigation it was also noticed that the items in the list returned by listDevices() were not wrapped in tpool.Proxy either, so the patch fixes that as well.
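
The difference between proxying only the list call and proxying each returned item can be sketched as follows. This is a simplified stdlib analogue, not the patch in [3] and not eventlet's implementation: ThreadProxy, FakeConnection, FakeDevice and the two list_all_devices_* helpers are hypothetical names, and ThreadProxy only imitates the idea of tpool.Proxy (run each method call in a native worker thread).

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Single worker thread standing in for eventlet's native thread pool.
_POOL = ThreadPoolExecutor(max_workers=1, thread_name_prefix="native")


class FakeDevice:
    """Hypothetical stand-in for libvirt.virNodeDevice."""

    def listCaps(self):
        # Report which thread actually executed the call.
        return threading.current_thread().name


class FakeConnection:
    """Hypothetical stand-in for a libvirt connection."""

    def listAllDevices(self):
        return [FakeDevice(), FakeDevice()]


class ThreadProxy:
    """Run every method call on the wrapped object in the worker pool.

    Simplified analogue of tpool.Proxy. Unlike eventlet, result() here
    blocks the caller; the point is only where the call executes.
    """

    def __init__(self, obj):
        self._obj = obj

    def __getattr__(self, name):
        attr = getattr(self._obj, name)
        if not callable(attr):
            return attr

        def call(*args, **kwargs):
            return _POOL.submit(attr, *args, **kwargs).result()

        return call


def list_all_devices_unwrapped(conn):
    # The list call is proxied, but the returned list holds raw objects.
    return ThreadProxy(conn).listAllDevices()


def list_all_devices_wrapped(conn):
    # Sketch of the fix: wrap each returned item as well.
    return [ThreadProxy(dev) for dev in ThreadProxy(conn).listAllDevices()]
```

With the unwrapped helper, dev.listCaps() executes in the calling thread, which is the situation described above; with the wrapped helper it executes in the "native" worker thread.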

      [1] https://bugzilla.redhat.com/show_bug.cgi?id=2312196
      [2] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c13
      [3] https://review.opendev.org/c/openstack/nova/+/932669
      [4] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c21

              mwitt@redhat.com melanie witt
              rhos-dfg-compute