  Red Hat OpenStack Services on OpenShift / OSPRH-21371

Deployment of 17.1.11 leaves apparently broken communication between Nova and Libvirt


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version/s: rhos-17.1.11
    • Component/s: tripleo-ansible
    • Bug Tracking
    • rhos-ops-day1day2-edpm
    • Severity: Moderate

      To Reproduce

      Steps to reproduce the behavior:

      1. Deploy 17.1.11 from scratch
      2. Create an instance:
        $ openstack server create --flavor 4c4g --image rhel-9.2 --network testnet --key-name stack --security-group all-open --host overcloud-novacompute-2.enothen-kellerlab.tamlab.brq2.redhat.com --wait -c status -c OS-EXT-SRV-ATTR:host testvm-volume1
        
        +----------------------+------------------------------------------------------------------+
        | Field                | Value                                                            |
        +----------------------+------------------------------------------------------------------+
        | OS-EXT-SRV-ATTR:host | overcloud-novacompute-2.enothen-kellerlab.tamlab.brq2.redhat.com |
        | status               | ACTIVE                                                           |
        +----------------------+------------------------------------------------------------------+
        
      3. Create a volume:
        $ openstack volume create --size 1 testvol -c id -c status
        +--------+--------------------------------------+
        | Field  | Value                                |
        +--------+--------------------------------------+
        | id     | e1574718-cf70-474a-bf19-4dc7e52a7d23 |
        | status | creating                             |
        +--------+--------------------------------------+
        $ openstack volume show testvol -c status
        +--------+-----------+
        | Field  | Value     |
        +--------+-----------+
        | status | available |
        +--------+-----------+
        
      4. Add the volume to the instance:
        $ date ; openstack server add volume testvm-volume1 testvol ; date
        Wed Oct 29 10:24:08 CET 2025
        Wed Oct 29 10:24:13 CET 2025
        
      5. Remove the volume from the instance:
        $ date ; openstack server remove volume testvm-volume1 testvol ; date
        Wed Oct 29 10:24:27 CET 2025
        Wed Oct 29 10:24:31 CET 2025
        
      6. Meanwhile, monitor the volume status in a separate window:
        $ while [ True ] ;do echo "$(date): $(openstack volume show testvol -c status -f value)" ;done
        Wed Oct 29 10:24:09 CET 2025: available
        Wed Oct 29 10:24:11 CET 2025: reserved
        Wed Oct 29 10:24:14 CET 2025: attaching
        Wed Oct 29 10:24:17 CET 2025: in-use
        Wed Oct 29 10:24:19 CET 2025: in-use
        Wed Oct 29 10:24:22 CET 2025: in-use
        Wed Oct 29 10:24:25 CET 2025: in-use
        Wed Oct 29 10:24:27 CET 2025: in-use
        Wed Oct 29 10:24:30 CET 2025: detaching
        Wed Oct 29 10:24:32 CET 2025: detaching
        Wed Oct 29 10:24:35 CET 2025: detaching
        Wed Oct 29 10:24:37 CET 2025: detaching
        Wed Oct 29 10:24:40 CET 2025: detaching
        Wed Oct 29 10:24:42 CET 2025: detaching
        Wed Oct 29 10:24:45 CET 2025: detaching
        Wed Oct 29 10:24:47 CET 2025: detaching
        Wed Oct 29 10:24:50 CET 2025: detaching
        Wed Oct 29 10:24:52 CET 2025: available
        Wed Oct 29 10:24:54 CET 2025: available
        Wed Oct 29 10:24:57 CET 2025: available
        
      7. Looking into /var/log/containers/nova/nova-compute.log on the compute node where the VM is hosted, I can see the 20-second gap, followed by a timeout and an error (see the diagnostic sketches after these steps):
        2025-10-29 10:24:32.496 2 DEBUG nova.virt.libvirt.driver [req-18478c06-71ac-4934-a0de-d53413bffafd 31a05ca62faf4aed9e7d5e38035efa19 4ef31de303ab4ea2b611301583994bc0 - default default] (1/8): Attempting to detach device vdb with device alias virtio-disk1 from instance c562b73e-239c-4192-a981-f32f4be5efd7 from the live domain config. _detach_from_live_with_retry /usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py:2465
        2025-10-29 10:24:32.496 2 DEBUG nova.virt.libvirt.guest [req-18478c06-71ac-4934-a0de-d53413bffafd 31a05ca62faf4aed9e7d5e38035efa19 4ef31de303ab4ea2b611301583994bc0 - default default] detach device xml: <disk type="network" device="disk">
        2025-10-29 10:24:32.630 2 DEBUG nova.virt.libvirt.driver [req-18478c06-71ac-4934-a0de-d53413bffafd 31a05ca62faf4aed9e7d5e38035efa19 4ef31de303ab4ea2b611301583994bc0 - default default] Start waiting for the detach event from libvirt for device vdb with device alias virtio-disk1 for instance c562b73e-239c-4192-a981-f32f4be5efd7 _detach_from_live_and_wait_for_event /usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py:2541
        ...
        <--- 20 sec gap here
        ...
        2025-10-29 10:24:52.632 2 ERROR nova.virt.libvirt.driver [req-18478c06-71ac-4934-a0de-d53413bffafd 31a05ca62faf4aed9e7d5e38035efa19 4ef31de303ab4ea2b611301583994bc0 - default default] Waiting for libvirt event about the detach of device vdb with device alias virtio-disk1 from instance c562b73e-239c-4192-a981-f32f4be5efd7 is timed out.
        2025-10-29 10:24:52.634 2 INFO nova.virt.libvirt.driver [req-18478c06-71ac-4934-a0de-d53413bffafd 31a05ca62faf4aed9e7d5e38035efa19 4ef31de303ab4ea2b611301583994bc0 - default default] Successfully detached device vdb from instance c562b73e-239c-4192-a981-f32f4be5efd7 from the live domain config.
        
      8. Restarting the nova_virtqemud container on the same compute node:
        # date ; systemctl restart tripleo_nova_virtqemud ; date
        Wed Oct 29 10:48:26 AM CET 2025
        Wed Oct 29 10:49:53 AM CET 2025
        
      9. Attempting to reproduce again:
        $ date ; openstack server add volume testvm-volume1 testvol ; date
        Wed Oct 29 10:51:30 CET 2025
        Wed Oct 29 10:51:35 CET 2025
        $ date ; openstack server remove volume testvm-volume1 testvol ; date
        Wed Oct 29 10:51:42 CET 2025
        Wed Oct 29 10:51:45 CET 2025
        $ date ; openstack server add volume testvm-volume1 testvol ; date
        Wed Oct 29 10:51:52 CET 2025
        Wed Oct 29 10:51:58 CET 2025
        $ date ; openstack server remove volume testvm-volume1 testvol ; date
        Wed Oct 29 10:52:00 CET 2025
        Wed Oct 29 10:52:03 CET 2025
        
      10. Issue is not reproduced:
        $ while [ True ] ;do echo "$(date): $(openstack volume show testvol -c status -f value)" ;done
        Wed Oct 29 10:51:24 CET 2025: available
        Wed Oct 29 10:51:27 CET 2025: available
        Wed Oct 29 10:51:29 CET 2025: available
        Wed Oct 29 10:51:32 CET 2025: reserved
        Wed Oct 29 10:51:34 CET 2025: in-use
        Wed Oct 29 10:51:37 CET 2025: in-use
        Wed Oct 29 10:51:39 CET 2025: in-use
        Wed Oct 29 10:51:42 CET 2025: in-use
        Wed Oct 29 10:51:44 CET 2025: available
        Wed Oct 29 10:51:46 CET 2025: available
        Wed Oct 29 10:51:49 CET 2025: available
        Wed Oct 29 10:51:51 CET 2025: available
        Wed Oct 29 10:51:54 CET 2025: reserved
        Wed Oct 29 10:51:56 CET 2025: reserved
        Wed Oct 29 10:51:59 CET 2025: in-use
        Wed Oct 29 10:52:01 CET 2025: detaching
        Wed Oct 29 10:52:04 CET 2025: available
        Wed Oct 29 10:52:06 CET 2025: available
        Wed Oct 29 10:52:09 CET 2025: available
        Wed Oct 29 10:52:11 CET 2025: available
        
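      Diagnostic sketches, based on the log excerpts above (adjust the instance UUID and node names to your environment):

      • To confirm the ~20-second gap on the affected compute node, filter nova-compute.log for the detach wait and its outcome:
        # grep -E 'Start waiting for the detach event|Waiting for libvirt event about the detach|Successfully detached device' \
            /var/log/containers/nova/nova-compute.log | grep c562b73e-239c-4192-a981-f32f4be5efd7

      • To check whether libvirt is emitting device events at all, and assuming virsh is usable inside the nova_virtqemud container (as in the 17.1 modular libvirt layout), run the following while repeating the add/remove volume steps. A healthy node should print a 'device-removed' event for the detach almost immediately; an affected node is expected to print nothing until the timeout:
        # podman exec nova_virtqemud virsh event --all --loop --timeout 60
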

      Expected behavior

      • The volume status should transition to "available" fairly quickly; it should not remain in "detaching" for ~20 seconds
      • A manual container restart after a new deployment (or scale-out) should not be needed
      • The error in Nova should not happen, and apparently did not happen on 17.1.9

      Environment:

      • RHOSP 17.1.11 (reproduced)
      • The customer reports that 17.1.9 does not present the same issue (not yet attempted to reproduce)

      Bug impact

      • Apparently low impact: the volume is removed from the instance anyway after the timeout and error

      Known workaround

      • After deployment or scale-out, run `systemctl restart tripleo_nova_virtqemud` on each compute node, or more generally, as the stack user from the undercloud:
        $ ansible -i ~/overcloud-deploy/overcloud/tripleo-ansible-inventory.yaml -m service -a 'name=tripleo_nova_virtqemud state=restarted' -b Compute
        
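      • If only a newly scaled-out node needs the restart (see Additional context), the same ad-hoc command can be limited to that host. A sketch, with an illustrative hostname:
        $ ansible -i ~/overcloud-deploy/overcloud/tripleo-ansible-inventory.yaml \
            --limit overcloud-novacompute-3.example.com \
            -m service -a 'name=tripleo_nova_virtqemud state=restarted' -b Compute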

      Additional context

      • The issue does not reproduce in environments that were updated from 17.1.9 to 17.1.11
      • In environments where the workaround has been applied, scaling out to a new compute allows reproducing the issue again on the newly scaled-out node
      • The logs above are from my own environment, with Nova and libvirt in debug. Let me know if any other details are needed.
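      • Since the issue only shows up on freshly deployed or scaled-out nodes, it may help to compare the libvirt/QEMU builds running on an affected node against a node that went through the 17.1.9 to 17.1.11 update path. A sketch, assuming rpm is available inside the container image:
        # podman ps --format '{{.Image}} {{.Names}}' | grep virtqemud
        # podman exec nova_virtqemud rpm -qa | grep -E '^(libvirt|qemu-kvm)'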

              Assignee: Unassigned
              Reporter: Eric Nothen (rhn-support-enothen)
              rhos-dfg-compute-downstream-triage