  Red Hat OpenStack Services on OpenShift / OSPRH-21371

Deployment of 17.1.11 leaves apparently broken communication between Nova and Libvirt


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version/s: rhos-17.1.11
    • Component/s: tripleo-ansible
    • Bug Tracking
    • rhos-ops-day1day2-edpm
    • Severity: Moderate

      To Reproduce

      Steps to reproduce the behavior:

      1. Deploy 17.1.11 from scratch
      2. Create an instance:
        $ openstack server create --flavor 4c4g --image rhel-9.2 --network testnet --key-name stack --security-group all-open --host overcloud-novacompute-2.enothen-kellerlab.tamlab.brq2.redhat.com --wait -c status -c OS-EXT-SRV-ATTR:host testvm-volume1
        
        +----------------------+------------------------------------------------------------------+
        | Field                | Value                                                            |
        +----------------------+------------------------------------------------------------------+
        | OS-EXT-SRV-ATTR:host | overcloud-novacompute-2.enothen-kellerlab.tamlab.brq2.redhat.com |
        | status               | ACTIVE                                                           |
        +----------------------+------------------------------------------------------------------+
        
      3. Create a volume:
        $ openstack volume create --size 1 testvol -c id -c status
        +--------+--------------------------------------+
        | Field  | Value                                |
        +--------+--------------------------------------+
        | id     | e1574718-cf70-474a-bf19-4dc7e52a7d23 |
        | status | creating                             |
        +--------+--------------------------------------+
        $ openstack volume show testvol -c status
        +--------+-----------+
        | Field  | Value     |
        +--------+-----------+
        | status | available |
        +--------+-----------+
        
      4. Add the volume to the instance:
        $ date ; openstack server add volume testvm-volume1 testvol ; date
        Wed Oct 29 10:24:08 CET 2025
        Wed Oct 29 10:24:13 CET 2025
        
      5. Remove the volume from the instance:
        $ date ; openstack server remove volume testvm-volume1 testvol ; date
        Wed Oct 29 10:24:27 CET 2025
        Wed Oct 29 10:24:31 CET 2025
        
      6. Meanwhile, monitor the volume status in a separate window:
        $ while [ True ] ;do echo "$(date): $(openstack volume show testvol -c status -f value)" ;done
        Wed Oct 29 10:24:09 CET 2025: available
        Wed Oct 29 10:24:11 CET 2025: reserved
        Wed Oct 29 10:24:14 CET 2025: attaching
        Wed Oct 29 10:24:17 CET 2025: in-use
        Wed Oct 29 10:24:19 CET 2025: in-use
        Wed Oct 29 10:24:22 CET 2025: in-use
        Wed Oct 29 10:24:25 CET 2025: in-use
        Wed Oct 29 10:24:27 CET 2025: in-use
        Wed Oct 29 10:24:30 CET 2025: detaching
        Wed Oct 29 10:24:32 CET 2025: detaching
        Wed Oct 29 10:24:35 CET 2025: detaching
        Wed Oct 29 10:24:37 CET 2025: detaching
        Wed Oct 29 10:24:40 CET 2025: detaching
        Wed Oct 29 10:24:42 CET 2025: detaching
        Wed Oct 29 10:24:45 CET 2025: detaching
        Wed Oct 29 10:24:47 CET 2025: detaching
        Wed Oct 29 10:24:50 CET 2025: detaching
        Wed Oct 29 10:24:52 CET 2025: available
        Wed Oct 29 10:24:54 CET 2025: available
        Wed Oct 29 10:24:57 CET 2025: available
        
      7. Looking into /var/log/containers/nova/nova-compute.log on the compute node where the VM is hosted, I can see the 20-second gap, followed by a timeout and an error (see the diagnostic sketches after these steps):
        2025-10-29 10:24:32.496 2 DEBUG nova.virt.libvirt.driver [req-18478c06-71ac-4934-a0de-d53413bffafd 31a05ca62faf4aed9e7d5e38035efa19 4ef31de303ab4ea2b611301583994bc0 - default default] (1/8): Attempting to detach device vdb with device alias virtio-disk1 from instance c562b73e-239c-4192-a981-f32f4be5efd7 from the live domain config. _detach_from_live_with_retry /usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py:2465
        2025-10-29 10:24:32.496 2 DEBUG nova.virt.libvirt.guest [req-18478c06-71ac-4934-a0de-d53413bffafd 31a05ca62faf4aed9e7d5e38035efa19 4ef31de303ab4ea2b611301583994bc0 - default default] detach device xml: <disk type="network" device="disk">
        2025-10-29 10:24:32.630 2 DEBUG nova.virt.libvirt.driver [req-18478c06-71ac-4934-a0de-d53413bffafd 31a05ca62faf4aed9e7d5e38035efa19 4ef31de303ab4ea2b611301583994bc0 - default default] Start waiting for the detach event from libvirt for device vdb with device alias virtio-disk1 for instance c562b73e-239c-4192-a981-f32f4be5efd7 _detach_from_live_and_wait_for_event /usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py:2541
        ...
        <--- 20 sec gap here
        ...
        2025-10-29 10:24:52.632 2 ERROR nova.virt.libvirt.driver [req-18478c06-71ac-4934-a0de-d53413bffafd 31a05ca62faf4aed9e7d5e38035efa19 4ef31de303ab4ea2b611301583994bc0 - default default] Waiting for libvirt event about the detach of device vdb with device alias virtio-disk1 from instance c562b73e-239c-4192-a981-f32f4be5efd7 is timed out.
        2025-10-29 10:24:52.634 2 INFO nova.virt.libvirt.driver [req-18478c06-71ac-4934-a0de-d53413bffafd 31a05ca62faf4aed9e7d5e38035efa19 4ef31de303ab4ea2b611301583994bc0 - default default] Successfully detached device vdb from instance c562b73e-239c-4192-a981-f32f4be5efd7 from the live domain config.
        
      8. Restarting the nova_virtqemud container on the same compute node:
        # date ; systemctl restart tripleo_nova_virtqemud ; date
        Wed Oct 29 10:48:26 AM CET 2025
        Wed Oct 29 10:49:53 AM CET 2025
        
      9. Attempting to reproduce again:
        $ date ; openstack server add volume testvm-volume1 testvol ; date
        Wed Oct 29 10:51:30 CET 2025
        Wed Oct 29 10:51:35 CET 2025
        $ date ; openstack server remove volume testvm-volume1 testvol ; date
        Wed Oct 29 10:51:42 CET 2025
        Wed Oct 29 10:51:45 CET 2025
        $ date ; openstack server add volume testvm-volume1 testvol ; date
        Wed Oct 29 10:51:52 CET 2025
        Wed Oct 29 10:51:58 CET 2025
        $ date ; openstack server remove volume testvm-volume1 testvol ; date
        Wed Oct 29 10:52:00 CET 2025
        Wed Oct 29 10:52:03 CET 2025
        
      10. Issue is not reproduced:
        $ while [ True ] ;do echo "$(date): $(openstack volume show testvol -c status -f value)" ;done
        Wed Oct 29 10:51:24 CET 2025: available
        Wed Oct 29 10:51:27 CET 2025: available
        Wed Oct 29 10:51:29 CET 2025: available
        Wed Oct 29 10:51:32 CET 2025: reserved
        Wed Oct 29 10:51:34 CET 2025: in-use
        Wed Oct 29 10:51:37 CET 2025: in-use
        Wed Oct 29 10:51:39 CET 2025: in-use
        Wed Oct 29 10:51:42 CET 2025: in-use
        Wed Oct 29 10:51:44 CET 2025: available
        Wed Oct 29 10:51:46 CET 2025: available
        Wed Oct 29 10:51:49 CET 2025: available
        Wed Oct 29 10:51:51 CET 2025: available
        Wed Oct 29 10:51:54 CET 2025: reserved
        Wed Oct 29 10:51:56 CET 2025: reserved
        Wed Oct 29 10:51:59 CET 2025: in-use
        Wed Oct 29 10:52:01 CET 2025: detaching
        Wed Oct 29 10:52:04 CET 2025: available
        Wed Oct 29 10:52:06 CET 2025: available
        Wed Oct 29 10:52:09 CET 2025: available
        Wed Oct 29 10:52:11 CET 2025: available
        
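      Diagnostic sketches, based on the log excerpts above (adjust the instance UUID and node names to your environment):

      • To confirm the ~20-second gap on the affected compute node, filter nova-compute.log for the detach wait and its outcome:
        # grep -E 'Start waiting for the detach event|Waiting for libvirt event about the detach|Successfully detached device' \
            /var/log/containers/nova/nova-compute.log | grep c562b73e-239c-4192-a981-f32f4be5efd7

      • To check whether libvirt is emitting device events at all, and assuming virsh is usable inside the nova_virtqemud container (as in the 17.1 modular libvirt layout), run the following while repeating the add/remove volume steps. A healthy node should print a 'device-removed' event for the detach almost immediately; an affected node is expected to print nothing until the timeout:
        # podman exec nova_virtqemud virsh event --all --loop --timeout 60
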

      Expected behavior

      • The volume status should transition to "available" fairly quickly; it should not remain in "detaching" for ~20 seconds
      • A manual container restart after a new deployment (or scale-out) should not be needed
      • The error in Nova should not happen, and apparently did not happen on 17.1.9

      Environment:

      • RHOSP 17.1.11 (reproduced)
      • The customer reports that 17.1.9 does not present the same issue (not yet attempted to reproduce)

      Bug impact

      • Apparently low impact: the volume is removed from the instance anyway after the timeout and error

      Known workaround

      • After deployment or scale-out, run `systemctl restart tripleo_nova_virtqemud` on each compute node, or more generally, as the stack user from the undercloud:
        $ ansible -i ~/overcloud-deploy/overcloud/tripleo-ansible-inventory.yaml -m service -a 'name=tripleo_nova_virtqemud state=restarted' -b Compute
        
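      • If only a newly scaled-out node needs the restart (see Additional context), the same ad-hoc command can be limited to that host. A sketch, with an illustrative hostname:
        $ ansible -i ~/overcloud-deploy/overcloud/tripleo-ansible-inventory.yaml \
            --limit overcloud-novacompute-3.example.com \
            -m service -a 'name=tripleo_nova_virtqemud state=restarted' -b Compute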

      Additional context

      • The issue does not reproduce in environments that were updated from 17.1.9 to 17.1.11
      • In environments where the workaround has been applied, scaling out to a new compute allows reproducing the issue again on the newly scaled-out node
      • The logs above are from my own environment, with Nova and libvirt in debug. Let me know if any other details are needed.
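      • Since the issue only shows up on freshly deployed or scaled-out nodes, it may help to compare the libvirt/QEMU builds running on an affected node against a node that went through the 17.1.9 to 17.1.11 update path. A sketch, assuming rpm is available inside the container image:
        # podman ps --format '{{.Image}} {{.Names}}' | grep virtqemud
        # podman exec nova_virtqemud rpm -qa | grep -E '^(libvirt|qemu-kvm)'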

              Assignee: Unassigned
              Reporter: Eric Nothen (rhn-support-enothen)
              rhos-dfg-compute-downstream-triage