Bug
Resolution: Duplicate
Normal
rhos-17.1.z
Description of problem:
While live-migrating a VM that is under heavy memory pressure, the connection to RabbitMQ is dropped and the migration is aborted.
Version-Release number of selected component (if applicable):
16.2
How reproducible:
Reported by a customer while live-migrating VMs (OCP workers). Reproduced in a lab environment with the same symptoms and consequences.
Steps to Reproduce:
1. Start a VM and run stress-ng in it to fill and continuously dirty its memory (or deploy an OCP worker if preferred)
2. Start the live migration and monitor progress by running these commands in a loop (a minimal loop sketch follows the block below):
~~~
date
openstack server show $server_id -c progress -f yaml
nova server-migration-show $server_id $migration_id | grep memory_ | sed -e 's/| //g;s/|$//g'
~~~
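For reference, a minimal sketch of step 1 and the monitoring loop (the stress-ng flags and the sleep interval are assumptions; tune them to the flavor):
~~~
# Inside the guest: keep dirtying memory continuously.
# Assumption: 4 workers x 20% of RAM each ~= 80% of guest memory, continuously rewritten.
stress-ng --vm 4 --vm-bytes 20% --vm-keep

# From the client host: poll migration progress roughly every 25 seconds
while true; do
    date
    openstack server show $server_id -c progress -f yaml
    nova server-migration-show $server_id $migration_id | grep memory_ | sed -e 's/| //g;s/|$//g'
    sleep 25
done
~~~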
Actual results:
3. Initially, the migration progresses normally:
~~~
Thu Sep 12 16:11:55 CEST 2024
progress: 0
memory_processed_bytes -
memory_remaining_bytes -
memory_total_bytes -
Thu Sep 12 16:12:21 CEST 2024
progress: 4
memory_processed_bytes 2211554969
memory_remaining_bytes 32076713984
memory_total_bytes 34364792832
~~~
4. Eventually, the rate at which the guest dirties memory exceeds the copy speed, and progress falls back. At this point the migration has already copied over 12 times the memory size of the VM, but this is known and (I assume) normal:
~~~
Thu Sep 12 17:22:38 CEST 2024
progress: 94
memory_processed_bytes 428854981180
memory_remaining_bytes 834887680
memory_total_bytes 34364792832
Thu Sep 12 17:23:07 CEST 2024
progress: 31
memory_processed_bytes 431996388459
memory_remaining_bytes 21890752512
memory_total_bytes 34364792832
Thu Sep 12 17:23:33 CEST 2024
progress: 41
memory_processed_bytes 434533618219
memory_remaining_bytes 19336859648
memory_total_bytes 34364792832
Thu Sep 12 17:23:58 CEST 2024
progress: 49
memory_processed_bytes 437340261899
memory_remaining_bytes 16534937600
memory_total_bytes 34364792832
~~~
5. Afterwards, progress and memory metrics are no longer updated, which suggests a lack of updates to the database for ~22 minutes (17:25 to 17:47); a query sketch to confirm this follows the output below:
~~~
Thu Sep 12 17:24:25 CEST 2024
progress: 63
memory_processed_bytes 441992684877
memory_remaining_bytes 11881574400
memory_total_bytes 34364792832
Thu Sep 12 17:24:52 CEST 2024
progress: 74
memory_processed_bytes 445978914886
memory_remaining_bytes 7899877376
memory_total_bytes 34364792832
Thu Sep 12 17:25:17 CEST 2024
progress: 81
memory_processed_bytes 448567785763
memory_remaining_bytes 5316001792
memory_total_bytes 34364792832
Thu Sep 12 17:25:41 CEST 2024
progress: 81
memory_processed_bytes 448567785763
memory_remaining_bytes 5316001792
memory_total_bytes 34364792832
— snip —
Thu Sep 12 17:47:38 CEST 2024
progress: 81
memory_processed_bytes 448567785763
memory_remaining_bytes 5316001792
memory_total_bytes 34364792832
Thu Sep 12 17:47:59 CEST 2024
progress: 81
memory_processed_bytes 448567785763
memory_remaining_bytes 5316001792
memory_total_bytes 34364792832
~~~
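As a cross-check, the stalled rows can also be confirmed directly in the nova database on a controller, using the same kind of query shown in the Additional info section below (a sketch only; the migration id is a placeholder):
~~~
# Placeholder migration id; confirm updated_at and memory_processed stop advancing during the window
sudo podman exec $(sudo podman ps -q -f name=galera) mysql nova \
  -e 'select updated_at,status,memory_processed,memory_remaining from migrations where id=<migration_id>\G'
~~~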
6. During this window, the nova_compute healthcheck fails, already suggesting a problem connecting to RabbitMQ (a journalctl sketch for the same window follows the log excerpt):
~~~
$ sudo grep tripleo_nova_compute_healthcheck.service /var/log/messages
...
Sep 12 17:25:23 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Succeeded.
Sep 12 17:27:25 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Succeeded.
Sep 12 17:29:28 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 17:29:28 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Failed with result 'exit-code'.
Sep 12 17:31:17 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 17:31:17 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Failed with result 'exit-code'.
Sep 12 17:32:26 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 17:32:26 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Failed with result 'exit-code'.
Sep 12 17:34:15 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 17:34:15 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Failed with result 'exit-code'.
Sep 12 17:35:25 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 17:35:25 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Failed with result 'exit-code'.
Sep 12 17:37:29 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 17:37:29 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Failed with result 'exit-code'.
Sep 12 17:39:25 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 17:39:25 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Failed with result 'exit-code'.
Sep 12 17:41:26 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 17:41:26 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Failed with result 'exit-code'.
Sep 12 17:43:19 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 17:43:19 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Failed with result 'exit-code'.
Sep 12 17:44:26 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 17:44:26 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Failed with result 'exit-code'.
Sep 12 17:46:25 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 17:46:25 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Failed with result 'exit-code'.
Sep 12 17:48:24 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Succeeded.
Sep 12 17:50:23 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Succeeded.
~~~
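The same window can also be inspected with journalctl on the compute node (a sketch; the timestamps match the reproduction above):
~~~
sudo journalctl -u tripleo_nova_compute_healthcheck.service \
  --since "2024-09-12 17:20" --until "2024-09-12 17:50"
~~~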
7. For whatever reason, nova-compute.log is no longer updated during this window, even with debug enabled (note the jump from 17:25 to 17:48):
~~~
2024-09-12 17:25:10.779 8 DEBUG nova.virt.libvirt.migration [-] [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] Downtime does not need to change update_downtime /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:521
2024-09-12 17:25:10.880 8 DEBUG nova.virt.libvirt.driver [req-6a398fec-e21e-469c-a4b2-086cdbe04303 - - - - -] skipping disk for instance-000000e6 as it does not have a path _get_instance_disk_info_from_config /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9918
2024-09-12 17:25:10.886 8 DEBUG nova.virt.libvirt.driver [req-6a398fec-e21e-469c-a4b2-086cdbe04303 - - - - -] skipping disk for instance-000000e9 as it does not have a path _get_instance_disk_info_from_config /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9918
2024-09-12 17:25:10.891 8 DEBUG nova.virt.libvirt.driver [req-6a398fec-e21e-469c-a4b2-086cdbe04303 - - - - -] skipping disk for instance-000000c7 as it does not have a path _get_instance_disk_info_from_config /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9918
2024-09-12 17:48:05.970 8 ERROR oslo.messaging._drivers.impl_rabbit [-] [66ce8f67-540f-41a8-8cc6-ccb361f8b79e] AMQP server on overcloud-controller-2.ctlplane.keller.lab:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: OSError: Server unexpectedly closed connection
~~~
8. Immediately after the RabbitMQ connection error is reported, a traceback follows and the migration is aborted (a note on the timeout involved comes after the traceback):
~~~
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to process message ... skipping it.: ConnectionResetError: [Errno 104] Connection reset by peer
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 382, in _callback
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit self.callback(RabbitMessage(message))
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 497, in __call__
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit message.acknowledge()
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 237, in acknowledge
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit self._raw_message.ack()
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python3.6/site-packages/kombu/message.py", line 126, in ack
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit self.channel.basic_ack(self.delivery_tag, multiple=multiple)
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python3.6/site-packages/amqp/channel.py", line 1394, in basic_ack
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit spec.Basic.Ack, argsig, (delivery_tag, multiple),
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python3.6/site-packages/amqp/abstract_channel.py", line 59, in send_method
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit conn.frame_writer(1, self.channel_id, sig, args, content)
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python3.6/site-packages/amqp/method_framing.py", line 172, in write_frame
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit write(view[:offset])
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python3.6/site-packages/amqp/transport.py", line 252, in write
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit self._write(s)
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python3.6/site-packages/amqp/transport.py", line 377, in _write
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit n = write(s)
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python3.6/site-packages/eventlet/green/ssl.py", line 166, in write
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit super(GreenSSLSocket, self).write, data)
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python3.6/site-packages/eventlet/green/ssl.py", line 147, in _call_trampolining
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit return func(*a, **kw)
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib64/python3.6/ssl.py", line 884, in write
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit return self._sslobj.write(data)
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib64/python3.6/ssl.py", line 601, in write
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit return self._sslobj.write(data)
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit ConnectionResetError: [Errno 104] Connection reset by peer
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit
2024-09-12 17:48:06.056 8 ERROR oslo.messaging._drivers.impl_rabbit [-] [50b4b3a6-d64c-438c-8970-78599a048ddd] AMQP server on overcloud-controller-0.ctlplane.keller.lab:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2024-09-12 17:48:06.063 8 ERROR oslo.messaging._drivers.impl_rabbit [-] [0c470004-08a3-4caa-ab53-f1d4a396c013] AMQP server on overcloud-controller-0.ctlplane.keller.lab:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: OSError: Server unexpectedly closed connection
2024-09-12 17:48:06.064 8 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
2024-09-12 17:48:06.086 8 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
2024-09-12 17:48:06.096 8 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection
2024-09-12 17:48:06.107 8 WARNING nova.virt.libvirt.driver [-] [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] Error monitoring migration: Timed out waiting for a reply to message ID 389a868aba184d1a9956337bca78a569: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 389a868aba184d1a9956337bca78a569
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] Traceback (most recent call last):
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 425, in get
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] return self._queues[msg_id].get(block=True, timeout=timeout)
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 322, in get
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] return waiter.wait()
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 141, in wait
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] return get_hub().switch()
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] return self.greenlet.switch()
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] queue.Empty
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] During handling of the above exception, another exception occurred:
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] Traceback (most recent call last):
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 9382, in _live_migration
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] finish_event, disk_paths)
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 9256, in _live_migration_monitor
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] info, remaining)
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py", line 557, in save_stats
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] migration.save()
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/oslo_versionedobjects/base.py", line 210, in wrapper
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] ctxt, self, fn.__name__, args, kwargs)
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/nova/conductor/rpcapi.py", line 246, in object_action
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] objmethod=objmethod, args=args, kwargs=kwargs)
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] transport_options=self.transport_options)
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] transport_options=transport_options)
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 674, in send
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] transport_options=transport_options)
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 662, in _send
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] call_monitor_timeout)
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 551, in wait
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] message = self.waiters.get(msg_id, timeout=timeout)
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 429, in get
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] 'to message ID %s' % msg_id)
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 389a868aba184d1a9956337bca78a569
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]
2024-09-12 17:48:06.153 8 ERROR nova.virt.libvirt.driver [-] [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] Live Migration failure: operation failed: migration out job: unexpectedly failed: libvirt.libvirtError: operation failed: migration out job: unexpectedly failed
2024-09-12 17:48:06.154 8 DEBUG nova.virt.libvirt.driver [-] [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] Migration operation thread notification thread_finished /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9369
2024-09-12 17:48:06.194 8 DEBUG nova.compute.resource_tracker [req-6a398fec-e21e-469c-a4b2-086cdbe04303 - - - - -] Hypervisor/Node resource view: name=overcloud-novacompute-1.keller.lab free_ram=29768MB free_disk=1535.8201903514564GB free_vcpus=3 pci_devices=[
,
{"dev_id": "pci_0000_05_00_0", "address": "0000:05:00.0", "product_id": "1042", "vendor_id": "1af4", "numa_node": null, "label": "label_1af4_1042", "dev_type": "type-PCI"},
{"dev_id": "pci_0000_00_02_5", "address": "0000:00:02.5", "product_id": "000c", "vendor_id": "1b36", "numa_node": null, "label": "label_1b36_000c", "dev_type": "type-PCI"},
{"dev_id": "pci_0000_00_02_2", "address": "0000:00:02.2", "product_id": "000c", "vendor_id": "1b36", "numa_node": null, "label": "label_1b36_000c", "dev_type": "type-PCI"},
{"dev_id": "pci_0000_00_02_6", "address": "0000:00:02.6", "product_id": "000c", "vendor_id": "1b36", "numa_node": null, "label": "label_1b36_000c", "dev_type": "type-PCI"},
{"dev_id": "pci_0000_00_01_0", "address": "0000:00:01.0", "product_id": "1111", "vendor_id": "1234", "numa_node": null, "label": "label_1234_1111", "dev_type": "type-PCI"},
{"dev_id": "pci_0000_00_02_7", "address": "0000:00:02.7", "product_id": "000c", "vendor_id": "1b36", "numa_node": null, "label": "label_1b36_000c", "dev_type": "type-PCI"},
{"dev_id": "pci_0000_00_02_4", "address": "0000:00:02.4", "product_id": "000c", "vendor_id": "1b36", "numa_node": null, "label": "label_1b36_000c", "dev_type": "type-PCI"},
{"dev_id": "pci_0000_00_02_1", "address": "0000:00:02.1", "product_id": "000c", "vendor_id": "1b36", "numa_node": null, "label": "label_1b36_000c", "dev_type": "type-PCI"},
{"dev_id": "pci_0000_00_02_0", "address": "0000:00:02.0", "product_id": "000c", "vendor_id": "1b36", "numa_node": null, "label": "label_1b36_000c", "dev_type": "type-PCI"},
{"dev_id": "pci_0000_00_02_3", "address": "0000:00:02.3", "product_id": "000c", "vendor_id": "1b36", "numa_node": null, "label": "label_1b36_000c", "dev_type": "type-PCI"},
{"dev_id": "pci_0000_00_1f_0", "address": "0000:00:1f.0", "product_id": "2918", "vendor_id": "8086", "numa_node": null, "label": "label_8086_2918", "dev_type": "type-PCI"},
{"dev_id": "pci_0000_00_1f_3", "address": "0000:00:1f.3", "product_id": "2930", "vendor_id": "8086", "numa_node": null, "label": "label_8086_2930", "dev_type": "type-PCI"},
{"dev_id": "pci_0000_07_00_0", "address": "0000:07:00.0", "product_id": "1044", "vendor_id": "1af4", "numa_node": null, "label": "label_1af4_1044", "dev_type": "type-PCI"},
{"dev_id": "pci_0000_01_00_0", "address": "0000:01:00.0", "product_id": "1041", "vendor_id": "1af4", "numa_node": null, "label": "label_1af4_1041", "dev_type": "type-PCI"},
{"dev_id": "pci_0000_00_00_0", "address": "0000:00:00.0", "product_id": "29c0", "vendor_id": "8086", "numa_node": null, "label": "label_8086_29c0", "dev_type": "type-PCI"},
{"dev_id": "pci_0000_03_00_0", "address": "0000:03:00.0", "product_id": "000d", "vendor_id": "1b36", "numa_node": null, "label": "label_1b36_000d", "dev_type": "type-PCI"},
{"dev_id": "pci_0000_00_1f_2", "address": "0000:00:1f.2", "product_id": "2922", "vendor_id": "8086", "numa_node": null, "label": "label_8086_2922", "dev_type": "type-PCI"},
{"dev_id": "pci_0000_04_00_0", "address": "0000:04:00.0", "product_id": "1043", "vendor_id": "1af4", "numa_node": null, "label": "label_1af4_1043", "dev_type": "type-PCI"},
{"dev_id": "pci_0000_02_00_0", "address": "0000:02:00.0", "product_id": "1041", "vendor_id": "1af4", "numa_node": null, "label": "label_1af4_1041", "dev_type": "type-PCI"}] _report_hypervisor_resource_view /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:1063
2024-09-12 17:48:06.195 8 DEBUG oslo_concurrency.lockutils [req-6a398fec-e21e-469c-a4b2-086cdbe04303 - - - - -] Lock "compute_resources" acquired by "nova.compute.resource_tracker.ResourceTracker._update_available_resource" :: waited 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:327
2024-09-12 17:48:06.223 8 ERROR nova.virt.libvirt.driver [-] [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] Failed to cancel migration Requested operation is not valid: no job is active on the domain: libvirt.libvirtError: Requested operation is not valid: no job is active on the domain
2024-09-12 17:48:06.240 8 DEBUG nova.virt.libvirt.driver [-] [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] Live migration monitoring is all done _live_migration /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9401
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [-] [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] Live migration failed.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 389a868aba184d1a9956337bca78a569
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] Traceback (most recent call last):
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 425, in get
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] return self._queues[msg_id].get(block=True, timeout=timeout)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 322, in get
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] return waiter.wait()
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 141, in wait
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] return get_hub().switch()
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] return self.greenlet.switch()
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] queue.Empty
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] During handling of the above exception, another exception occurred:
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] Traceback (most recent call last):
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 7466, in _do_live_migration
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] block_migration, migrate_data)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 8858, in live_migration
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] migrate_data)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 9382, in _live_migration
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] finish_event, disk_paths)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 9256, in _live_migration_monitor
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] info, remaining)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py", line 557, in save_stats
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] migration.save()
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/oslo_versionedobjects/base.py", line 210, in wrapper
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] ctxt, self, fn.__name__, args, kwargs)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/nova/conductor/rpcapi.py", line 246, in object_action
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] objmethod=objmethod, args=args, kwargs=kwargs)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] transport_options=self.transport_options)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] transport_options=transport_options)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 674, in send
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] transport_options=transport_options)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 662, in _send
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] call_monitor_timeout)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 551, in wait
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] message = self.waiters.get(msg_id, timeout=timeout)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 429, in get
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] 'to message ID %s' % msg_id)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 389a868aba184d1a9956337bca78a569
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]
~~~
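The MessagingTimeout is raised by the migration monitor when its periodic migration.save() RPC call to nova-conductor (visible in the traceback above) gets no reply within the oslo.messaging timeout. A sketch for checking, and optionally raising, that timeout on the compute node; the nova.conf path matches the one used in the Additional info section, and raising the value is only a mitigation assumption, not a fix:
~~~
cfg=/var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf
# Show the current RPC reply timeout (defaults to 60 seconds when unset)
sudo crudini --get $cfg DEFAULT rpc_response_timeout 2>/dev/null || echo "unset (default 60s)"
# Mitigation attempt only, commented out on purpose:
# sudo crudini --set $cfg DEFAULT rpc_response_timeout 180
# sudo systemctl restart tripleo_nova_compute.service
~~~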
9. Immediately after the traceback, the service is able to reconnect to Rabbit:
~~~
2024-09-12 17:48:07.072 8 INFO oslo.messaging._drivers.impl_rabbit [-] [66ce8f67-540f-41a8-8cc6-ccb361f8b79e] Reconnected to AMQP server on overcloud-controller-2.ctlplane.keller.lab:5672 via [amqp] client with port 44858.
2024-09-12 17:48:07.101 8 INFO oslo.messaging._drivers.impl_rabbit [-] [0c470004-08a3-4caa-ab53-f1d4a396c013] Reconnected to AMQP server on overcloud-controller-0.ctlplane.keller.lab:5672 via [amqp] client with port 36704.
2024-09-12 17:48:07.116 8 INFO oslo.messaging._drivers.impl_rabbit [-] [50b4b3a6-d64c-438c-8970-78599a048ddd] Reconnected to AMQP server on overcloud-controller-0.ctlplane.keller.lab:5672 via [amqp] client with port 36696.
2024-09-12 17:48:07.119 8 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 389a868aba184d1a9956337bca78a569
~~~
10. Additionally, Nova only reports Rabbit as unreachable exactly during this live migration window, not before or after:
~~~
[root@overcloud-novacompute-1 ~]# egrep "AMQP server on overcloud-controller-..ctlplane.keller.lab:5672 is unreachable" /var/log/containers/nova/nova-compute.log
2024-09-12 17:48:05.970 8 ERROR oslo.messaging._drivers.impl_rabbit [-] [66ce8f67-540f-41a8-8cc6-ccb361f8b79e] AMQP server on overcloud-controller-2.ctlplane.keller.lab:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: OSError: Server unexpectedly closed connection
2024-09-12 17:48:06.056 8 ERROR oslo.messaging._drivers.impl_rabbit [-] [50b4b3a6-d64c-438c-8970-78599a048ddd] AMQP server on overcloud-controller-0.ctlplane.keller.lab:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2024-09-12 17:48:06.063 8 ERROR oslo.messaging._drivers.impl_rabbit [-] [0c470004-08a3-4caa-ab53-f1d4a396c013] AMQP server on overcloud-controller-0.ctlplane.keller.lab:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: OSError: Server unexpectedly closed connection
~~~
11. Checking for possible RabbitMQ errors on the controllers, I can see that the containers were last restarted days before the migration:
~~~
$ ansible -i inventory.yaml -m shell -a 'podman inspect $(podman ps -q -f name=rabbit) | jq .[].State.StartedAt' -b Controller
overcloud-controller-1 | CHANGED | rc=0 >>
"2024-09-09T14:48:31.592566484+02:00"
overcloud-controller-2 | CHANGED | rc=0 >>
"2024-09-09T14:50:18.414106543+02:00"
overcloud-controller-0 | CHANGED | rc=0 >>
"2024-09-09T14:45:50.453585594+02:00"
~~~
12. RabbitMQ reports some errors, but not during the live-migration window:
~~~
$ ansible -i inventory.yaml -m shell -a 'zgrep error /var/log/containers/stdouts/rabbitmq-bundle.log* |grep "2024-09-12" | cut -d: -f-3 | sort -u' -b Controller
overcloud-controller-2 | CHANGED | rc=0 >>
overcloud-controller-1 | CHANGED | rc=0 >>
/var/log/containers/stdouts/rabbitmq-bundle.log.2.gz:2024-09-12T16:01
/var/log/containers/stdouts/rabbitmq-bundle.log.2.gz:2024-09-12T16:03
overcloud-controller-0 | CHANGED | rc=0 >>
/var/log/containers/stdouts/rabbitmq-bundle.log.2.gz:2024-09-12T16:01
~~~
13. It seems the migration never switched to post-copy; it was aborted before that. Because post-copy is enabled, auto-converge is never triggered either (the relevant [libvirt] settings are sketched below).
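For completeness, the [libvirt] options that govern this behavior can be checked on the compute nodes (a sketch; same nova.conf path as used in the Additional info section, option names are the standard nova settings):
~~~
cfg=/var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf
for opt in live_migration_permit_post_copy live_migration_permit_auto_converge \
           live_migration_timeout_action live_migration_completion_timeout; do
    echo -n "$opt = "
    sudo crudini --get $cfg libvirt $opt 2>/dev/null || echo "unset"
done
~~~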
Expected results:
1. post_copy should be triggered if enabled
2. Even when a single migration is stuck, nova on the source compute node should not stop logging to nova-compute.log
3. The connection to RabbitMQ should not break during live migration, and the nova_compute healthcheck should not fail as a result
Additional info:
This issue is not reproduced when post-copy is disabled. The live migration above failed after running for over an hour and a half and copying well over 13 times the RAM size of the VM (the memory-processed metric stops being updated, but the copy kept running the whole time). With post-copy disabled, another VM with the same flavor and workload completes its live migration in a fraction of that time, copying less than 2x the RAM size:
~~~
(overcloud) [stack@director.keller.lab ~]$ ansible -i inventory.yaml -m shell -a 'crudini --set /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf libvirt live_migration_permit_post_copy False ; systemctl restart tripleo_nova_compute.service' -b Compute
overcloud-novacompute-0 | CHANGED | rc=0 >>
overcloud-novacompute-1 | CHANGED | rc=0 >>
(overcloud) [stack@director.keller.lab ~]$
(overcloud) [stack@director.keller.lab ~]$ date ; time openstack server migrate --live-migration e81bfb27-89e9-4218-95f2-9fa9104f87f5 --wait ; date
Fri Sep 13 15:43:38 CEST 2024
Progress: 100Complete
real 8m56.001s
user 0m2.376s
sys 0m0.604s
Fri Sep 13 15:52:34 CEST 2024
(overcloud) [stack@director.keller.lab ~]$
~~~
~~~
[tripleo-admin@overcloud-controller-0 ~]$ sudo podman exec $(sudo podman ps -q -f name=galera) mysql nova -e 'select created_at,updated_at,status,memory_total,memory_processed,memory_remaining from migrations where id=3047\G'
*************************** 1. row ***************************
created_at: 2024-09-13 13:43:56
updated_at: 2024-09-13 13:52:30
status: completed
memory_total: 34364792832
memory_processed: 60154154600
memory_remaining: 141168640
[tripleo-admin@overcloud-controller-0 ~]$
~~~
The customer reports the same kind of improvement when disabling post-copy in the attached case.
Finally, while a failed live migration can be recovered from, what is more serious is that other operations end up partially completed while the migration is stuck as shown above. For example, attaching or detaching volumes might complete in Cinder but not in Nova, or the other way around. This causes incidents that have to be fixed with the Cinder CLI in the best case, but frequently require SQL updates (a sketch of such a consistency check follows).
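A quick way to spot this kind of inconsistency is to compare both services' view of the attachments (a sketch; the server and volume IDs are placeholders):
~~~
# Nova's view of the volumes attached to the instance
openstack server show $server_id -c volumes_attached -f yaml
# Cinder's view of the same volume
openstack volume show $volume_id -c status -c attachments -f yaml
~~~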
- duplicates
OSPRH-12264 Rapid live migration monitor database writes can lead to MessagingTimeout (In Progress)
- is related to
OSPRH-12217 Un-proxied libvirt calls list(All)Devices() can cause nova-compute to freeze for hours (In Progress)
OSPRH-13567 [17.1] Un-proxied libvirt calls list(All)Devices() can cause nova-compute to freeze for hours (In Progress)
- external trackers