Bug
Resolution: Unresolved
rhel-9.7
rhel-virt-core-live-migration
When running a live migration with the data channel protected by TLS, if the pre-copy phase lasts long enough for a TLS re-key to be performed, the TLS session will often fail after switching to post-copy mode.
This was tested with the latest c9s packages:
qemu-kvm-9.1.0-23.el9.x86_64
gnutls-3.8.3-6.el9.x86_64
but this should apply to all versions of QEMU with post-copy and TLS support, when gnutls negotiates TLS 1.3
It is also reported upstream as
https://gitlab.com/qemu-project/qemu/-/issues/1937
Reproducing this is slow and non-deterministic, because the TLS rekey operation only happens after 16 million TLS packets have been sent, which is a lot of data. On my test machine it takes at least 5 minutes running in pre-copy mode to trigger a re-key.
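To get a feel for how long pre-copy has to run before a re-key, the threshold can be turned into a rough time estimate. This is only a back-of-the-envelope sketch: the 16 million figure comes from the text above, while the average TLS record size (2048 bytes) and the throughput (100 MiB/s) are illustrative assumptions, not measured values.

```python
def seconds_until_rekey(records_before_rekey, avg_record_bytes, throughput_bytes_per_s):
    """Estimate how long a stream must run before the re-key threshold is hit."""
    total_bytes = records_before_rekey * avg_record_bytes
    return total_bytes / throughput_bytes_per_s


# Threshold quoted in this report; the other two inputs are hypothetical.
REKEY_RECORDS = 16_000_000
ASSUMED_RECORD_BYTES = 2048               # assumption, not measured
ASSUMED_THROUGHPUT = 100 * 1024 * 1024    # assumption: 100 MiB/s

estimate = seconds_until_rekey(REKEY_RECORDS, ASSUMED_RECORD_BYTES, ASSUMED_THROUGHPUT)
print(f"estimated time to re-key: {estimate:.1f} s")
```

With these assumed inputs the estimate lands at a little over 5 minutes; larger records or a slower link push it out further.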
As prerequisites you need:
- TLS x509 certificates for client and server in a convenient directory (/home/berrange/tls in this example)
- the memtest.iso image as a means to create lots of dirty guest RAM quickly
First start the target QEMU with incoming migration with postcopy and TLS
$ /usr/libexec/qemu-kvm -display none -m 8000 -smp 8 -accel kvm -qmp stdio -cdrom ~/memtest.iso -incoming defer
{ "execute": "qmp_capabilities"}
{"return": {}}
{ "execute": "object-add", "arguments":{ "id": "tls0", "qom-type": "tls-creds-x509", "dir": "/home/berrange/tls", "endpoint": "server" }}
{"return": {}}
{ "execute": "migrate-set-capabilities" , "arguments": { "capabilities": [ { "capability": "postcopy-ram", "state": true } ] } }
{"return": {}}
{ "execute": "migrate-set-parameters", "arguments": { "tls-creds": "tls0" } }
{"return": {}}
{ "execute": "migrate-incoming" , "arguments": { "uri": "tcp:localhost:9000" } }
{"return": {}}
Then start the source QEMU with outgoing migration with postcopy and TLS
$ /usr/libexec/qemu-kvm -display none -m 8000 -smp 8 -accel kvm -qmp stdio -cdrom ~/memtest.iso
{ "execute": "qmp_capabilities"}
{"return": {}}
{ "execute": "object-add", "arguments":{ "id": "tls0", "qom-type": "tls-creds-x509", "dir": "/home/berrange/tls", "endpoint": "client" }}
{"return": {}}
{ "execute": "migrate-set-capabilities" , "arguments": { "capabilities": [ { "capability": "postcopy-ram", "state": true } ] } }
{"return": {}}
{ "execute": "migrate-set-parameters", "arguments": { "tls-creds": "tls0" } }
{"return": {}}
{ "execute": "migrate" , "arguments": { "uri": "tcp:localhost:9000" } }
{"return": {}}
{ "execute": "query-migrate" }
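The two QMP sessions differ only in the TLS endpoint and in the final command, so they can be generated from one helper if you want to script the reproducer. The following is a minimal sketch: the function name qmp_commands and its parameters are hypothetical, and it only prints the JSON lines to paste or pipe into a QEMU started with -qmp stdio; it does not talk to QEMU itself.

```python
import json


def qmp_commands(endpoint, tls_dir, uri):
    """Build the QMP command sequence for one side of the migration.

    endpoint is "server" for the target QEMU or "client" for the source.
    """
    if endpoint not in ("server", "client"):
        raise ValueError("endpoint must be 'server' or 'client'")
    # The target waits for the stream, the source starts it.
    final = "migrate-incoming" if endpoint == "server" else "migrate"
    return [
        {"execute": "qmp_capabilities"},
        {"execute": "object-add",
         "arguments": {"id": "tls0", "qom-type": "tls-creds-x509",
                       "dir": tls_dir, "endpoint": endpoint}},
        {"execute": "migrate-set-capabilities",
         "arguments": {"capabilities": [
             {"capability": "postcopy-ram", "state": True}]}},
        {"execute": "migrate-set-parameters",
         "arguments": {"tls-creds": "tls0"}},
        {"execute": final, "arguments": {"uri": uri}},
    ]


# Print the source-side commands, one JSON document per line.
for command in qmp_commands("client", "/home/berrange/tls", "tcp:localhost:9000"):
    print(json.dumps(command))
```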
Let it run in pre-copy mode for a long time. How long depends on how fast the machine can do AES encryption and copy localhost network packets.
I run it for at least 5 minutes. If the migration successfully converges in this time, increase the guest RAM size beyond 8 GB.
Now on the source QEMU run
{ "execute": "migrate-start-postcopy" }
{"return": {}}
If you are "lucky", migration will quickly fail on the source QEMU with output looking like this
{"timestamp": {"seconds": 1750351018, "microseconds": 453586}, "event": "STOP"}
qemu-kvm: failed to save SaveStateEntry with id(name): 2(ram): -5
qemu-kvm: Unable to shutdown socket: Bad file descriptor
qemu-kvm: Detected IO failure for postcopy. Migration paused.
The more useful error can be seen from query-migrate:
{ "execute": "query-migrate" }
{"return": {"status": "postcopy-paused", "setup-time": 18, "error-desc": "Cannot read from TLS channel: Decryption has failed.", "downtime": 12, "total-time": 295280, "ram": {"total": 8406245376, "postcopy-requests": 0, "dirty-sync-count": 13, "multifd-bytes": 0, "pages-per-second": 1040, "downtime-bytes": 0, "page-size": 4096, "remaining": 4290199552, "postcopy-bytes": 4465167, "mbps": 34.164701298701303, "transferred": 34827235369, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 34822417810, "duplicate": 4266, "dirty-pages-rate": 20384, "normal-bytes": 34758934528, "normal": 8486068}}}
"Decryption has failed" is the indication that GNUTLS has corrupted its internal state due to the rekey operation.
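When the reproducer is run in a loop, it helps to recognise this signature automatically. The sketch below checks a query-migrate reply for the two markers seen in this report: a "postcopy-paused" status and "Decryption has failed" in error-desc. The function name is hypothetical, and other failures may well produce different strings, so treat a negative result as "not this signature" rather than "no problem".

```python
import json


def is_tls_rekey_failure(reply_text):
    """Return True if a query-migrate reply matches the failure in this report."""
    reply = json.loads(reply_text)
    info = reply.get("return", {})
    paused = info.get("status") == "postcopy-paused"
    decrypt_error = "Decryption has failed" in info.get("error-desc", "")
    return paused and decrypt_error


# Abbreviated copy of the reply shown in this report.
sample = json.dumps({"return": {
    "status": "postcopy-paused",
    "error-desc": "Cannot read from TLS channel: Decryption has failed.",
}})
print(is_tls_rekey_failure(sample))
```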
This seriously limits the usability of TLS with post-copy migration, given that TLS 1.3 is the out-of-the-box default in RHEL.
- blocks: RHEL-103240 qemu-kvm crashes on rhel9.2 EUS when live migrating guests that are under heavy load (Closed)
- depends on: RHEL-98672 gnutls corrupts session state with multiple threads due to TLS 1.3 rekeying (Release Pending)
- relates to: RHEL-104382 RFE: provide a way to override QEMU crypto priority for live migration [rhel-10.1] (Release Pending)