RHEL-98671

Migration with TLS often breaks when TLS 1.3 is negotiated and return-path is present


    • Bug
    • Resolution: Unresolved
    • rhel-9.7
    • Low
    • rhel-virt-core-live-migration
    • feature & bug fixed planned

      When running a live migration with the data channel protected by TLS, if the pre-copy phase lasts long enough for a TLS re-key to be performed, the TLS session will often fail after switching to post-copy mode.

      This was tested with the latest c9s packages:

      qemu-kvm-9.1.0-23.el9.x86_64
      gnutls-3.8.3-6.el9.x86_64

      However, it should apply to all versions of QEMU with post-copy and TLS support whenever gnutls negotiates TLS 1.3.

      It is also reported upstream as

      https://gitlab.com/qemu-project/qemu/-/issues/1937

      Reproducing this is slow and non-deterministic, because the TLS rekey operation only happens after 16 million TLS packets have been sent, which is a lot of data. On my test machine it takes at least 5 minutes running in pre-copy mode to trigger a re-key.
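
As a rough aid for deciding how long to stay in pre-copy, the sketch below estimates the time needed to reach the re-key threshold. The 16 million figure comes from the description above; the record payload sizes are hypothetical values (not measured), and the throughput is derived from the "transferred" and "total-time" fields of the sample query-migrate output later in this report, so treat the result as an order-of-magnitude guess only.

```python
def seconds_until_rekey(records_before_rekey, bytes_per_record, bytes_per_second):
    """Estimate how long the sender must stream data before a TLS re-key happens."""
    if bytes_per_second <= 0:
        raise ValueError("bytes_per_second must be positive")
    return records_before_rekey * bytes_per_record / bytes_per_second


# "16 million TLS packets" as stated in this report.
REKEY_RECORDS = 16_000_000

# Throughput taken from the sample query-migrate output in this report:
# 34827235369 bytes transferred over a total-time of 295280 ms.
observed_bytes_per_second = 34827235369 / (295280 / 1000)

# Hypothetical average record payload sizes; the real value was not measured.
for assumed_record_bytes in (1024, 4096):
    estimate = seconds_until_rekey(
        REKEY_RECORDS, assumed_record_bytes, observed_bytes_per_second
    )
    print(f"{assumed_record_bytes} bytes/record: about {estimate:.0f} s until re-key")
```

If the estimate is longer than the time the guest needs to converge, a larger guest RAM size is needed to keep pre-copy running long enough.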

      As prerequisites you need:

      • TLS x509 certificates for client and server in a convenient directory (/home/berrange/tls in this example)
      • the memtest.iso image as a means to create lots of dirty guest RAM quickly

      First start the target QEMU with incoming migration, post-copy and TLS enabled:

      $ /usr/libexec/qemu-kvm  -display none -m 8000 -smp 8 -accel kvm  -qmp stdio -cdrom ~/memtest.iso -incoming defer
      { "execute": "qmp_capabilities"}
      {"return": {}}
      { "execute": "object-add", "arguments":{ "id": "tls0", "qom-type": "tls-creds-x509", "dir": "/home/berrange/tls", "endpoint": "server" }}
      {"return": {}}
      { "execute": "migrate-set-capabilities" , "arguments": { "capabilities": [ { "capability": "postcopy-ram", "state": true } ] } }
      {"return": {}}
      { "execute": "migrate-set-parameters", "arguments": { "tls-creds": "tls0" } }
      {"return": {}}
      { "execute": "migrate-incoming" , "arguments": { "uri": "tcp:localhost:9000" } }
      {"return": {}}
      

      Then start the source QEMU with outgoing migration, post-copy and TLS enabled:

      $ /usr/libexec/qemu-kvm  -display none -m 8000 -smp 8 -accel kvm  -qmp stdio -cdrom ~/memtest.iso
      { "execute": "qmp_capabilities"}
      {"return": {}}
      { "execute": "object-add", "arguments":{ "id": "tls0", "qom-type": "tls-creds-x509", "dir": "/home/berrange/tls", "endpoint": "client" }}
      {"return": {}}
      { "execute": "migrate-set-capabilities" , "arguments": { "capabilities": [ { "capability": "postcopy-ram", "state": true } ] } }
      {"return": {}}
      { "execute": "migrate-set-parameters", "arguments": { "tls-creds": "tls0" } }
      {"return": {}}
      { "execute": "migrate" , "arguments": { "uri": "tcp:localhost:9000" } }
      {"return": {}}
      { "execute": "query-migrate" }
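
Since the target and source setups differ only in the TLS endpoint and the final command, the QMP command sequences above can be generated rather than typed by hand. This is a minimal sketch: the function names migration_setup_commands and as_qmp_lines are made up for illustration, and the script only prints the JSON lines for pasting into the "-qmp stdio" session; it does not talk to QEMU itself.

```python
import json


def migration_setup_commands(endpoint, tls_dir, uri, tls_id="tls0"):
    """Build the QMP commands used in this report for one side of the migration.

    endpoint is "server" for the target QEMU and "client" for the source QEMU.
    """
    if endpoint not in ("client", "server"):
        raise ValueError("endpoint must be 'client' or 'server'")
    # The source starts the migration, the target only listens for it.
    final_command = "migrate" if endpoint == "client" else "migrate-incoming"
    return [
        {"execute": "qmp_capabilities"},
        {"execute": "object-add",
         "arguments": {"id": tls_id, "qom-type": "tls-creds-x509",
                       "dir": tls_dir, "endpoint": endpoint}},
        {"execute": "migrate-set-capabilities",
         "arguments": {"capabilities": [
             {"capability": "postcopy-ram", "state": True}]}},
        {"execute": "migrate-set-parameters",
         "arguments": {"tls-creds": tls_id}},
        {"execute": final_command, "arguments": {"uri": uri}},
    ]


def as_qmp_lines(commands):
    """Serialize the commands as one JSON document per line."""
    return "\n".join(json.dumps(command) for command in commands)


# Same directory and URI as in the reproduction steps above.
print(as_qmp_lines(
    migration_setup_commands("server", "/home/berrange/tls", "tcp:localhost:9000")))
```

Calling it with "client" instead of "server" yields the source-side sequence ending in "migrate".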
      

      Let it run in pre-copy mode for a long time. How long depends on how fast the machine can do AES encryption and copy localhost network packets.

      I run it for at least 5 minutes. If migration successfully converges in this time, increase the guest RAM size beyond 8 GB.

      Now on the source QEMU run

      { "execute": "migrate-start-postcopy" }
      {"return": {}}
      

      If you are "lucky", migration will quickly fail on the source QEMU with output looking like this:

      {"timestamp": {"seconds": 1750351018, "microseconds": 453586}, "event": "STOP"}
      qemu-kvm: failed to save SaveStateEntry with id(name): 2(ram): -5
      qemu-kvm: Unable to shutdown socket: Bad file descriptor
      qemu-kvm: Detected IO failure for postcopy. Migration paused.
      

      The more useful error can be seen from query-migrate:

      { "execute": "query-migrate" }
      {"return": {"status": "postcopy-paused", "setup-time": 18, "error-desc": "Cannot read from TLS channel: Decryption has failed.", "downtime": 12, "total-time": 295280, "ram": {"total": 8406245376, "postcopy-requests": 0, "dirty-sync-count": 13, "multifd-bytes": 0, "pages-per-second": 1040, "downtime-bytes": 0, "page-size": 4096, "remaining": 4290199552, "postcopy-bytes": 4465167, "mbps": 34.164701298701303, "transferred": 34827235369, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 34822417810, "duplicate": 4266, "dirty-pages-rate": 20384, "normal-bytes": 34758934528, "normal": 8486068}}}
      

      "Decryption has failed" is the indication that GNUTLS has corrupted its internal state due to the rekey operation.

      This seriously limits the usability of TLS with post-copy migration, given that TLS 1.3 is the out-of-the-box default in RHEL.
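
Because reproduction is non-deterministic, it can help to check each query-migrate reply automatically for the failure signature described above. The sketch below is an assumption about how one might do that: the helper name is_tls_decryption_failure is invented here, and it only matches the status and error text shown in this report, which may differ in other QEMU or gnutls versions.

```python
import json


def is_tls_decryption_failure(reply):
    """Return True if a parsed query-migrate reply shows the failure from this report."""
    info = reply.get("return", {})
    return (info.get("status") == "postcopy-paused"
            and "Decryption has failed" in info.get("error-desc", ""))


# Abbreviated copy of the reply quoted in this report.
sample = json.loads(
    '{"return": {"status": "postcopy-paused", '
    '"error-desc": "Cannot read from TLS channel: Decryption has failed."}}'
)
print(is_tls_decryption_failure(sample))
```

A reply with any other status, or a paused migration with a different error, is reported as not matching, so unrelated post-copy failures are not counted as reproductions.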

              Juraj Marcin (rh-ee-jmarcin)
              Daniel Berrangé (rhn-engineering-berrange)
              virt-maint
              Xiaohui Li