  1. RHEL
  2. RHEL-52635

External snapshot delete causes VM to hang

    • Bug
    • Resolution: Unresolved
    • Undefined
    • None
    • rhel-9.5
    • qemu-kvm / Storage
    • No
    • Moderate
    • rhel-sst-virtualization-storage
    • ssg_virtualization
    • None
    • False
    • None
    • Red Hat Enterprise Linux
    • None
    • s390x
    • None

      What were you trying to do that didn't work?

      Write to an XFS filesystem in the guest after deleting an external snapshot.

      Please provide the package NVR for which the bug is seen:

      libvirt-10.5.0-4.el9.s390x
      qemu-kvm-9.0.0-7.el9.s390x

      guest kernel: 5.14.0-480

      How reproducible:

      10%

      Steps to reproduce

      1. Write to a file in the guest (use dd)
      2. Create an external snapshot (with memspec and diskspec)
      3. Write to a file in the guest (use dd)
      4. Delete the external snapshot
      5. Repeat these steps, varying their order
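
The steps above can be sketched as a small dry-run script. This is not the attached script.sh, whose contents are not reproduced in this report. The domain name `vm`, the snapshot name `s1`, the disk target `vda`, the file paths, the dd parameters, and the helpers `run`, `guest_dd`, and `one_cycle` are all assumptions made for illustration. With `DRY_RUN=1` (the default) the commands are only printed, not executed.

```shell
#!/bin/sh
# Dry-run sketch of one reproduction cycle; NOT the attached script.sh.
# Assumed names: domain "vm", snapshot "s1", disk target "vda", file paths.
DOMAIN=${DOMAIN:-vm}
SNAP=${SNAP:-s1}
SOCK=${SOCK:-/tmp/vm}      # console socket from the <serial> XML in this report
DRY_RUN=${DRY_RUN:-1}      # 1 = only print the commands

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# Steps 1 and 3: send a dd command to the guest console.
# Assumes a shell is already logged in on that console; timeout bounds
# the call because nc may keep the socket open after stdin is exhausted.
guest_dd() {
    cmd='dd if=/dev/zero of=/root/testfile bs=512 count=100'
    if [ "$DRY_RUN" = 1 ]; then
        echo "guest: $cmd"
    else
        printf '%s\n' "$cmd" | timeout 5 nc -U "$SOCK"
    fi
}

one_cycle() {
    guest_dd
    # Step 2: external snapshot with memory and disk specs.
    run virsh snapshot-create-as "$DOMAIN" "$SNAP" \
        --memspec "file=/var/tmp/$SNAP.mem,snapshot=external" \
        --diskspec "vda,snapshot=external,file=/var/tmp/$SNAP.qcow2"
    guest_dd
    # Step 4: delete the external snapshot.
    run virsh snapshot-delete "$DOMAIN" "$SNAP"
}

one_cycle
```

Running it with `DRY_RUN=0` would execute the commands against a real domain; the paths and disk target would need to match the actual VM definition first.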

      Use the attached script.sh to reproduce this more reliably:

      1. Start a VM with a console on a Unix socket so that commands can be sent to the guest:
        <serial type="unix">
          <source mode="bind" path="/tmp/vm"/>
          <target type="sclp-serial" port="0">
            <model name="sclpconsole"/>
          </target>
        </serial>
      2. Log in to the guest through the socket:
        nc -U /tmp/vm
      3. Close the connection (CTRL+C)
      4. In a terminal on the host, run:
        while true; do sh script.sh; done
      5. At some point the dd command does not return: its summary line, such as "51200 bytes (51 kB, 50 KiB) copied, 0.000325012 s, 158 MB/s", no longer appears, and the dd command is followed directly by the 'virsh snapshot-delete' command. At that point, stop the script and try to access the console again:
        nc -U /tmp/vm
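
Spotting the missing dd summary line by eye is error-prone, so the check in step 5 could be automated. The following is a minimal sketch; the helper name `dd_returned` is hypothetical, and it assumes that the output of script.sh includes the guest's dd summary line in the format quoted above.

```shell
# Succeed if the text on stdin contains a dd transfer summary such as
# "51200 bytes (51 kB, 50 KiB) copied, 0.000325012 s, 158 MB/s".
# The whole input is consumed so that the producer is not cut short.
dd_returned() {
    grep -E '[0-9]+ bytes .*copied, ' >/dev/null
}
```

Under that assumption, the loop from step 4 could stop on the first iteration without a summary line, for example `while sh script.sh 2>&1 | dd_returned; do :; done`. If the script runs several dd commands per iteration, a single surviving summary line would still count as success, so this only approximates the manual check.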

      Expected results

      All actions succeed.

      Actual results

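      At some point the console becomes unresponsive. Inspecting the memory dump, we can see at least one uninterruptible (UN) task for dd. Its trace shows dd blocked while opening the output file with truncation: xfs_setattr_size waits for writeback in folio_wait_writeback.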

      PID: 1587     TASK: 7cf5c00           CPU: 0    COMMAND: "dd"
       #0 [380000af258] __schedule at 28505518
       #1 [380000af2d8] schedule at 2850583e
       #2 [380000af308] io_schedule at 2850598a
       #3 [380000af338] folio_wait_bit_common at 27db4cdc
       #4 [380000af418] folio_wait_writeback at 27dbf3be
       #5 [380000af450] __filemap_fdatawait_range at 27db34a0
       #6 [380000af540] filemap_write_and_wait_range at 27db7580
       #7 [380000af588] xfs_setattr_size at 3ff7fd3e222 [xfs]
       #8 [380000af610] xfs_vn_setattr at 3ff7fd3e51c [xfs]
       #9 [380000af668] notify_change at 27ed81b4
      #10 [380000af718] do_truncate at 27ea9f84
      #11 [380000af7c0] do_open at 27ec1486
      #12 [380000af830] path_openat at 27ec3f7c
      #13 [380000af898] do_filp_open at 27ec53c8
      #14 [380000af9c0] do_sys_openat2 at 27eab550
      #15 [380000afa28] do_sys_open at 27eabada
      #16 [380000afa70] __do_syscall at 284fddf0
      #17 [380000afe98] system_call at 2850cb18
       PSW:  0705000180000000 000003ffb1c00670 (user space)
       GPRS: 0000000000000000 0000000000000120 ffffffffffffffda 000003ffe6e7ad22 
             0000000000000241 00000000000001b6 00000000000001b6 000003ffe6e7ad2b 
             00000000000001b6 000003ffe6e7ad22 000003ffb1ef66a0 0000000000000241 
             000003ffb1eaff68 0000000000000241 000002aa2788a28e 000003ffe6e797c0 
      

      Other uninterruptible tasks might be listed, too:

      PID: 545      TASK: 2779700           CPU: 0    COMMAND: "xfsaild/dm-0"
       #0 [37fffe0b880] __schedule at 936d518
       #1 [37fffe0b900] schedule at 936d83e
       #2 [37fffe0b930] io_schedule at 936d98a
       #3 [37fffe0b960] rq_qos_wait at 8ebae90
       #4 [37fffe0ba00] wbt_wait at 8ee5228
       #5 [37fffe0ba58] __rq_qos_throttle at 8eba99e
       #6 [37fffe0ba90] blk_mq_submit_bio at 8eaa6c8
       #7 [37fffe0bb20] __submit_bio_noacct at 8e98934
       #8 [37fffe0bb70] _xfs_buf_ioapply at 3ff7fd1b9e2 [xfs]
       #9 [37fffe0bc38] __xfs_buf_submit at 3ff7fd1bbbe [xfs]
      #10 [37fffe0bc70] xfs_buf_delwri_submit_buffers at 3ff7fd1c36c [xfs]
      #11 [37fffe0bd00] xfsaild_push at 3ff7fd5dc84 [xfs]
      #12 [37fffe0bdb8] xfsaild at 3ff7fd5e296 [xfs]
      #13 [37fffe0be10] kthread at 8a303b0
      #14 [37fffe0be68] __ret_from_fork at 89aeebc
      #15 [37fffe0be98] ret_from_fork at 9374b4a
      
      PID: 553      TASK: b7d0000           CPU: 1    COMMAND: "kworker/u10:3"
       #0 [37fffe83318] __schedule at 936d518
       #1 [37fffe83398] schedule at 936d83e
       #2 [37fffe833c8] io_schedule at 936d98a
       #3 [37fffe833f8] rq_qos_wait at 8ebae90
       #4 [37fffe83498] wbt_wait at 8ee5228
       #5 [37fffe834f0] __rq_qos_throttle at 8eba99e
       #6 [37fffe83528] blk_mq_submit_bio at 8eaa6c8
       #7 [37fffe835b8] __submit_bio_noacct at 8e98934
       #8 [37fffe83608] iomap_submit_ioend at 8da5f76
       #9 [37fffe83648] iomap_writepage_map at 8da6de4
      #10 [37fffe836f0] iomap_do_writepage at 8da716e
      #11 [37fffe83748] write_cache_pages at 8c298c6
      #12 [37fffe83878] iomap_writepages at 8da5fe6
      #13 [37fffe838a8] xfs_vm_writepages at 3ff7fd14732 [xfs]
      #14 [37fffe83950] do_writepages at 8c2ae3a
      #15 [37fffe839d0] __writeback_single_inode at 8d5c06c
      #16 [37fffe83a28] writeback_sb_inodes at 8d5c97c
      #17 [37fffe83b20] __writeback_inodes_wb at 8d5cd3a
      #18 [37fffe83b80] wb_writeback at 8d5d07e
      #19 [37fffe83c30] wb_workfn at 8d5e3c8
      #20 [37fffe83d10] process_one_work at 8a24872
      #21 [37fffe83d98] worker_thread at 8a2575e
      #22 [37fffe83e10] kthread at 8a303b0
      #23 [37fffe83e68] __ret_from_fork at 89aeebc
      #24 [37fffe83e98] ret_from_fork at 9374b4a
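
The traces above come from a memory dump. If a second shell into the guest is still usable (which this report does not confirm), uninterruptible tasks could also be listed live. The sketch below is an assumption-laden illustration: the helper name `filter_uninterruptible` is hypothetical, and it expects `ps` output with the columns PID, STAT, and COMMAND.

```shell
# Keep only lines whose second column (process state) starts with "D",
# the ps code for uninterruptible sleep. The header line is dropped
# because "STAT" does not start with "D".
filter_uninterruptible() {
    awk '$2 ~ /^D/ { print }'
}

# Example with canned input instead of a live system:
printf '%s\n' 'PID STAT COMMAND' '1587 D+ dd' '545 D xfsaild/dm-0' '1 Ss systemd' |
    filter_uninterruptible
```

On a live guest the input would come from `ps -eo pid,stat,comm` instead of the canned lines.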
      

      Additional information

      1. This also reproduces for internal snapshots.
      2. It is not yet clear whether this also happens on other architectures; I have not found similar failures in our records for them. UPDATE: Meina could not reproduce this on x86_64.
      3. I tried running the loop for 4 minutes in a guest that was installed with the ext4 filesystem and did not reproduce the problem.
      4. The issue was hit by automated test "snapshot_delete.multiple_children.del_parent_snap" but only reproduces in 10% of the test case executions.
      5. Also reproduces with the RHEL 9.4 guest kernel kernel-5.14.0-427.13.1.el9_4.s390x, qemu-kvm-8.2.0-11.el9_4.s390x, and libvirt-10.0.0-6.el9_4.s390x. External snapshots only became fully supported in RHEL 9.4, so I am not considering this a regression.
      6. I confirmed that the dd commands succeed repeatedly on the same setup when the snapshot operations are omitted.
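
The control experiment in item 6 can be sketched as a loop that runs dd repeatedly with no snapshot operations in between. The exact dd invocation used in the report is not shown; `bs=512 count=100` is an assumption chosen to match the 51200 bytes in the quoted summary line, and the helper name `control_run` is hypothetical.

```shell
# Run dd COUNT times against FILE and print "successes/COUNT".
# No snapshot operations happen between the writes.
control_run() {
    count=$1
    file=$2
    ok=0
    i=0
    while [ "$i" -lt "$count" ]; do
        # dd truncates and rewrites the file on every iteration.
        if dd if=/dev/zero of="$file" bs=512 count=100 2>/dev/null; then
            ok=$((ok + 1))
        fi
        i=$((i + 1))
    done
    echo "$ok/$count"
}
```

Inside the guest this could be called as `control_run 1000 /root/testfile`. To record which filesystem backs the file (relevant to the XFS versus ext4 observation in item 3), the output of `findmnt -n -o FSTYPE -T /root` could be logged alongside the result. Note that a dd stuck in uninterruptible sleep would block this loop as well, so a hang shows up as the loop never finishing.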

        1. hang_on_delete_s5_virtqemud.log
          669 kB
        2. script.sh
          2 kB
        3. vm.xml
          3 kB

              kwolf@redhat.com Kevin Wolf
              smitterl@redhat.com Sebastian Mitterle
              IBM Confidential Group
              virt-maint virt-maint
              Sebastian Mitterle Sebastian Mitterle