RHEL-14437

nested L2 VMs fail to start: 'KVM: entry failed, hardware error 0x80000021' after rebooting L1 VM by 'echo b > /proc/sysrq-trigger'

    • Critical
    • sst_virtualization_hwe
    • ssg_virtualization
    • 13
    • 26
    • QE ack, Dev ack
    • False
    • No
    • Red Hat Enterprise Linux
    • x86_64

      What were you trying to do that didn't work?

      Brief description:

      nested VMs scenario: RHEL9.2 host, RHEL9.2 L1 VM on it, 10 Cirros L2 VMs inside the L1 VM
      10 L2 VMs are set to autostart upon L1 VM start

      If we restart the L1 VM, then with ~90% probability one of the 10 L2 VMs ends up paused, with the following complaints in /var/log/libvirt/qemu/VM_NAME.log (on the L1 level):

       

      ERROR cluster 597 refcount=0 reference=1
      ERROR cluster 601 refcount=0 reference=1
      Rebuilding refcount structure
      Repairing cluster 600 refcount=1 reference=0
      Repairing cluster 602 refcount=1 reference=0
      2023-10-23T10:25:42.465618Z qemu-kvm: warning: Machine type 'pc-i440fx-rhel7.6.0' is deprecated: machine types for previous major releases are deprecated
      KVM: entry failed, hardware error 0x80000021
      
      If you're running a guest on an Intel machine without unrestricted mode
      support, the failure can be most likely due to the guest entering an invalid
      state for Intel VT. For example, the guest maybe running in big real mode
      which is not supported on less recent Intel processors.
      
      EAX=febc0001 EBX=00000030 ECX=febc0001 EDX=00000cfc
      ESI=00000000 EDI=00000000 EBP=1efeb3f0 ESP=00006d8c
      EIP=000ec1fc EFL=00000086 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
      ES =0000 00000000 00000000 00008000 DPL=0 Reserved
      CS =0000 00000000 00000000 00c09b00 DPL=0 CS32 [-RA]
      SS =0000 00000000 00000000 00c09300 DPL=0 DS   [-WA]
      DS =0000 00000000 00000000 00008000 DPL=0 Reserved
      FS =0000 00000000 00000000 00008000 DPL=0 Reserved
      GS =0000 00000000 00000000 00008000 DPL=0 Reserved
      LDT=0000 00000000 00000000 00008000 DPL=0 Reserved
      TR =0000 00000000 00000000 00008000 DPL=0 Reserved
      GDT=     00000000 00000000
      IDT=     00000000 00000000
      CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
      DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
      DR6=00000000ffff0ff0 DR7=0000000000000400
      EFER=0000000000000000
      Code=d8 0d 00 00 00 80 ba f8 0c 00 00 ef ba fc 0c 00 00 89 c8 ef <5b> 5e c3 56 53 89 d3 8b 15 f8 54 0f 00 85 d2 0f b7 c0 74 0c 01 da c1 e0 0c 01 c2 66 89 0a
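
The error value in the log above can be split into its fields. This is a minimal sketch, assuming the value is the raw VMX exit reason as laid out in the Intel SDM (bits 15:0 hold the basic exit reason, bit 31 flags a VM-entry failure); the function name `decode_vmx_exit_reason` is made up for illustration.

```shell
#!/bin/sh
# Decode the value printed after "KVM: entry failed, hardware error".
# Assumption: the value is a raw VMX exit reason (Intel SDM layout).
decode_vmx_exit_reason() {
    raw=$(( $1 ))
    basic=$(( raw & 0xffff ))          # bits 15:0: basic exit reason
    entry_fail=$(( (raw >> 31) & 1 ))  # bit 31: set on VM-entry failure
    echo "basic_exit_reason=$basic vm_entry_failure=$entry_fail"
}

decode_vmx_exit_reason 0x80000021
```

Basic exit reason 33 with the VM-entry failure bit set corresponds to "VM-entry failure due to invalid guest state", which matches the hint QEMU prints right after the error line.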

       

      Please provide the package NVR for which bug is seen:

      kernel-5.14.0-284.30.1.el9_2.x86_64

      How reproducible:

      100% if you try several times
      ~90% on the very first reboot

      Steps to reproduce

      1. L1 VM # echo b > /proc/sysrq-trigger
      2. Wait until L1 VM restarts and L2 VMs are started.
      3. Check "virsh list" in L1 VM, find a "paused" VM.
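
Step 3 can be automated when the reboot is repeated many times. The helper below is a sketch only: `paused_domains` is a hypothetical name, and it assumes the default table layout of `virsh list` (Id, Name, State columns, names without spaces). It is not the attached test_run.sh.

```shell
#!/bin/sh
# Print the names of domains whose State column is "paused".
# Reads `virsh list` style output on stdin; rows are recognised by a
# numeric Id in the first column, so the header and separator are skipped.
paused_domains() {
    awk '$1 ~ /^[0-9]+$/ && $NF == "paused" { print $2 }'
}

# On the L1 VM this would be used as:  virsh list | paused_domains
# Sample input so the snippet can be run anywhere:
printf '%s\n' \
    ' Id   Name    State' \
    '------------------------' \
    ' 1    test1   paused' \
    ' 2    test9   running' | paused_domains
```

Where only the names are needed, `virsh list --state-paused --name` should give the same result without any parsing.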

      Expected results

      All L2 VMs are running.

      Actual results

      1 of the 10 L2 VMs is paused.

       

      Detailed description:

      Configuration:

      • host (L0): RHEL9.2, one L1 VM is running
      • L1 VM: RHEL9.2, 10 L2 VMs are running
      • L2 VMs: Guest OS: cirros-0.4.0-x86_64
        $ uname -a
        Linux cirros 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24 10:09:13 UTC 2016 x86_64 GNU/Linux

      L0 (host):
      [root@rhel9test ~]# cat /etc/redhat-release
      Red Hat Enterprise Linux release 9.2 (Plow)

      [root@rhel9test ~]# uname -a
      Linux rhel9test.aci 5.14.0-284.30.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Aug 25 09:13:12 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux

      CPU:
      model name : Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
      (20 threads)

      [root@rhel9test ~]# virsh list
       Id   Name       State
      --------------------------
       1    nestedrh   running

       

      L1: VM config: CPUs: 4, MEM: 32 GB

      [root@localhost ~]# cat /etc/redhat-release
      Red Hat Enterprise Linux release 9.2 (Plow)

      [root@localhost ~]# uname -a
      Linux localhost.localdomain 5.14.0-284.30.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Aug 25 09:13:12 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux

      [root@localhost ~]# virsh list
       Id   Name     State
      ------------------------
       1    test1    paused
       2    test9    running
       3    test4    running
       4    test3    running
       5    test8    running
       6    test5    running
       7    test6    running
       8    test2    running
       9    test7    running
       10   test10   running

      L2: VM config: CPUs: 2, MEM: 512 MB

      Guest OS: cirros-0.4.0-x86_64

      $ cat /etc/os-release
      NAME=Buildroot
      VERSION=2015.05-g31af4e3-dirty
      ID=buildroot
      VERSION_ID=2015.05
      PRETTY_NAME="Buildroot 2015.05"

      $ uname -a
      Linux cirros 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24 10:09:13 UTC 2016 x86_64 GNU/Linux

       

      How do I reproduce it (easy!):

      [L1 VM]# echo b > /proc/sysrq-trigger

      With ~90% probability, after the L1 VM restarts one of the L2 VMs will be in the "paused" state,
      with the following complaints in the logs:

      [root@localhost ~]# virsh list
       Id   Name    State
      ------------------------
       1    test3   paused
      ...

      L1 dmesg:
      [ 5.509169] virbr0: port 1(vnet0) entered listening state
      [ 5.902969] set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.
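
The dmesg line above names a kvm_intel module parameter. One way to turn it on in the L1 VM before the next reproduction attempt is sketched below, assuming the parameter is exposed under the usual sysfs module-parameter path and is writable at runtime; the function name `enable_vmcs_dump` and its optional path argument are illustrative only.

```shell
#!/bin/sh
# Turn on KVM's VMCS dump for failed VM entries.
# Assumption: the parameter lives at the standard sysfs location for
# kvm_intel parameters. The optional argument only exists so the function
# can be tried against a scratch file.
enable_vmcs_dump() {
    param=${1:-/sys/module/kvm_intel/parameters/dump_invalid_vmcs}
    if [ ! -w "$param" ]; then
        echo "cannot write $param (kvm_intel not loaded, or not root?)" >&2
        return 1
    fi
    echo 1 > "$param"
    cat "$param"   # show the value now stored
}

# As root on the L1 VM:  enable_vmcs_dump
```

With the parameter enabled, the next failed entry should leave a VMCS dump in the L1 kernel log next to the existing message.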

      L1 journalctl:

      Oct 23 13:25:42 localhost.localdomain systemd[1]: Started Virtual Machine qemu-1-test3.
      Oct 23 13:25:42 localhost.localdomain virtqemud[1810]: 2023-10-23 10:25:42.442+0000: 1810: info : libvirt version: 9.0.0, package: 10.3.el9_2 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2023-08-24-06:08:50, )
      Oct 23 13:25:42 localhost.localdomain virtqemud[1810]: 2023-10-23 10:25:42.442+0000: 1810: info : hostname: localhost.localdomain
      Oct 23 13:25:42 localhost.localdomain virtqemud[1810]: 2023-10-23 10:25:42.442+0000: 1810: warning : virSecurityValidateTimestamp:205 : Invalid XATTR timestamp detected on /var/lib/libvirt/images/test3.qcow2 secdriver=dac
      Oct 23 13:25:42 localhost.localdomain kernel: set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.
      Oct 23 13:25:43 localhost.localdomain virtqemud[1855]: 2023-10-23 10:25:43.329+0000: 1855: info : libvirt version: 9.0.0, package: 10.3.el9_2 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2023-08-24-06:08:50, )

       

      L1 /var/log/libvirt/qemu/test3.log:

      char device redirected to /dev/pts/0 (label charserial0)
      ERROR cluster 597 refcount=0 reference=1
      ERROR cluster 601 refcount=0 reference=1
      Rebuilding refcount structure
      Repairing cluster 600 refcount=1 reference=0
      Repairing cluster 602 refcount=1 reference=0
      2023-10-23T10:25:42.465618Z qemu-kvm: warning: Machine type 'pc-i440fx-rhel7.6.0' is deprecated: machine types for previous major releases are deprecated
      KVM: entry failed, hardware error 0x80000021
      
      If you're running a guest on an Intel machine without unrestricted mode
      support, the failure can be most likely due to the guest entering an invalid
      state for Intel VT. For example, the guest maybe running in big real mode
      which is not supported on less recent Intel processors.
      
      EAX=febc0001 EBX=00000030 ECX=febc0001 EDX=00000cfc
      ESI=00000000 EDI=00000000 EBP=1efeb3f0 ESP=00006d8c
      EIP=000ec1fc EFL=00000086 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
      ES =0000 00000000 00000000 00008000 DPL=0 Reserved
      CS =0000 00000000 00000000 00c09b00 DPL=0 CS32 [-RA]
      SS =0000 00000000 00000000 00c09300 DPL=0 DS   [-WA]
      DS =0000 00000000 00000000 00008000 DPL=0 Reserved
      FS =0000 00000000 00000000 00008000 DPL=0 Reserved
      GS =0000 00000000 00000000 00008000 DPL=0 Reserved
      LDT=0000 00000000 00000000 00008000 DPL=0 Reserved
      TR =0000 00000000 00000000 00008000 DPL=0 Reserved
      GDT=     00000000 00000000
      IDT=     00000000 00000000
      CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
      DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
      DR6=00000000ffff0ff0 DR7=0000000000000400
      EFER=0000000000000000
      Code=d8 0d 00 00 00 80 ba f8 0c 00 00 ef ba fc 0c 00 00 89 c8 ef <5b> 5e c3 56 53 89 d3 8b 15 f8 54 0f 00 85 d2 0f b7 c0 74 0c 01 da c1 e0 0c 01 c2 66 89 0a

        1. FailToReboot.png
          37 kB
        2. L1_journalctl.dump
          183 kB
        3. L1_nestedrh.xml
          8 kB
        4. L2_test3.log
          6 kB
        5. L2_test3.xml
          4 kB
        6. test_run.sh
          0.8 kB
        7. test_run.sysrq.sh
          0.9 kB

            bdas@redhat.com Bandan Das
            khorenko Konstantin Khorenko
            virt-maint virt-maint
            Yanbin Duan Yanbin Duan