-
Bug
-
Resolution: Done
-
Normal
-
rhel-9.2.0
-
None
-
Critical
-
rhel-sst-virtualization-hwe
-
ssg_virtualization
-
13
-
26
-
None
-
QE ack, Dev ack
-
False
-
-
No
-
Red Hat Enterprise Linux
-
None
-
Pass
-
-
RegressionOnly
-
x86_64
-
None
What were you trying to do that didn't work?
Brief description:
Nested VM scenario: a RHEL9.2 host, a RHEL9.2 L1 VM on it, and 10 CirrOS L2 VMs inside the L1 VM.
All 10 L2 VMs are set to autostart when the L1 VM starts.
If we restart the L1 VM, then with ~90% probability one of the 10 L2 VMs ends up paused, and the following errors appear in /var/log/libvirt/qemu/VM_NAME.log (on the L1 level):
ERROR cluster 597 refcount=0 reference=1
ERROR cluster 601 refcount=0 reference=1
Rebuilding refcount structure
Repairing cluster 600 refcount=1 reference=0
Repairing cluster 602 refcount=1 reference=0
2023-10-23T10:25:42.465618Z qemu-kvm: warning: Machine type 'pc-i440fx-rhel7.6.0' is deprecated: machine types for previous major releases are deprecated
KVM: entry failed, hardware error 0x80000021
If you're running a guest on an Intel machine without unrestricted mode support, the failure can be most likely due to the guest entering an invalid state for Intel VT. For example, the guest maybe running in big real mode which is not supported on less recent Intel processors.
EAX=febc0001 EBX=00000030 ECX=febc0001 EDX=00000cfc
ESI=00000000 EDI=00000000 EBP=1efeb3f0 ESP=00006d8c
EIP=000ec1fc EFL=00000086 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 00000000 00008000 DPL=0 Reserved
CS =0000 00000000 00000000 00c09b00 DPL=0 CS32 [-RA]
SS =0000 00000000 00000000 00c09300 DPL=0 DS [-WA]
DS =0000 00000000 00000000 00008000 DPL=0 Reserved
FS =0000 00000000 00000000 00008000 DPL=0 Reserved
GS =0000 00000000 00000000 00008000 DPL=0 Reserved
LDT=0000 00000000 00000000 00008000 DPL=0 Reserved
TR =0000 00000000 00000000 00008000 DPL=0 Reserved
GDT= 00000000 00000000
IDT= 00000000 00000000
CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=d8 0d 00 00 00 80 ba f8 0c 00 00 ef ba fc 0c 00 00 89 c8 ef <5b> 5e c3 56 53 89 d3 8b 15 f8 54 0f 00 85 d2 0f b7 c0 74 0c 01 da c1 e0 0c 01 c2 66 89 0a
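A small decoding sketch can help read two of the numbers in this dump. It assumes the standard Intel SDM layouts, in which QEMU prints the VMX exit-reason value that KVM returns for a failed VM entry. In that field, bits 15:0 hold the basic exit reason and bit 31 marks a VM-entry failure. The CR0 table below uses the architectural bit positions. The function names are illustrative and do not come from QEMU or KVM.

```python
"""Illustrative decoder for two values in a QEMU/KVM register dump.

Bit layouts follow the Intel SDM (VMX exit-reason field and CR0).
This is a hypothetical helper, not part of QEMU or KVM.
"""

# VMX exit-reason field: bits 15:0 = basic exit reason,
# bit 31 = set when the exit was caused by a failed VM entry.
VM_ENTRY_FAILURE_BIT = 1 << 31
BASIC_EXIT_REASONS = {
    33: "VM-entry failure due to invalid guest state",
    34: "VM-entry failure due to MSR loading",
    41: "VM-entry failure due to machine-check event",
}

# Architectural CR0 flag positions.
CR0_FLAGS = {
    0: "PE", 1: "MP", 2: "EM", 3: "TS", 4: "ET", 5: "NE",
    16: "WP", 18: "AM", 29: "NW", 30: "CD", 31: "PG",
}


def decode_exit_reason(value: int) -> dict:
    """Split a raw exit-reason value into its basic reason and failure flag."""
    basic = value & 0xFFFF
    return {
        "basic": basic,
        "description": BASIC_EXIT_REASONS.get(basic, "not in this table"),
        "vm_entry_failure": bool(value & VM_ENTRY_FAILURE_BIT),
    }


def decode_cr0(value: int) -> list:
    """Return the names of the CR0 flags that are set, lowest bit first."""
    return [name for bit, name in sorted(CR0_FLAGS.items()) if value & (1 << bit)]


if __name__ == "__main__":
    print(decode_exit_reason(0x80000021))
    print(decode_cr0(0x00000011))
```

Applied to this log, 0x80000021 splits into basic exit reason 33 with the VM-entry failure bit set. CR0=00000011 has only PE and ET set, which means protected mode with paging disabled.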
Please provide the package NVR for which bug is seen:
kernel-5.14.0-284.30.1.el9_2.x86_64
How reproducible:
100% if you try several times.
About 90% of the time it happens on the very first boot.
Steps to reproduce
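- In the L1 VM, trigger an immediate reboot (sysrq "b" reboots at once, without syncing or unmounting filesystems): # echo b > /proc/sysrq-trigger
- Wait until the L1 VM restarts and the L2 VMs are started.
- Run "virsh list" in the L1 VM and look for a VM in the "paused" state.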
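The last step above can be automated with the following sketch. It assumes the default table layout printed by `virsh list`: a header, a dashed separator, then one "Id Name State" row per domain. The helper names are hypothetical.

```python
import subprocess


def paused_domains(virsh_list_output: str) -> list:
    """Return the names of domains whose State column is 'paused'."""
    paused = []
    for line in virsh_list_output.splitlines():
        fields = line.split()
        # Skip blank lines, the dashed separator (one field) and the header.
        if len(fields) < 3 or fields[0] == "Id":
            continue
        # State may be several words (e.g. "shut off"), so join the tail.
        name, state = fields[1], " ".join(fields[2:])
        if state == "paused":
            paused.append(name)
    return paused


def paused_domains_on_host() -> list:
    """Run `virsh list` locally (needs libvirt access) and parse its output."""
    result = subprocess.run(
        ["virsh", "list"], capture_output=True, text=True, check=True
    )
    return paused_domains(result.stdout)
```

libvirt can also do the filtering itself. `virsh list --state-paused --name` prints only the paused domain names, and `virsh domstate --reason <domain>` reports why a domain is paused. The parser is mainly useful for post-processing saved `virsh list` output across many reboot attempts.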
Expected results
All L2 VMs are running.
Actual results
1 of the 10 L2 VMs is paused.
Detailed description:
Configuration:
- host (L0): RHEL9.2, one L1 VM is running
- L1 VM: RHEL9.2, 10 L2 VMs are running
- L2 VMs: Guest OS: cirros-0.4.0-x86_64
$ uname -a
Linux cirros 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24 10:09:13 UTC 2016 x86_64 GNU/Linux
L0 (host):
[root@rhel9test ~]# cat /etc/redhat-release
Red Hat Enterprise Linux release 9.2 (Plow)
[root@rhel9test ~]# uname -a
Linux rhel9test.aci 5.14.0-284.30.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Aug 25 09:13:12 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
CPU:
model name : Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
(20 threads)
[root@rhel9test ~]# virsh list
Id Name State
--------------------------
1 nestedrh running
L1: VM config: CPUs: 4, MEM: 32 GB
[root@localhost ~]# cat /etc/redhat-release
Red Hat Enterprise Linux release 9.2 (Plow)
[root@localhost ~]# uname -a
Linux localhost.localdomain 5.14.0-284.30.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Aug 25 09:13:12 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
[root@localhost ~]# virsh list
Id Name State
------------------------
1 test1 paused
2 test9 running
3 test4 running
4 test3 running
5 test8 running
6 test5 running
7 test6 running
8 test2 running
9 test7 running
10 test10 running
L2: VM config: CPUs: 2, MEM: 512 MB
Guest OS: cirros-0.4.0-x86_64
$ cat /etc/os-release
NAME=Buildroot
VERSION=2015.05-g31af4e3-dirty
ID=buildroot
VERSION_ID=2015.05
PRETTY_NAME="Buildroot 2015.05"
$ uname -a
Linux cirros 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24 10:09:13 UTC 2016 x86_64 GNU/Linux
How do I reproduce it (easy!):
[L1 VM]# echo b > /proc/sysrq-trigger
With ~90% probability, after the L1 VM restarts one of the L2 VMs will be in the "paused" state,
with the following complaints in the logs:
[root@localhost ~]# virsh list
Id Name State
------------------------
1 test3 paused
...
L1 dmesg:
[ 5.509169] virbr0: port 1(vnet0) entered listening state
[ 5.902969] set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.
L1 journalctl:
Oct 23 13:25:42 localhost.localdomain systemd[1]: Started Virtual Machine qemu-1-test3.
Oct 23 13:25:42 localhost.localdomain virtqemud[1810]: 2023-10-23 10:25:42.442+0000: 1810: info : libvirt version: 9.0.0, package: 10.3.el9_2 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2023-08-24-06:08:50, )
Oct 23 13:25:42 localhost.localdomain virtqemud[1810]: 2023-10-23 10:25:42.442+0000: 1810: info : hostname: localhost.localdomain
Oct 23 13:25:42 localhost.localdomain virtqemud[1810]: 2023-10-23 10:25:42.442+0000: 1810: warning : virSecurityValidateTimestamp:205 : Invalid XATTR timestamp detected on /var/lib/libvirt/images/test3.qcow2 secdriver=dac
Oct 23 13:25:42 localhost.localdomain kernel: set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.
Oct 23 13:25:43 localhost.localdomain virtqemud[1855]: 2023-10-23 10:25:43.329+0000: 1855: info : libvirt version: 9.0.0, package: 10.3.el9_2 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2023-08-24-06:08:50, )
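Both the L1 dmesg and the journal carry the kernel hint "set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state." This means L1's KVM skipped dumping the VMCS for the failing L2 guest. The sketch below turns the parameter on inside the L1 VM before the next reproduction attempt. It assumes the parameter is writable at runtime through the standard /sys/module/<module>/parameters/ path on this kernel, which this report does not verify. The helper names are hypothetical.

```python
from pathlib import Path

# Standard sysfs location of a module parameter. Whether this one is
# writable at runtime on 5.14.0-284.30.1.el9_2 is an assumption to verify.
DUMP_INVALID_VMCS = Path("/sys/module/kvm_intel/parameters/dump_invalid_vmcs")


def is_dump_invalid_vmcs_enabled(param: Path = DUMP_INVALID_VMCS) -> bool:
    """Read the current value without changing it.

    Boolean module parameters normally read back from sysfs as 'Y' or 'N';
    '1' is accepted too so the check also works against a plain file.
    """
    return param.read_text().strip() in ("Y", "1")


def enable_dump_invalid_vmcs(param: Path = DUMP_INVALID_VMCS) -> bool:
    """Set the parameter to 1 (needs root in the L1 VM) and confirm it stuck."""
    param.write_text("1\n")
    return is_dump_invalid_vmcs_enabled(param)
```

If the parameter turns out to be read-only at runtime, the same setting can be passed on the L1 kernel command line as kvm_intel.dump_invalid_vmcs=1. That is the form the message itself uses.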
L1 /var/log/libvirt/qemu/test3.log:
char device redirected to /dev/pts/0 (label charserial0)
ERROR cluster 597 refcount=0 reference=1
ERROR cluster 601 refcount=0 reference=1
Rebuilding refcount structure
Repairing cluster 600 refcount=1 reference=0
Repairing cluster 602 refcount=1 reference=0
2023-10-23T10:25:42.465618Z qemu-kvm: warning: Machine type 'pc-i440fx-rhel7.6.0' is deprecated: machine types for previous major releases are deprecated
KVM: entry failed, hardware error 0x80000021
If you're running a guest on an Intel machine without unrestricted mode support, the failure can be most likely due to the guest entering an invalid state for Intel VT. For example, the guest maybe running in big real mode which is not supported on less recent Intel processors.
EAX=febc0001 EBX=00000030 ECX=febc0001 EDX=00000cfc
ESI=00000000 EDI=00000000 EBP=1efeb3f0 ESP=00006d8c
EIP=000ec1fc EFL=00000086 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 00000000 00008000 DPL=0 Reserved
CS =0000 00000000 00000000 00c09b00 DPL=0 CS32 [-RA]
SS =0000 00000000 00000000 00c09300 DPL=0 DS [-WA]
DS =0000 00000000 00000000 00008000 DPL=0 Reserved
FS =0000 00000000 00000000 00008000 DPL=0 Reserved
GS =0000 00000000 00000000 00008000 DPL=0 Reserved
LDT=0000 00000000 00000000 00008000 DPL=0 Reserved
TR =0000 00000000 00000000 00008000 DPL=0 Reserved
GDT= 00000000 00000000
IDT= 00000000 00000000
CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=d8 0d 00 00 00 80 ba f8 0c 00 00 ef ba fc 0c 00 00 89 c8 ef <5b> 5e c3 56 53 89 d3 8b 15 f8 54 0f 00 85 d2 0f b7 c0 74 0c 01 da c1 e0 0c 01 c2 66 89 0a
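The report identifies the paused guest (test3) by reading its per-domain log under /var/log/libvirt/qemu/ in the L1 VM. With ten L2 guests autostarting, a scan like the one below can find which domain logs contain the QEMU "entry failed" message. This is a sketch under stated assumptions. It expects the default libvirt log directory and looks only at current `*.log` files, not rotated ones. The function name is hypothetical.

```python
import re
from pathlib import Path

# Matches the QEMU message seen in the log above.
ENTRY_FAILED = re.compile(r"KVM: entry failed, hardware error (0x[0-9a-fA-F]+)")


def find_entry_failures(log_dir: str = "/var/log/libvirt/qemu") -> dict:
    """Map each domain log (by file stem) to the hardware error codes it reports."""
    failures = {}
    for log in sorted(Path(log_dir).glob("*.log")):
        codes = ENTRY_FAILED.findall(log.read_text(errors="replace"))
        if codes:
            failures[log.stem] = [int(code, 16) for code in codes]
    return failures
```

Each affected L2 VM appears under its domain name, e.g. `test3` for `test3.log`. Running the scan after every reboot shows whether the failing guest changes from run to run.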