-
Bug
-
Resolution: Won't Do
-
Major
-
rhel-9.3.0
-
None
-
Important
-
rhel-sst-virtualization
-
ssg_virtualization
-
5
-
False
-
-
None
-
Red Hat Enterprise Linux
-
None
-
None
-
None
-
If docs needed, set a value
-
-
x86_64
-
None
Considering this is a customer bug and it can be reproduced on RHEL 9.3, it was cloned.
- rpm -q qemu-kvm
  qemu-kvm-8.0.0-9.el9.x86_64
- uname -r
  5.14.0-345.el9.x86_64
+++ This bug was initially created as a clone of Bug #2054781 +++
Description of problem:
- Windows instances crash from time to time
- In the Windows OS, a power-loss event is logged for all of the Windows crashes
- The server is certified for RHOSP 16 and RHEL 8 - https://catalog.redhat.com/hardware/servers/detail/2941651
- Between March and June 2021 the environment had an OSP upgrade from OSP 13 to 16 and a firmware/BIOS update on the compute nodes
- The issue occurs on Windows 2012 & 2016; most reports are on 2016
Version-Release number of selected component (if applicable):
[redhat-release] Red Hat Enterprise Linux release 8.2 (Ootpa)
[rhosp-release] Red Hat OpenStack Platform release 16.1.3 GA (Train)
- qemu-kvm and libvirtd are containerized, and this host is using:
"url": "https://access.redhat.com/containers/#/registry.access.redhat.com/rhosp16/openstack-nova-libvirt/images/16.1.3-7.1614767861",
Which corresponds to this:
https://catalog.redhat.com/software/containers/rhosp-rhel8/openstack-nova-libvirt/5de6c2ddbed8bd164a0c1bbf?tag=16.1.3-7.1614767861&push_date=1615227731000&container-tabs=packages
So the qemu-kvm and libvirt versions:
- qemu-kvm-4.2.0-29.module+el8.2.1+9791+7d72b149.6.x86_64
- libvirt-daemon-6.0.0-25.5.module+el8.2.1+8680+ea98947b.x86_64
How reproducible:
We did not find a way to reproduce it; it happens randomly.
Additional info:
gdb -e /usr/libexec/qemu-kvm -c ./core.qemu-kvm.107.5c1789ec0e454a61a539f2120495cc87.340182.1644135667000000
BFD: warning: /home/fdelorey/Desktop/./core.qemu-kvm.107.5c1789ec0e454a61a539f2120495cc87.340182.1644135667000000 is truncated: expected core file size >= 34764460032, found: 2147483648
MANY LINES DELETED:
Failed to read a valid object file image from memory.
Core was generated by `/usr/libexec/qemu-kvm -name guest=instance-00005xxx,debug-threads=on -S -object'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007ff84c48470f in ?? ()
[Current thread is 1 (LWP 340200)]
(gdb) bt
#0 0x00007ff84c48470f in ?? ()
Backtrace stopped: Cannot access memory at address 0x7ff83f7fd110
— Additional comment from Luigi Tamagnone on 2022-02-15 17:00:57 UTC —
Additional info:
- Customer configured capturing core dumps for QEMU[1].
- We have a coredump from a crashed Windows instance: core.qemu-kvm.107.5c1789ec0e454a61a539f2120495cc87.340182.1644135667000000.lz4 - Feb 06 2022
- compute SOSreport sosreport-osrtpz81011-sosreport-osrtpz81011-03112469-2022-02-07-ivfiiij.tar.xz
- Instance information:
~~~
(overcloud) [stack@osdirrtpz801 ~]$ openstack server show 773b0d15-a735-43bb-82cb-fdefcad28ea3
+-------------------------------------+--------------------------------------+
| Field                               | Value                                |
+-------------------------------------+--------------------------------------+
| OS-EXT-SRV-ATTR:hostname            | vc2crtp2435874p                      |
| OS-EXT-SRV-ATTR:hypervisor_hostname | osrtpz81011.localdomain              |
| OS-EXT-SRV-ATTR:instance_name       | instance-00005f1d                    |
| OS-EXT-STS:power_state              | Running                              |
| OS-SRV-USG:launched_at              | 2021-10-10T07:21:29.000000           |
| host_status                         | UP                                   |
| id                                  | 773b0d15-a735-43bb-82cb-fdefcad28ea3 |
| name                                | vc2crtp2435874p                      |
| updated                             | 2022-02-06T11:45:52Z                 |
+-------------------------------------+--------------------------------------+
~~~
- The VM in question is 'instance-00005f1d'.
- crash from qemu log 0110-qemulogs_osrtpz81011.tar.gz/var/log/libvirt/qemu/instance-00005f1d.log
~~~
2022-02-06 08:21:17.498+0000: shutting down, reason=crashed
~~~
- At the same timestamp, the libvirtd logs indicate:
~~~
$ cat var/log/containers/libvirt/libvirtd.log.1
2022-02-06 08:21:17.296+0000: 324502: error : qemuMonitorIO:620 : internal error: End of file from qemu monitor
2022-02-06 08:21:17.525+0000: 332200: error : virProcessRunInFork:1161 : internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: End of file from qemu monitor
2022-02-06 08:21:17.529+0000: 332200: error : virProcessRunInFork:1161 : internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: End of file from qemu monitor
2022-02-06 08:21:17.529+0000: 332200: warning : qemuBlockRemoveImageMetadata:2774 : Unable to remove disk metadata on vm instance-00005f1d from /var/lib/nova/mnt/0cffc2e1851ef7b9185d0b1702bd3d8e/volume-9b2c8658-4b54-409d-93eb-f934a8540ceb (disk target vda)
2022-02-06 08:21:17.532+0000: 332200: error : virProcessRunInFork:1161 : internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: End of file from qemu monitor
2022-02-06 08:21:17.536+0000: 332200: error : virProcessRunInFork:1161 : internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: End of file from qemu monitor
2022-02-06 08:21:17.537+0000: 332200: warning : qemuBlockRemoveImageMetadata:2774 : Unable to remove disk metadata on vm instance-00005f1d from /var/lib/nova/mnt/0cffc2e1851ef7b9185d0b1702bd3d8e/volume-edbf8607-742f-48b8-9386-174317095545 (disk target vdb)
~~~
core dump info
~~~
[root@osrtpz81011 coredump]# coredumpctl info 340182
PID: 340182 (qemu-kvm)
UID: 107 (qemu)
GID: 107 (qemu)
Signal: 6 (ABRT)
Timestamp: Sun 2022-02-06 08:21:07 UTC (1 day 9h ago)
Command Line: /usr/libexec/qemu-kvm -name guest=instance-00005f1d,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v>
Executable: /usr/libexec/qemu-kvm
Control Group: /
Slice: -.slice
Boot ID: 5c1789ec0e454a61a539f2120495cc87
Machine ID: baaf5b74e7614e1383abdba4731cd651
Hostname: osrtpz81011
Storage: /var/lib/systemd/coredump/core.qemu-kvm.107.5c1789ec0e454a61a539f2120495cc87.340182.1644135667000000.lz4 (truncated)
Message: Process 340182 (qemu-kvm) of user 107 dumped core.
Stack trace of thread 340200:
#0 0x00007ff84c48470f n/a (n/a)
~~~
- We also have another truncated dump, core.qemu-kvm.107.5c1789ec0e454a61a539f2120495cc87.148169.1644824003000000.lz4
- We asked to change[1] ProcessSizeMax= and ExternalSizeMax= in /etc/systemd/coredump.conf to a higher value, so that we get a full dump if what we have is not enough
— Additional comment from RHEL Program Management on 2022-02-17 19:57:29 UTC —
pm_ack is no longer used for this product. The flag has been reset.
See https://issues.redhat.com/browse/PTT-1821 for additional details.
— Additional comment from RHEL Program Management on 2022-02-17 19:57:29 UTC —
This bug was reopened or transitioned from a non-RHEL to RHEL product. The stale date has been reset to +6 months.
— Additional comment from Guo, Zhiyi on 2022-02-21 08:48:56 UTC —
The qemu command line of the VM and the crash reason, found in the customer's latest sos report:
2022-02-03 06:10:18.399+0000: starting up libvirt version: 6.0.0, package: 25.5.module+el8.2.1+8680+ea98947b (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2020-11-06-13:17:30, ), qemu version: 4.2.0qemu-kvm-4.2.0-29.module+el8.2.1+9791+7d72b149.6, kernel: 4.18.0-193.29.1.el8_2.x86_64, hostname: osrtpz81011.localdomain
LC_ALL=C \
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
HOME=/var/lib/libvirt/qemu/domain-5-instance-00005f1d \
XDG_DATA_HOME=/var/lib/libvirt/qemu/domain-5-instance-00005f1d/.local/share \
XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain-5-instance-00005f1d/.cache \
XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain-5-instance-00005f1d/.config \
QEMU_AUDIO_DRV=none \
/usr/libexec/qemu-kvm \
-name guest=instance-00005f1d,debug-threads=on \
-S \
-object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-5-instance-00005f1d/master-key.aes \
-machine pc-i440fx-rhel7.6.0,accel=kvm,usb=off,dump-guest-core=on \
-cpu Haswell-noTSX,vme=on,f16c=on,rdrand=on,hypervisor=on,arat=on,xsaveopt=on,abm=on \
-m 32768 \
-overcommit mem-lock=off \
-smp 6,sockets=6,dies=1,cores=1,threads=1 \
-uuid 773b0d15-a735-43bb-82cb-fdefcad28ea3 \
-smbios 'type=1,manufacturer=Red Hat,product=OpenStack Compute,version=20.4.1-1.20200917173450.el8ost,serial=773b0d15-a735-43bb-82cb-fdefcad28ea3,uuid=773b0d15-a735-43bb-82cb-fdefcad28ea3,family=Virtual Machine' \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,fd=37,server,nowait \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=utc,driftfix=slew \
-global kvm-pit.lost_tick_policy=delay \
-no-hpet \
-no-shutdown \
-boot strict=on \
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
-blockdev '{"driver":"file","filename":"/var/lib/nova/mnt/0cffc2e1851ef7b9185d0b1702bd3d8e/volume-9b2c8658-4b54-409d-93eb-f934a8540ceb","aio":"native","node-name":"libvirt-2-storage","cache":
,"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-2-format","read-only":false,"cache":
,"driver":"raw","file":"libvirt-2-storage"}' \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=libvirt-2-format,id=virtio-disk0,bootindex=1,write-cache=on,serial=9b2c8658-4b54-409d-93eb-f934a8540ceb \
-blockdev '{"driver":"file","filename":"/var/lib/nova/mnt/0cffc2e1851ef7b9185d0b1702bd3d8e/volume-edbf8607-742f-48b8-9386-174317095545","aio":"native","node-name":"libvirt-1-storage","cache":
,"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"cache":
,"driver":"raw","file":"libvirt-1-storage"}' \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=libvirt-1-format,id=virtio-disk1,write-cache=on,serial=edbf8607-742f-48b8-9386-174317095545 \
-netdev tap,fd=39,id=hostnet0,vhost=on,vhostfd=40 \
-device virtio-net-pci,rx_queue_size=512,host_mtu=9000,netdev=hostnet0,id=net0,mac=00:16:3e:09:55:49,bus=pci.0,addr=0x3 \
-add-fd set=3,fd=42 \
-chardev pty,id=charserial0,logfile=/dev/fdset/3,logappend=on \
-device isa-serial,chardev=charserial0,id=serial0 \
-device usb-tablet,id=input0,bus=usb.0,port=1 \
-vnc 192.168.81.168:3 \
-device cirrus-vga,id=video0,bus=pci.0,addr=0x2 \
-incoming defer \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on
char device redirected to /dev/pts/9 (label charserial0)
2022-02-03T06:10:18.581568Z qemu-kvm: -device cirrus-vga,id=video0,bus=pci.0,addr=0x2: warning: 'cirrus-vga' is deprecated, please use a different VGA card instead
2022-02-03T06:10:47.654624Z qemu-kvm: warning: TSC frequency mismatch between VM (2893202 kHz) and host (2199997 kHz), and TSC scaling unavailable
2022-02-03T06:10:47.654888Z qemu-kvm: warning: TSC frequency mismatch between VM (2893202 kHz) and host (2199997 kHz), and TSC scaling unavailable
2022-02-03T06:10:47.655019Z qemu-kvm: warning: TSC frequency mismatch between VM (2893202 kHz) and host (2199997 kHz), and TSC scaling unavailable
2022-02-03T06:10:47.655172Z qemu-kvm: warning: TSC frequency mismatch between VM (2893202 kHz) and host (2199997 kHz), and TSC scaling unavailable
2022-02-03T06:10:47.655284Z qemu-kvm: warning: TSC frequency mismatch between VM (2893202 kHz) and host (2199997 kHz), and TSC scaling unavailable
2022-02-03T06:10:47.655391Z qemu-kvm: warning: TSC frequency mismatch between VM (2893202 kHz) and host (2199997 kHz), and TSC scaling unavailable
qemu-kvm: /builddir/build/BUILD/qemu-4.2.0/hw/rtc/mc146818rtc.c:201: periodic_timer_update: Assertion `lost_clock >= 0' failed.
2022-02-06 08:21:17.498+0000: shutting down, reason=crashed
So the issue looks like a qemu crash caused by the RTC timer.
— Additional comment from Yanhui Ma on 2022-02-22 06:17:31 UTC —
I tried to reproduce the issue with the following steps, but could not:
1. Installed one rhel8.2.1 host and installed the packages and win2016 guest used by customer:
- rpm -q qemu-kvm
  qemu-kvm-4.2.0-29.module+el8.2.1+9791+7d72b149.6.x86_64
- uname -r
  4.18.0-193.29.1.el8_2.x86_64
2. Booted the win2016 guest for a whole night:
/usr/libexec/qemu-kvm \
-name guest=instance-00005f1d,debug-threads=on \
-S \
-machine pc-i440fx-rhel7.6.0,accel=kvm,usb=off,dump-guest-core=on \
-cpu SandyBridge-IBRS,vme=on,f16c=on,rdrand=on,hypervisor=on,arat=on,xsaveopt=on,abm=on \
-m 32768 \
-overcommit mem-lock=off \
-smp 6,sockets=6,dies=1,cores=1,threads=1 \
-uuid 773b0d15-a735-43bb-82cb-fdefcad28ea3 \
-smbios 'type=1,manufacturer=Red Hat,product=OpenStack Compute,version=20.4.1-1.20200917173450.el8ost,serial=773b0d15-a735-43bb-82cb-fdefcad28ea3,uuid=773b0d15-a735-43bb-82cb-fdefcad28ea3,family=Virtual Machine' \
-no-user-config \
-nodefaults \
-rtc base=utc,driftfix=slew \
-global kvm-pit.lost_tick_policy=delay \
-no-hpet \
-no-shutdown \
-boot strict=on \
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
-blockdev '{"driver":"file","filename":"/home/win2016-64-virtio.raw","aio":"native","node-name":"libvirt-2-storage","cache":
,"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-2-format","read-only":false,"cache":
,"driver":"raw","file":"libvirt-2-storage"}' \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=libvirt-2-format,id=virtio-disk0,bootindex=1,write-cache=on,serial=9b2c8658-4b54-409d-93eb-f934a8540ceb \
-netdev tap,id=hostnet0,vhost=on \
-device virtio-net-pci,rx_queue_size=512,host_mtu=9000,netdev=hostnet0,id=net0,mac=00:16:3e:09:55:49,bus=pci.0,addr=0x3 \
-device usb-tablet,id=input0,bus=usb.0,port=1 \
-vnc :0 \
-device cirrus-vga,id=video0,bus=pci.0,addr=0x2 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 \
-sandbox on \
-msg timestamp=on -monitor stdio
3. Then changed the guest time backwards/forwards, after that, rebooted the guest.
— Additional comment from Yanhui Ma on 2022-03-10 03:02:47 UTC —
I ran all our Windows timer device cases with the test environment from comment 5 and the same qemu command line as the customer.
Still can't reproduce the issue.
Summary:
Finshed=25, PASS=25
And here is the related code; does anyone have any suggestions on how to reproduce the bug?
190 /*
191 * if the periodic timer's update is due to period re-configuration,
192 * we should count the clock since last interrupt.
193 */
194 if (old_period && period_change)
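For context, here is a minimal, self-contained C sketch of the arithmetic that follows the quoted lines. It is an approximation for illustration, not the actual QEMU source: the periodic timer computes how many RTC ticks were "lost" since the last programmed interrupt as cur_clock - last_periodic_clock and asserts the result is non-negative. The sample values are taken from the debug output reported later in this bug (comment 29); when the reference clock steps backwards, cur_clock ends up behind last_periodic_clock and the assertion aborts qemu-kvm.
/* Minimal model of the failing assertion in periodic_timer_update();
 * an approximation for illustration, not the actual QEMU code. */
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static int64_t lost_ticks(int64_t last_periodic_clock, int64_t cur_clock)
{
    /* "lost" RTC ticks since the last programmed periodic interrupt */
    int64_t lost_clock = cur_clock - last_periodic_clock;

    assert(lost_clock >= 0);   /* the assert that aborts qemu-kvm */
    return lost_clock;
}

int main(void)
{
    /* Values taken from the debug output in comment 29 of this bug. */

    /* Normal case: the clock moved forward -> lost_clock = 242. */
    printf("forward:  lost_clock=%" PRId64 "\n",
           lost_ticks(54439076429936LL, 54439076430178LL));

    /* Failure case: current_time went backwards between updates, so
     * cur_clock is behind last_periodic_clock -> lost_clock = -308
     * and the assertion fires, which is the reported crash. */
    printf("backward: lost_clock=%" PRId64 "\n",
           lost_ticks(54439076430192LL, 54439076429884LL));

    return 0;
}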
— Additional comment from Luigi Tamagnone on 2022-03-10 17:36:56 UTC —
Hi,
There is BUG 1996111, which is different, but it also seems to be a memory issue. Maybe they are correlated, but we didn't find the RCA for that one either.
— Additional comment from Yvugenfi@redhat.com on 2022-03-22 09:53:02 UTC —
Can we instruct the customer to use the hv_stimer enlightenment, so the RTC will not be used in the guest?
In any case, hv_stimer should be used instead of the RTC for better performance.
— Additional comment from Artom Lifshitz on 2022-03-25 13:32:33 UTC —
(In reply to Yvugenfi@redhat.com from comment #8)
> Can we instruct the customer to use hv_stimer enlightenment, so the RTC will
> not be used on the guest?
> In any case hv_stimer should be used instead fo RTC for better performance.
As this is an OpenStack Nova guest, the customer has no direct control over libvirt and/or the qemu command line. As a Nova developer, I don't know what this "hv_stimer enlightenment" is, but for Nova to expose it to our users, we would need libvirt to first enable it in its domain XML. As far as I can tell from reading [1], this is currently not the case, so it would first be a libvirt RFE.
[1] https://libvirt.org/formatdomain.html
— Additional comment from Artom Lifshitz on 2022-03-25 14:03:26 UTC —
(In reply to Artom Lifshitz from comment #9)
> (In reply to Yvugenfi@redhat.com from comment #8)
> > Can we instruct the customer to use hv_stimer enlightenment, so the RTC will
> > not be used on the guest?
> > In any case hv_stimer should be used instead fo RTC for better performance.
>
> As this is an OpenStack Nova guest, the customer has no direct control over
> libvirt and/or the qemu command line. As a Nova developer, I don't know what
> this "hv_stimer enlightenment" is, but for Nova to expose it to our users,
> we would need libvirt to first enable it in its domain XML. As far as I can
> tell from reading [1], this is currently not the case, so it would first be
> a libvirt RFE.
>
> [1] https://libvirt.org/formatdomain.html
I stand corrected: libvirt supports the hv_stimer enlightenment [2], we just need to add it to Nova.
[2] https://libvirt.org/formatdomain.html#hypervisor-features
— Additional comment from Yanhui Ma on 2022-04-11 08:55:55 UTC —
(In reply to Yanhui Ma from comment #6)
> I ran all our windows timer device cases with the test environment on
> comment 5 and the same qemu cmd line as customer.
> Still can't reproduce the issue.
> Summary:
> Finshed=25, PASS=25
>
> And here is the related code, does anyone have any suggestions on how to
> reproduce the bug?
>
>
> 190 /*
>
> 191 * if the periodic timer's update is due to period
> re-configuration,
> 192 * we should count the clock since last interrupt.
>
> 193 */
>
> 194 if (old_period && period_change)
Hi Yan,
Could you please help check comment 6? Is this the related code? Any suggestions on how to reproduce it?
— Additional comment from Yvugenfi@redhat.com on 2022-04-11 14:04:40 UTC —
(In reply to Yanhui Ma from comment #11)
> (In reply to Yanhui Ma from comment #6)
> > I ran all our windows timer device cases with the test environment on
> > comment 5 and the same qemu cmd line as customer.
> > Still can't reproduce the issue.
> > Summary:
> > Finshed=25, PASS=25
> >
> > And here is the related code, does anyone have any suggestions on how to
> > reproduce the bug?
> >
> >
> > 190 /*
> >
> > 191 * if the periodic timer's update is due to period
> > re-configuration,
> > 192 * we should count the clock since last interrupt.
> >
> > 193 */
> >
> > 194 if (old_period && period_change)
>
> Hi Yan,
>
> Could you please help check the comment 6? Is this the related code? Any
> suggestions to reproduce it?
Hi Yanhui,
It might be that writing an application that plays with timeBeginPeriod and timeEndPeriod can trigger the change in Windows timer resolution.
https://docs.microsoft.com/en-us/windows/win32/api/timeapi/nf-timeapi-timebeginperiod
https://docs.microsoft.com/en-us/windows/win32/api/timeapi/nf-timeapi-timeendperiod
Some additional discussion here: https://randomascii.wordpress.com/2020/10/04/windows-timer-resolution-the-great-rule-change/
— Additional comment from Yanhui Ma on 2022-04-14 06:52:03 UTC —
Hi Luigi,
I just saw that the related case https://access.redhat.com/support/cases/03112469 has been closed, right? So I want to confirm whether the customer issue has been solved. Should we keep the bug open?
— Additional comment from Luigi Tamagnone on 2022-04-14 08:17:08 UTC —
GSS Tools closed the case:
08/04/2022 7.15 GSS Tools Changed Internal Status from Waiting on Engineering to Closed.
I'm not sure why. I'll reopen it and ask.
— Additional comment from Luigi Tamagnone on 2022-04-14 10:22:44 UTC —
Hi,
I found a KCS[1] that talks about timer tree corruption leading to missed wakeups and a system freeze.
Could it be related to this issue?
[1] https://access.redhat.com/solutions/6286391
https://bugzilla.redhat.com/show_bug.cgi?id=1984798
https://access.redhat.com/errata/RHSA-2021:4871
— Additional comment from Yvugenfi@redhat.com on 2022-05-02 08:39:08 UTC —
(In reply to Luigi Tamagnone from comment #15)
> Hi,
>
> I found a KCS[1] that talks about Timer tree corruption leads to missing
> wakeup and system freeze.
> Could be bound with this issue?
>
>
> [1] https://access.redhat.com/solutions/6286391
> https://bugzilla.redhat.com/show_bug.cgi?id=1984798
> https://access.redhat.com/errata/RHSA-2021:4871
Hi Luigi,
It looks like a different issue (although with similar symptoms; the BZ above is a Linux kernel bug, while in our case Windows changing the timer quantum triggers a bug in QEMU).
— Additional comment from Guo, Zhiyi on 2022-06-09 06:52:22 UTC —
I'm able to reproduce the crash at qemu-kvm: /builddir/build/BUILD/qemu-4.2.0/hw/rtc/mc146818rtc.c:201: periodic_timer_update: Assertion `lost_clock >= 0' failed, with the following steps:
1. Boot a Windows 2016 VM with the qemu command line below:
/usr/libexec/qemu-kvm \
-name guest=win2016,debug-threads=on \
-machine pc,accel=kvm,usb=off,dump-guest-core=off \
-cpu Broadwell-IBRS \
-m 8192 \
-smp 4,sockets=1,cores=4,threads=1 \
-uuid f4c701cb-e3c7-4625-8ace-91b4b25f17dc \
-no-user-config \
-nodefaults \
-rtc base=utc,driftfix=slew \
-global kvm-pit.lost_tick_policy=delay \
-no-hpet \
-no-shutdown \
-boot strict=on \
-device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x4.0x7 \
-device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x4 \
-device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x4.0x1 \
-device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x4.0x2 \
-hda /home/win2016.qcow2 \
-netdev tap,id=hostnet0,vhost=on \
-device e1000e,netdev=hostnet0,id=net0,mac=52:54:00:80:d0:19,bus=pci.0,addr=0x3 \
-vnc 0.0.0.0:0 \
-device VGA,id=video0,bus=pci.0,addr=0x2 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 \
-object rng-random,id=objrng0,filename=/dev/urandom \
-device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x8 \
-msg timestamp=on \
2. Inside the VM, execute the Windows application from a PowerShell terminal:
.\rtctest.exe 1
The content of this application is below (it is the same application used in https://qemu-devel.nongnu.narkive.com/9Qzleow1/patch-0-5-mc146818rtc-fix-windows-vm-clock-faster; the Windows VM needs Visual Studio 2022 with the C++ development tools installed to provide the DLL dependencies):
#pragma comment(lib, "winmm")
#include <stdio.h>
#include <windows.h>

#define SWITCH_PEROID 13

int main(int argc, char* argv[])
{
    if (argc != 2)
    {
        /* original branch body not preserved in this copy of the bug;
           presumably it printed a usage message */
    }
    else
    {
        DWORD internal = atoi((char*)argv[1]);
        DWORD count = 0;
        while (1)
        {
            count++;
            timeBeginPeriod(1);
            DWORD start = timeGetTime();
            Sleep(internal);
            timeEndPeriod(1);
            if ((count % SWITCH_PEROID) == 0)
            {
                /* original branch body not preserved in this copy of the bug;
                   presumably it slept without the 1 ms resolution so the guest
                   keeps toggling the timer period */
            }
        }
    }
    return 0;
}
3. On host, execute:
- while [ 1 ];do hwclock --systohc; hwclock --hctosys;done
qemu will crash after a while with the following trace:
(gdb) bt
#0 0x00007f648121770f in raise () at /lib64/libc.so.6
#1 0x00007f6481201b25 in abort () at /lib64/libc.so.6
#2 0x00007f64812019f9 in _nl_load_domain.cold.0 () at /lib64/libc.so.6
#3 0x00007f648120fcc6 in .annobin_assert.c_end () at /lib64/libc.so.6
#4 0x000055b227d89f25 in periodic_timer_update (s=0x55b229b44800, current_time=<optimized out>, old_period=32, period_change=<optimized out>)
at /usr/src/debug/qemu-kvm-4.2.0-29.module+el8.2.1+15117+e1f00de1.12.x86_64/hw/rtc/mc146818rtc.c:201
#5 0x000055b227d8afcb in cmos_ioport_write (opaque=0x55b229b44800, addr=<optimized out>, data=<optimized out>, size=<optimized out>)
at /usr/src/debug/qemu-kvm-4.2.0-29.module+el8.2.1+15117+e1f00de1.12.x86_64/hw/rtc/mc146818rtc.c:515
#6 0x000055b227d3fd17 in memory_region_write_accessor
(mr=<optimized out>, addr=<optimized out>, value=<optimized out>, size=<optimized out>, shift=<optimized out>, mask=<optimized out>, attrs=...)
at /usr/src/debug/qemu-kvm-4.2.0-29.module+el8.2.1+15117+e1f00de1.12.x86_64/memory.c:484
#7 0x000055b227d3df4e in access_with_adjusted_size
(addr=addr@entry=1, value=value@entry=0x7f64737fd508, size=size@entry=1, access_size_min=<optimized out>, access_size_max=<optimized out>, access_fn=
0x55b227d3fca0 <memory_region_write_accessor>, mr=0x55b229b44890, attrs=...)
at /usr/src/debug/qemu-kvm-4.2.0-29.module+el8.2.1+15117+e1f00de1.12.x86_64/memory.c:545
#8 0x000055b227d41ebc in memory_region_dispatch_write (mr=0x55b229b44890, addr=1, data=<optimized out>, op=<optimized out>, attrs=...)
at /usr/src/debug/qemu-kvm-4.2.0-29.module+el8.2.1+15117+e1f00de1.12.x86_64/memory.c:1480
#9 0x000055b227ceeeb7 in flatview_write_continue
(fv=0x7f645842f360, addr=113, attrs=..., buf=0x7f6486668000 "*", <incomplete sequence \314>, len=1, addr1=<optimized out>, l=<optimized out>, mr=0x55b229b44890) at /usr/src/debug/qemu-kvm-4.2.0-29.module+el8.2.1+15117+e1f00de1.12.x86_64/include/qemu/host-utils.h:164
#10 0x000055b227cef0d6 in flatview_write (fv=0x7f645842f360, addr=113, attrs=..., buf=0x7f6486668000 "*", <incomplete sequence \314>, len=1)
at /usr/src/debug/qemu-kvm-4.2.0-29.module+el8.2.1+15117+e1f00de1.12.x86_64/exec.c:3169
#11 0x000055b227cf35ef in address_space_write () at /usr/src/debug/qemu-kvm-4.2.0-29.module+el8.2.1+15117+e1f00de1.12.x86_64/exec.c:3259
#12 0x000055b227d50f74 in kvm_handle_io (count=1, size=1, direction=<optimized out>, data=<optimized out>, attrs=..., port=113)
at /usr/src/debug/qemu-kvm-4.2.0-29.module+el8.2.1+15117+e1f00de1.12.x86_64/accel/kvm/kvm-all.c:2130
#13 0x000055b227d50f74 in kvm_cpu_exec (cpu=<optimized out>)
at /usr/src/debug/qemu-kvm-4.2.0-29.module+el8.2.1+15117+e1f00de1.12.x86_64/accel/kvm/kvm-all.c:2376
#14 0x000055b227d35b7e in qemu_kvm_cpu_thread_fn (arg=0x55b229bc8250) at /usr/src/debug/qemu-kvm-4.2.0-29.module+el8.2.1+15117+e1f00de1.12.x86_64/cpus.c:1318
#15 0x000055b228062904 in qemu_thread_start (args=0x55b229bf0cd0) at util/qemu-thread-posix.c:519
#16 0x00007f64815aa2de in start_thread () at /lib64/libpthread.so.0
#17 0x00007f64812dbe83 in clone () at /lib64/libc.so.6
The qemu I used is qemu-kvm-4.2.0-29.module+el8.2.1+15117+e1f00de1.12.x86_64.
— Additional comment from Guo, Zhiyi on 2022-06-09 06:58:11 UTC —
— Additional comment from Guo, Zhiyi on 2022-06-09 06:59:49 UTC —
@yama@redhat.com please check whether you can reproduce the crash. Thanks!
Zhiyi
— Additional comment from Yanhui Ma on 2022-06-13 10:24:26 UTC —
I can also reproduce it with the steps in comment 17.
qemu-kvm-4.2.0-29.module+el8.2.1+9791+7d72b149.6.x86_64
kernel-4.18.0-193.19.1.el8_2.x86_64
win2016-64-virtio.raw
(qemu)
(qemu) qemu-kvm: /builddir/build/BUILD/qemu-4.2.0/hw/rtc/mc146818rtc.c:201: periodic_timer_update: Assertion `lost_clock >= 0' failed.
./cmd: line 29: 3791 Aborted (core dumped) /usr/libexec/qemu-kvm -name guest=instance-00005f1d,debug-threads=on -S -machine pc-i440fx-rhel7.6.0,accel=kvm,usb=off,dump-guest-core=on -cpu SandyBridge-IBRS,vme=on,f16c=on,rdrand=on,hypervisor=on,arat=on,xsaveopt=on,abm=on -m 32768 -overcommit mem-lock=off -smp 6,sockets=6,dies=1,cores=1,threads=1 -uuid 773b0d15-a735-43bb-82cb-fdefcad28ea3 -smbios 'type=1,manufacturer=Red Hat,product=OpenStack Compute,version=20.4.1-1.20200917173450.el8ost,serial=773b0d15-a735-43bb-82cb-fdefcad28ea3,uuid=773b0d15-a735-43bb-82cb-fdefcad28ea3,family=Virtual Machine' -no-user-config -nodefaults -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -blockdev '{"driver":"file","filename":"/home/win2016-64-virtio.raw","aio":"native","node-name":"libvirt-2-storage","cache":
,"auto-read-only":true,"discard":"unmap"}' -blockdev '{"node-name":"libvirt-2-format","read-only":false,"cache":
{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-2-storage"}' -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=libvirt-2-format,id=virtio-disk0,bootindex=1,write-cache=on,serial=9b2c8658-4b54-409d-93eb-f934a8540ceb -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,rx_queue_size=512,host_mtu=9000,netdev=hostnet0,id=net0,mac=00:16:3e:09:55:49,bus=pci.0,addr=0x3 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc :0 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -sandbox on -msg timestamp=on -monitor stdio
— Additional comment from Yanhui Ma on 2022-06-29 02:34:12 UTC —
Hello Kostiantyn
Could you please help check the reproduction steps on comment 17? Are they helpful to debug the bug?
— Additional comment from Kostiantyn Kostiuk on 2022-07-05 16:41:05 UTC —
Hi Yanhui Ma,
Yes, I reproduced the issue with the steps in comment 17.
— Additional comment from Kostiantyn Kostiuk on 2022-07-06 07:33:04 UTC —
Hi Yanhui Ma,
This bug is open for RHEL 8.2, but please try to reproduce it on RHEL 8.6 with QEMU v6.x.
I reproduced it on RHEL 8.2, but can't on RHEL 8.6.
So, I want to verify that there is no bug in RHEL 8.6.
— Additional comment from Yanhui Ma on 2022-07-06 11:36:14 UTC —
(In reply to Kostiantyn Kostiuk from comment #23)
> Hi Yanhui Ma,
>
> This bug is open for RHEL 8.2 but please try to reproduce it in RHEL 8.6
> with QEMU v6.x
>
> I reproduced it in RHEL 8.2, but can't in RHEL 8.6.
> So, I want to verify that there is no bug in RHEL 8.6.
Hello Kostiantyn,
I have tried it on a RHEL 8.7 host and the issue can be reproduced with both the RHEL 8.7 qemu and the RHEL 8.6 qemu.
qemu-kvm-6.2.0-16.module+el8.7.0+15743+c774064d.x86_64
qemu-kvm-6.2.0-11.module+el8.6.0+15668+464a1f31.2.x86_64
kernel-4.18.0-403.el8.x86_64
(qemu) qemu-kvm: ../hw/rtc/mc146818rtc.c:202: periodic_timer_update: Assertion `lost_clock >= 0' failed.
./cmd: line 29: 137292 Aborted (core dumped) /usr/libexec/qemu-kvm -name guest=instance-00005f1d,debug-threads=on -S -machine pc-i440fx-rhel7.6.0,accel=kvm,usb=off,dump-guest-core=on -cpu SandyBridge-IBRS,vme=on,f16c=on,rdrand=on,hypervisor=on,arat=on,xsaveopt=on,abm=on -m 8G -overcommit mem-lock=off -smp 6,sockets=6,dies=1,cores=1,threads=1 -uuid 773b0d15-a735-43bb-82cb-fdefcad28ea3 -smbios 'type=1,manufacturer=Red Hat,product=OpenStack Compute,version=20.4.1-1.20200917173450.el8ost,serial=773b0d15-a735-43bb-82cb-fdefcad28ea3,uuid=773b0d15-a735-43bb-82cb-fdefcad28ea3,family=Virtual Machine' -no-user-config -nodefaults -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -blockdev '{"driver":"file","filename":"/home/win2016-64-virtio.raw","aio":"native","node-name":"libvirt-2-storage","cache":
,"auto-read-only":true,"discard":"unmap"}' -blockdev '{"node-name":"libvirt-2-format","read-only":false,"cache":
{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-2-storage"}' -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=libvirt-2-format,id=virtio-disk0,bootindex=1,write-cache=on,serial=9b2c8658-4b54-409d-93eb-f934a8540ceb -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,rx_queue_size=512,host_mtu=9000,netdev=hostnet0,id=net0,mac=00:16:3e:09:55:49,bus=pci.0,addr=0x3 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc :0 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -sandbox on -msg timestamp=on -monitor stdio
— Additional comment from Kostiantyn Kostiuk on 2022-08-16 10:58:15 UTC —
I reserved a new machine with RHEL-8.7.0-20220816.0 BaseOS x86_64 and tried to reproduce this bug again with a clean Windows Server 2016.
I can't reproduce it.
I tried with RHEL-8.2.1-updates-20200811.0 BaseOS x86_64 and successfully reproduced it.
RHEL 8.7 env:
qemu-kvm-6.2.0-18.module+el8.7.0+15999+d24f860e.x86_64
kernel-4.18.0-416.el8.x86_64
Windows_Server_2016_Datacenter_EVAL_en-us_14393_refresh.ISO
— Additional comment from Kostiantyn Kostiuk on 2022-08-16 11:03:19 UTC —
Hi Yanhui Ma,
Can you please reproduce this bug in RHEL 8.7 and provide me access to your env?
— Additional comment from Yanhui Ma on 2022-08-17 02:36:47 UTC —
(In reply to Kostiantyn Kostiuk from comment #26)
> Hi Yanhui Ma,
>
> Can you please reproduce this bug in RHEL 8.7 and provide me access to your
> env?
Hello Kostiantyn,
Yes, I just reproduced the bug on the following host.
The host is dell-per440-22.lab.eng.pek2.redhat.com and the password is kvmautotest.
The guest is /home/win2016-64-virtio.raw.
The qemu command line is /home/cmd.
But the guest is not freshly installed; I just copied the previous Windows guest.
— Additional comment from Kostiantyn Kostiuk on 2022-08-22 10:43:32 UTC —
I reproduced the issue in my RHEL 8.7 env. Thanks!
— Additional comment from Kostiantyn Kostiuk on 2022-08-31 07:47:08 UTC —
I reproduced it on the current master branch.
I added some prints at line 202 (before the assert(lost_clock >= 0), https://gitlab.com/qemu-project/qemu/-/blob/master/hw/rtc/mc146818rtc.c#L202) and got the following values:
next_periodic_clock, old_period, last_periodic_clock, cur_clock, lost_clock, current_time
54439076429968, 32, 54439076429936, 54439076430178, 242, 1661348768010822000
54439076430224, 512, 54439076429712, 54439076430188, 476, 1661348768011117000
54439076430224, 32, 54439076430192, 54439076429884, -308, 1661348768001838000
The current_time value in the last print is lower than in the previous one.
So, the error occurs because time has gone backward.
I think this is a possible situation during time synchronization.
Continue investigation.
— Additional comment from Yanhui Ma on 2022-10-27 03:51:03 UTC —
(In reply to Kostiantyn Kostiuk from comment #29)
> I reproduced it on the current master branch.
>
> I added some print at line 202 (before assert(lost_clock >= 0),
> https://gitlab.com/qemu-project/qemu/-/blob/master/hw/rtc/mc146818rtc.
> c#L202) and got the following values:
>
> next_periodic_clock, old_period, last_periodic_clock, cur_clock, lost_clock,
> current_time
> 54439076429968, 32, 54439076429936, 54439076430178, 242, 1661348768010822000
> 54439076430224, 512, 54439076429712, 54439076430188, 476, 1661348768011117000
> 54439076430224, 32, 54439076430192, 54439076429884, -308, 1661348768001838000
>
> The current_time value in the last print is lower than in the previous one.
> So, the error occurs because time has gone backward.
>
> I think this is a possible situation during time synchronization.
>
> Continue investigation.
Hello Kostiantyn,
Could you please tell QE what the plan for fixing the bug is, and share any updates on it?
Since this is a customer bug, QE needs to track its status and hopes to complete the customer closed loop for it.
— Additional comment from Kostiantyn Kostiuk on 2022-10-27 07:57:35 UTC —
Hi Yanhui,
This is an upstream bug and we are still investigating it.
For now, a customer can only use hv_stimer instead.
— Additional comment from Yanhui Ma on 2023-04-07 03:29:55 UTC —
Hi Germano, Luigi and Kostiantyn,
Since the customer portal case has been closed and hv_stimer is a recommended configuration (with hv_stimer the customer will not hit the issue),
could we write a KCS for the bug and close the bug now?
— Additional comment from Luigi Tamagnone on 2023-04-07 10:18:16 UTC —
To be honest, I suggested more than one thing to the customer, and after the last advice they never came back.
But they certainly didn't configure hv_stimer for RHOSP because, as Artom wrote, the customer has no direct control over libvirt and/or the qemu command line.
So I cannot create a KCS about it for RHOSP. Maybe someone from sbr-virt hit the same issue and solved it with hv_stimer.
— Additional comment from Yanhui Ma on 2023-04-10 06:39:12 UTC —
(In reply to Luigi Tamagnone from comment #33)
> To be honest, I suggested more than one thing to the customer, and after the
> last advice, they never come back.
Thank you for the info. I have tried with 'hv_stimer'; with that flag, the bug can't be reproduced.
Considering there is no further request from the customer, should we close the bug first?
> But for sure they didn't configure hv_stimer for RHOSP because as Artom
> wrote the customer has no direct control over libvirt and/or the qemu
> command line.
> So I can not create a KCS about it for RHOSP. Maybe someone from sbr-virt
> hit the same issue and solved it with hv_stimer
— Additional comment from Germano Veit Michel on 2023-04-10 23:23:33 UTC —
(In reply to Yanhui Ma from comment #32)
> Hi Germano, Luigi and Kostiantyn,
>
> Since the customer portal case has been closed and hv_stimer is a
> recommended configuration, with the hv_stimer customer will not hit the
> issue,
> could we write a KCS for the bug and close the bug now?
I can do a KCS on this, no problem.
But why close the bug? AFAICT this is a real issue on its own, and hv_stimer is a configuration change (for the better) that also works around it.
Can't we just re-target it so it aligns with the upstream fix? Once we ship a version with the fix, we then close the bug.
And so far no need for Z-Streams.
Is this the upstream fix? https://www.mail-archive.com/qemu-devel@nongnu.org/msg924609.html
— Additional comment from Yanhui Ma on 2023-04-11 05:41:50 UTC —
(In reply to Germano Veit Michel from comment #35)
> (In reply to Yanhui Ma from comment #32)
> > Hi Germano, Luigi and Kostiantyn,
> >
> > Since the customer portal case has been closed and hv_stimer is a
> > recommended configuration, with the hv_stimer customer will not hit the
> > issue,
> > could we write a KCS for the bug and close the bug now?
>
> I can do a KCS on this, no problem.
>
Thank you for that.
> But why close the bug? AFAICT this is a real issue on its own, and hv_stimer
> is a configuration change (for the better) that also works around it.
> Can't we just re-target it to when it aligns with the upstream fix? Once we
> ship a version with the fix we then close the bug.
Yes, you are right and it makes sense for me.
The reason I previously asked to close it is that the customer doesn't have any more requests and the bug hasn't been updated for a long time.
Now that we have an upstream fix, we can keep it open.
> And so far no need for Z-Streams.
>
> Is this the upstream fix?
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg924609.html
— Additional comment from Kostiantyn Kostiuk on 2023-04-11 08:35:12 UTC —
I asked the maintainers about this problem upstream and have received no comments about the proper fix so far.
https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg06465.html
I am not sure that this patch is correct. I think that if time has gone backwards we should skip this iteration of the RTC loop.
This patch fixes the crash, but after the lost_clock calculation we adjust the RTC time according to lost_clock, which we don't need to do if lost_clock < 0.
https://patchew.org/QEMU/1670228615-2684-1-git-send-email-baiyw2@chinatelecom.cn/
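To make the idea above concrete, here is a minimal sketch of the "don't catch up when time went backwards" guard. It is an assumption about one possible shape of such a change, not the accepted upstream patch; whether to clamp as shown or to return early and skip the whole update is exactly the open question in this comment.
/* Sketch only: one possible guard, not the accepted upstream fix. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static int64_t lost_ticks_guarded(int64_t last_periodic_clock, int64_t cur_clock)
{
    int64_t lost_clock = cur_clock - last_periodic_clock;

    if (lost_clock < 0) {
        /* The reference clock stepped backwards (e.g. a host time sync);
         * treat it as "no ticks lost" instead of asserting, so no RTC
         * catch-up adjustment is applied for this update. */
        lost_clock = 0;
    }
    return lost_clock;
}

int main(void)
{
    /* Same backward-step values as in comment 29: instead of aborting,
     * the update simply sees 0 lost ticks. */
    printf("lost_clock=%" PRId64 "\n",
           lost_ticks_guarded(54439076430192LL, 54439076429884LL));
    return 0;
}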
— Additional comment from Luigi Tamagnone on 2023-04-13 08:30:33 UTC —
From the customer's perspective on case 03112469, I think we can close it; they didn't come back. But I agree with Germano.
— Additional comment from Parth Shah on 2023-05-16 12:48:33 UTC —
Hello Luigi, Germano, Yanhui
Does it make sense to document this as a Known Issue in the 8.8 release notes? Does this also affect 9.2? Should we also add it to the release notes of the previous RHEL versions?
Draft note -
.Windows virtual machines unexpectedly shut down when running scheduled tasks
Currently, on Windows virtual machines (VM), if you use the real-time clock (RTC) to schedule tasks, the VM unexpectedly shuts down. To work around this problem, use the `hv_stimer` enlightenment to schedule your tasks.
Thanks!
— Additional comment from Germano Veit Michel on 2023-05-16 22:17:56 UTC —
(In reply to Parth Shah from comment #39)
> Hello Luigi, Germano, Yanhui
>
> Does it make sense to document this as a Known Issue in the 8.8 release
> notes? Does this also affect 9.2? Should we also add it to the release notes
> of the previous RHEL versions?
>
>
> Draft note -
>
> .Windows virtual machines unexpectedly shut down when running scheduled tasks
> Currently, on Windows virtual machines (VM), if you use the real-time clock
> (RTC) to schedule tasks, the VM unexpectedly shuts down. To work around this
> problem, use the `hv_stimer` enlightenment to schedule your tasks.
>
>
> Thanks!
Hi Parth,
Probably not: this affects much older versions as well, and KCS 7007213 has so far a single customer case (the original one here).
I think it's more of a corner case that is not worth adding to the release notes; we don't have to document every open bug there.
IMHO, to get a release note it should be common to reproduce and somewhat severe; otherwise a KCS is enough.
Otherwise we may fill up the release notes with tons of issues that are hard to hit and get fixed in the first few batches anyway.
However, if we eventually reach the conclusion in this BZ that there is no way to fix this, then I think it deserves one.
— Additional comment from Parth Shah on 2023-05-17 14:10:56 UTC —
Got it! Thanks, Germano!
I'll remove the release note flag for now. If the situation changes I am happy to help document this in the release notes.