- Story
- Resolution: Won't Do
- Undefined
- None
- rhel-8.3.0
- Description of problem:
The edk2 SMM infrastructure's SMRAM footprint is not constant, it grows
with (minimally) VCPU count (bug 1447027, bug 1819292) and guest
physical address space size (mainly RAM, but also 64-bit PCI MMIO
aperture) (bug 1468526). As of QEMU v5.1.0-rc2, the default TSEG size
for the Q35 machine type is 16MB, which is sufficient for most setups,
but not all. When TSEG runs out, the SMRAM exhaustion is noted in a
firmware assertion, and OVMF hangs.
This can be prevented by sizing TSEG correctly in advance, via "-global
mch.extended-tseg-mbytes=...". Libvirt exposes this property (bug
1469338), but Dan made the point that it's not comfortable to use; the
calculation should be automated (in QEMU or, less likely, libvirt).
A simple formula remains elusive. I've made some measurements and will
attach a table.
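For reference, the manual workaround looks like the sketch below. The VCPU count, RAM size, TSEG value (48 MB), and pflash file names are illustrative only, not derived from the measurements:

```shell
# Illustrative only: pre-size TSEG on the QEMU command line so that the
# edk2 SMM setup does not exhaust SMRAM. All values here are made up.
qemu-system-x86_64 \
  -machine q35,accel=kvm,smm=on \
  -smp 256 \
  -m 1T \
  -global mch.extended-tseg-mbytes=48 \
  -drive if=pflash,unit=0,format=raw,readonly=on,file=OVMF_CODE.fd \
  -drive if=pflash,unit=1,format=raw,file=OVMF_VARS.fd
```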
- Version-Release number of selected component (if applicable):
- current upstream QEMU (v5.1.0-rc2-33-gfd3cd581f9dc)
- current upstream edk2 (e557442e3f7e)
- How reproducible:
- Always
- Steps to Reproduce:
- See any one of bug 1447027, bug 1468526, bug 1819292
- Actual results:
- OVMF boot hangs with various ASSERTs reporting SMRAM exhaustion
- Expected results:
- OVMF boot succeeds without manually changing the TSEG size on the QEMU
cmdline or in the libvirtd domain XML.
- Additional info:
(1) The attached table has six columns (all values are decimal, despite
being zero-padded on the left):
- column #1:
Sum of the X and Y coordinates in the test matrix.
The X coordinate is log2(VCPU count).
The Y coordinate is (log2(RAM size in bytes)-30).
Keeping the sum constant (that is, keeping the sum of the two powers
constant) identifies a diagonal in the matrix, running from bottom left
to top right, with the top left corner being (0, 0). Incrementing the
sum by 1 identifies the next diagonal. This allows for a diagonal-wise
traversal of the matrix, where each diagonal is considered more
"demanding" than the previous one.
This column is not relevant to the results; it's just how the tests
were run / organized.
- column #2: VCPU count
- column #3: guest RAM size in MB
- column #4: whether the "pdpe1gb" CPU feature flag was enabled or
disabled
- column #5:
The first whole power-of-two TSEG size (in MB) that enabled the guest to
boot; the smallest tried was 4MB, as 2MB is not sufficient for even
launching the edk2 SMM infrastructure, bare-bones.
- column #6:
Result (always "good" in the table; while running the tests, different
values could be there).
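Rows in this layout can be post-processed with standard tools. A minimal sketch (the sample rows below are fabricated; only the six-column layout matches the table): list the configurations whose required TSEG exceeds the 16MB default, i.e. the ones that would fail to boot out of the box:

```shell
# Filter for "good" cases whose smallest working TSEG (column #5)
# exceeds the 16MB Q35 default. The sample rows are made up.
awk '$6 == "good" && $5 + 0 > 16 {
       printf "vcpus=%d ram_mb=%d %s needs %d MB TSEG\n", $2, $3, $4, $5
     }' <<'EOF'
00 001 00001024 -pdpe1gb 0004 good
16 256 00262144 +pdpe1gb 0032 good
18 384 01572864 -pdpe1gb 0064 good
EOF
```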
(2) Test methodology:
A horrible script was written and used for generating the test cases in
the first place (basically columns #1 through #4 above). The VCPU count
would double from 1 up to 512 (2^9) and the RAM size would double from
1G (2^30) up to 16TB (2^44), both inclusive. I'm including the script
here.
> for ((VCPU_POW=0; VCPU_POW<10; VCPU_POW++)); do
> VCPUS=$((2**VCPU_POW))
> for ((MEM_POW=30; MEM_POW<45; MEM_POW++)); do
> MEM_MB=$((2**(MEM_POW-20)))
> for PDPE1GB in "-pdpe1gb" "+pdpe1gb"; do
> printf '%02u %03lu %08lu %s\n' \
> $((VCPU_POW+(MEM_POW-30))) $VCPUS $MEM_MB "$PDPE1GB"
> done
> done
> done \
> | sort -n
Another horrible (and I mean horrible) script was used to read back
the cases (one per line), and execute them one by one. I'm including it
here too. You've been warned.
> #!/bin/bash
> set -e -u -C
>
> QEMU=/opt/qemu-installed-optimized/bin/qemu-system-x86_64
> KERNEL=/boot/vmlinuz-3.10.0-1127.18.2.el7.x86_64
>
> while read DIST VCPUS MEM_MB PDPE1GB; do
> for ((TSEG_POW=2; TSEG_POW<11; TSEG_POW++)); do
> TSEG_MB=$(printf '%04u' $((2**TSEG_POW)))
> FWLOG=fwlog.$DIST.$VCPUS.$MEM_MB.$PDPE1GB.$TSEG_MB
> SERIAL=serial.$DIST.$VCPUS.$MEM_MB.$PDPE1GB.$TSEG_MB
> ERR=err.$DIST.$VCPUS.$MEM_MB.$PDPE1GB.$TSEG_MB
>
> rm -f -- "$FWLOG" "$SERIAL" "$ERR"
> if ! $QEMU \
> -nodefaults \
> -nographic \
> -machine q35,accel=kvm,smm=on,kernel-irqchip=split \
> -smp $((10#$VCPUS)) \
> -m $((10#$MEM_MB)) \
> -cpu host,"$PDPE1GB" \
> -global driver=cfi.pflash01,property=secure,value=on \
> -drive if=pflash,unit=0,format=raw,readonly=on,file=/root/tseg-table/OVMF_CODE.4m.3264.fd \
> -drive if=pflash,unit=1,format=raw,snapshot=on,file=/root/tseg-table/OVMF_VARS.4m.fd \
> -chardev file,id=debugfile,path=$FWLOG \
> -device isa-debugcon,iobase=0x402,chardev=debugfile \
> -global mch.extended-tseg-mbytes=$((10#$TSEG_MB)) \
> -chardev file,id=serial,path=$SERIAL \
> -serial chardev:serial \
> -device intel-iommu,intremap=on,eim=on \
> -kernel $KERNEL \
> -append "ignore_loglevel earlyprintk=ttyS0,115200n8 console=ttyS0,115200n8 efi=debug initcall_blacklist=fw_cfg_sysfs_init" \
> -pidfile qemu.pid \
> -daemonize \
> >"$ERR" 2>&1; then
> RESULT=startup-failed
> break
> fi
>
> QEMU_PID=$(< qemu.pid)
>
> RESULT=
> while [ -z "$RESULT" ]; do
> if egrep \
> -q -w \
> 'ASSERT|ASSERT_EFI_ERROR|ASSERT_RETURN_ERROR' \
> $FWLOG \
> 2>/dev/null; then
> RESULT=tseg-too-small
> elif grep -q "Kernel panic" $SERIAL 2>/dev/null; then
> RESULT=good
> else
> sleep 1
> fi
> done
>
> kill $QEMU_PID
> while kill -0 $QEMU_PID 2>/dev/null; do
> sleep 1
> done
>
> if [ good = "$RESULT" ]; then
> break;
> fi
> done
> echo "$DIST $VCPUS $MEM_MB $PDPE1GB $TSEG_MB $RESULT"
> done < cases.txt >> cases.out.txt
The result for a test case is "good" when the firmware boots and the
guest kernel boots via fw_cfg sufficiently to panic due to lack of a
rootfs (initrd). In this case the smallest found TSEG is saved.
The result is "tseg-too-small" if even 1GB of TSEG is not sufficient for
reaching a good result. (Never seen, nor expected, but I had to
terminate the loop somehow in that case too.)
The result is "startup-failed" if QEMU refused to start.
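The per-result tallies in the executor's output file can then be summarized with a one-liner like the sketch below (the input lines are fabricated, mirroring the "$DIST $VCPUS $MEM_MB $PDPE1GB $TSEG_MB $RESULT" format of the final echo):

```shell
# Count how many cases ended in each result category (column #6);
# the sample input lines are made up for illustration.
awk '{ count[$6]++ } END { for (r in count) print count[r], r }' <<'EOF' | sort -rn
00 001 00001024 -pdpe1gb 0004 good
00 001 00001024 +pdpe1gb 0004 good
09 512 00001024 -pdpe1gb 0004 startup-failed
EOF
```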
(3) Test execution
Running this script (in Beaker) was an exercise in frustration and
tedium.
(3a) When reaching VCPU count 256, QEMU wouldn't start up without
"-device intel-iommu,intremap=on,eim=on".
(3b) With large (~1TB) guest RAM sizes, the guest kernel would
consistently crash in the fw_cfg guest kernel driver (which is a
built-in driver, not a module). Hence the
"initcall_blacklist=fw_cfg_sysfs_init" kernel cmdline option. Initially
I filtered only for the rootfs-mount failure panic, not for arbitrary
panics, so a different panic would simply hang the script (in the
"sleep 1" loop). Yay.
(3c) When using large VCPU counts (>= 256 or thereabouts), the guest
kernel would occasionally hang before emitting anything at all to the
serial console (despite "earlyprintk"). This would again hang the script
(in the "sleep 1" loop).
This was completely random. Getting into "VCPU overcommit" territory
(>=160, which was the PCPU count on machine [2], see below) seemed,
statistically speaking, to contribute to the issue.
The above test script hangs were the reason why I generated the test
cases separately, and processed / executed them in a different script.
This way I could stop the executor script at any point, trim the list of
remaining test cases, and restart execution. An unwelcome janitorial
activity for 1-2 nights. The particular fun bit was when this occurred
with a VCPU count of 384, where (due to heavy VCPU overcommit) OVMF
itself took 30+ minutes to start up (with the fix for bug 1861718
incorporated).
(3d) Beaker woes.
I searched Beaker for such machines available to me (both
permission-wise and from the immediate reservation perspective) that had
48+ PCPUs. I'd then review the hits for combined RAM + disk size
(expecting to use most of the disk as extra swap).
For starters, machine [1], used by Eduardo for bug 1819292, is not
available to me in Beaker (no permission). Worse, its combined
RAM + disk size is smaller than that of machine [2], which was
available to me. So I used [2]. No other candidate machines even entered
the game.
Alas, machine [2] in turn does not permit installing RHEL8. You read
that right. I had to use RHEL7 because of this.
Using RHEL7 forced me to limit the max VCPU count of the test corpus to
384, from 512. (I edited the test case list manually – so that's why
you see 384 and not 512 as the highest VCPU count in column #2 of the
output too, in the attachment.)
Furthermore, although machine [2] had more RAM + disk combined than
machine [1], it still only sufficed for 1.5 TB of guest RAM. So that's
why you don't see >= 2TB values in column #3 of the table, only 1.5 TB
after 1 TB.
Whoever analyses the table will have to guesstrapolate from the
parameter space that was covered, to larger VCPU counts and RAM sizes.
(Because the TSEG defaults to 16MB, only the lines with 32MB report a
practical problem at once. My intent with starting the TSEG scaling at
the artificially lowered 4MB was to support extrapolation – the "high
end" is constrained due to host hardware limits, so I wanted to make the
"low end" finer-grained than seen in practice otherwise. I hope that
seeing actual 4MB and 8MB values that in practice would be masked by the
16MB default will also support determining a trend.)
Note: I'm not volunteering for implementing the QEMU (let alone
libvirtd) feature; I'm merely providing the data I managed to collect.