RHEL-7480: automated TSEG size calculation

    • sst_virtualization
    • ssg_virtualization
    • Enhancement

      • Description of problem:

      The edk2 SMM infrastructure's SMRAM footprint is not constant, it grows
      with (minimally) VCPU count (bug 1447027, bug 1819292) and guest
      physical address space size (mainly RAM, but also 64-bit PCI MMIO
      aperture) (bug 1468526). As of QEMU v5.1.0-rc2, the default TSEG size
      for the Q35 machine type is 16MB, which is sufficient for most setups,
      but not all. When TSEG runs out, the SMRAM exhaustion is noted in a
      firmware assertion, and OVMF hangs.

      This can be prevented by sizing TSEG correctly in advance, via "-global
      mch.extended-tseg-mbytes=...". Libvirt exposes this property (bug
      1469338), but Dan made the point that it's not comfortable to use;
      we should automate the calculation (in QEMU or, less likely, libvirt).

      A simple formula remains elusive. I've made some measurements and will
      attach a table.
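
      Purely to illustrate the kind of automation being requested, a naive
      sizing heuristic might look like the sketch below. The constants (8MB
      base, 1MB per 16 VCPUs, 1MB per 64GB of RAM) are made-up placeholders,
      not values derived from the attached table:

```shell
# Hypothetical sketch of an automatic TSEG size guess. BASE_MB,
# VCPUS_PER_MB and GB_PER_MB are made-up placeholder constants, NOT
# values derived from the measurements attached to this ticket.
guess_tseg_mb() {
    local vcpus=$1 ram_gb=$2
    local BASE_MB=8 VCPUS_PER_MB=16 GB_PER_MB=64
    local need=$((BASE_MB + (vcpus + VCPUS_PER_MB - 1) / VCPUS_PER_MB + (ram_gb + GB_PER_MB - 1) / GB_PER_MB))
    # Round up to a whole power of two, matching the granularity in
    # which the table records TSEG sizes; 4MB is the smallest tried.
    local tseg=4
    while ((tseg < need)); do tseg=$((tseg * 2)); done
    echo "$tseg"
}

guess_tseg_mb 1 1       # prints 16
guess_tseg_mb 384 1536  # prints 64
```

      Whatever formula ends up in QEMU would have to be validated against the
      actual measurements, which is exactly what the attached table is for.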

      • Version-Release number of selected component (if applicable):
      • current upstream QEMU (v5.1.0-rc2-33-gfd3cd581f9dc)
      • current upstream edk2 (e557442e3f7e)
      • How reproducible:
      • Always
      • Steps to Reproduce:
      • See any one of bug 1447027, bug 1468526, bug 1819292
      • Actual results:
      • OVMF boot hangs with various ASSERTs reporting SMRAM exhaustion
      • Expected results:
      • OVMF boot succeeds without manually changing the TSEG size on the QEMU
        cmdline or in the libvirtd domain XML.
      • Additional info:

      (1) The attached table has six columns (all values decimal, despite
      being zero-padded on the left):

      • column #1:

      Sum of the X and Y coordinates in the test matrix.

      The X coordinate is log2(VCPU count).

      The Y coordinate is (log2(RAM size in bytes)-30).

      Keeping the sum constant (that is, keeping the sum of powers constant),
      a diagonal in the matrix is identified (left bottom to top right, with
      the top left corner being (0, 0)). Incrementing the sum by 1, the next
      diagonal is identified. This allows for a diagonal-wise traversal of the
      matrix, where the next diagonal is considered more "demanding" than the
      previous one.

      This column is not relevant to the results; it's just how the tests
      were run / organized.

      • column #2: VCPU count
      • column #3: guest RAM size in MB
      • column #4: whether the "pdpe1gb" CPU feature flag was enabled or
        disabled
      • column #5:

      The first whole power-of-two TSEG size (in MB) that enabled the guest to
      boot; the smallest tried was 4MB, as 2MB is not sufficient for even
      launching the edk2 SMM infrastructure, bare-bones.

      • column #6:

      Result (always "good" in the table; while running the tests, different
      values could be there).
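
      The diagonal-wise traversal described under column #1 can be sketched
      as follows (a toy illustration of the ordering only, limited to the
      first three diagonals):

```shell
# Illustration of the diagonal-wise matrix traversal: for each constant
# coordinate sum, visit the (X, Y) cells on that diagonal, where X is
# log2(VCPU count) and Y is (log2(RAM size in bytes) - 30).
diagonals() {
    local sum x
    for ((sum = 0; sum <= 2; sum++)); do
        for ((x = 0; x <= sum; x++)); do
            printf '(%u,%u) ' "$x" "$((sum - x))"
        done
        echo
    done
}
diagonals
# (0,0)
# (0,1) (1,0)
# (0,2) (1,1) (2,0)
```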

      (2) Test methodology:

      A horrible script was written and used for generating the test cases in
      the first place (basically columns #1 through #4 above). The VCPU count
      would double from 1 up to 512 (2^9) and the RAM size would double from
      1G (2^30) up to 16TB (2^44), both inclusive. I'm including the script
      here.

      > for ((VCPU_POW=0; VCPU_POW<10; VCPU_POW++)); do
      > VCPUS=$((2**VCPU_POW))
      > for ((MEM_POW=30; MEM_POW<45; MEM_POW++)); do
      > MEM_MB=$((2**(MEM_POW-20)))
      > for PDPE1GB in "-pdpe1gb" "+pdpe1gb"; do
      > printf '%02u %03lu %08lu %s\n' \
      > $((VCPU_POW+(MEM_POW-30))) $VCPUS $MEM_MB "$PDPE1GB"
      > done
      > done
      > done \
      > | sort -n
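
      As a sanity check on the generator (its loop bounds, not its output
      formatting), the matrix above implies 10 * 15 * 2 = 300 test cases in
      cases.txt:

```shell
# Count the cases produced by the generator's three nested loops:
# 10 VCPU powers (2^0..2^9), 15 RAM powers (2^30..2^44 bytes), and
# 2 pdpe1gb settings.
COUNT=0
for ((VCPU_POW = 0; VCPU_POW < 10; VCPU_POW++)); do
    for ((MEM_POW = 30; MEM_POW < 45; MEM_POW++)); do
        for PDPE1GB in "-pdpe1gb" "+pdpe1gb"; do
            COUNT=$((COUNT + 1))
        done
    done
done
echo "$COUNT"   # prints 300
```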

      Another horrible (and I mean horrible) script was used to read back
      the cases (one per line), and execute them one by one. I'm including it
      here too. You've been warned.

      > #!/bin/bash
      > set -e -u -C
      >
      > QEMU=/opt/qemu-installed-optimized/bin/qemu-system-x86_64
      > KERNEL=/boot/vmlinuz-3.10.0-1127.18.2.el7.x86_64
      >
      > while read DIST VCPUS MEM_MB PDPE1GB; do
      > for ((TSEG_POW=2; TSEG_POW<11; TSEG_POW++)); do
      > TSEG_MB=$(printf '%04u' $((2**TSEG_POW)))
      > FWLOG=fwlog.$DIST.$VCPUS.$MEM_MB.$PDPE1GB.$TSEG_MB
      > SERIAL=serial.$DIST.$VCPUS.$MEM_MB.$PDPE1GB.$TSEG_MB
      > ERR=err.$DIST.$VCPUS.$MEM_MB.$PDPE1GB.$TSEG_MB
      >
      > rm -f -- "$FWLOG" "$SERIAL" "$ERR"
      > if ! $QEMU \
      > -nodefaults \
      > -nographic \
      > -machine q35,accel=kvm,smm=on,kernel-irqchip=split \
      > -smp $((10#$VCPUS)) \
      > -m $((10#$MEM_MB)) \
      > -cpu host,"$PDPE1GB" \
      > -global driver=cfi.pflash01,property=secure,value=on \
      > -drive if=pflash,unit=0,format=raw,readonly=on,file=/root/tseg-table/OVMF_CODE.4m.3264.fd \
      > -drive if=pflash,unit=1,format=raw,snapshot=on,file=/root/tseg-table/OVMF_VARS.4m.fd \
      > -chardev file,id=debugfile,path=$FWLOG \
      > -device isa-debugcon,iobase=0x402,chardev=debugfile \
      > -global mch.extended-tseg-mbytes=$((10#$TSEG_MB)) \
      > -chardev file,id=serial,path=$SERIAL \
      > -serial chardev:serial \
      > -device intel-iommu,intremap=on,eim=on \
      > -kernel $KERNEL \
      > -append "ignore_loglevel earlyprintk=ttyS0,115200n8 console=ttyS0,115200n8 efi=debug initcall_blacklist=fw_cfg_sysfs_init" \
      > -pidfile qemu.pid \
      > -daemonize \
      > >"$ERR" 2>&1; then
      > RESULT=startup-failed
      > break
      > fi
      >
      > QEMU_PID=$(< qemu.pid)
      >
      > RESULT=
      > while [ -z "$RESULT" ]; do
      > if egrep \
      > -q -w \
      > 'ASSERT|ASSERT_EFI_ERROR|ASSERT_RETURN_ERROR' \
      > $FWLOG \
      > 2>/dev/null; then
      > RESULT=tseg-too-small
      > elif grep -q "Kernel panic" $SERIAL 2>/dev/null; then
      > RESULT=good
      > else
      > sleep 1
      > fi
      > done
      >
      > kill $QEMU_PID
      > while kill -0 $QEMU_PID 2>/dev/null; do
      > sleep 1
      > done
      >
      > if [ good = "$RESULT" ]; then
      > break;
      > fi
      > done
      > echo "$DIST $VCPUS $MEM_MB $PDPE1GB $TSEG_MB $RESULT"
      > done < cases.txt >> cases.out.txt

      The result for a test case is "good" when the firmware boots and the
      guest kernel boots via fw_cfg sufficiently to panic due to lack of a
      rootfs (initrd). In this case the smallest found TSEG is saved.

      The result is "tseg-too-small" if even 1GB of TSEG is not sufficient for
      reaching a good result. (Never seen, nor expected, but I had to
      terminate the loop somehow in that case too.)

      The result is "startup-failed" if QEMU refused to start up.
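
      Assuming cases.out.txt in the "DIST VCPUS MEM_MB PDPE1GB TSEG_MB
      RESULT" format that the executor's final echo emits, a quick summary of
      the worst-case (largest) first-good TSEG per VCPU count could be
      produced like this (a sketch; the two sample lines are made-up
      placeholders, not data from the real attachment):

```shell
# Sketch: report the largest "first good" TSEG size per VCPU count from
# lines in the "DIST VCPUS MEM_MB PDPE1GB TSEG_MB RESULT" format that
# the executor script appends to cases.out.txt.
summarize() {
    awk '$6 == "good" {
             vcpus = $2 + 0; tseg = $5 + 0     # strip the zero padding
             if (tseg > max[vcpus]) max[vcpus] = tseg
         }
         END {
             for (v in max)
                 printf "%u VCPUs: worst-case TSEG %u MB\n", v, max[v]
         }'
}

# Two made-up sample lines, NOT data from the real attachment:
printf '%s\n' \
    '00 001 001024 -pdpe1gb 0008 good' \
    '01 001 002048 +pdpe1gb 0016 good' \
    | summarize    # prints "1 VCPUs: worst-case TSEG 16 MB"
```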

      (3) Test execution

      Running this script (in Beaker) was an exercise in frustration and
      tedium.

      (3a) When reaching VCPU count 256, QEMU wouldn't start up without
      "-device intel-iommu,intremap=on,eim=on".

      (3b) With large (~1TB) guest RAM sizes, the guest kernel would
      consistently crash in the fw_cfg guest kernel driver (which is a
      built-in driver, not a module). Hence the
      "initcall_blacklist=fw_cfg_sysfs_init" kernel cmdline option. Initially
      I filtered only for the rootfs mount failure panic, not for arbitrary
      panics, so different panics would simply hang the script (in the
      "sleep 1" loop). Yay.

      (3c) When using large VCPU counts (>= 256 or thereabouts), the guest
      kernel would occasionally hang before emitting anything at all to the
      serial console (despite "earlyprintk"). This would again hang the script
      (in the "sleep 1" loop).

      This was completely random. Getting into "VCPU overcommit" territory
      (>=160, which was the PCPU count on machine [2], see below) seemed to
      contribute to the issue, statistically speaking.

      The above test script hangs were the reason why I generated the test
      cases separately, and processed / executed them in a different script.
      This way I could stop the executor script at any point, trim the list of
      remaining test cases, and restart execution. An unwelcome janitorial
      activity for 1-2 nights. The particular fun bit was when this occurred
      with a VCPU count of 384, where (due to heavy VCPU overcommit) OVMF
      itself took 30+ minutes to start up (with the fix for bug 1861718
      incorporated).

      (3d) Beaker woes.

      I searched Beaker for such machines available to me (both
      permission-wise and from the immediate reservation perspective) that had
      48+ PCPUs. I'd then review the hits for combined RAM + disk size
      (expecting to use most of the disk as extra swap).

      For starters, machine [1], used by Eduardo for bug 1819292, is not
      available to me in Beaker (no permission). Worse than that, its combined
      RAM + disk size is smaller than that of machine [2], which was
      available to me. So I used [2]. No other candidate machines even entered
      the game.

      Alas, machine [2] in turn does not permit installing RHEL8. You read
      that right. I had to use RHEL7 because of this.

      Using RHEL7 forced me to limit the max VCPU count of the test corpus to
      384, from 512. (I edited the test case list manually – so that's why
      you see 384 and not 512 as the highest VCPU count in column #2 of the
      output too, in the attachment.)

      Furthermore, although machine [2] had more RAM + disk combined than
      machine [1], it still only sufficed for 1.5 TB of guest RAM. So that's
      why you don't see >= 2TB values in column #3 of the table, only 1.5 TB
      after 1 TB.

      Whoever analyses the table will have to guesstrapolate from the
      parameter space that was covered, to larger VCPU counts and RAM sizes.

      (Because the TSEG defaults to 16MB, only the lines with 32MB report a
      practical problem at once. My intent with starting the TSEG scaling at
      the artificially lowered 4MB was to support extrapolation – the "high
      end" is constrained due to host hardware limits, so I wanted to make the
      "low end" finer-grained than seen in practice otherwise. I hope that
      seeing actual 4MB and 8MB values that in practice would be masked by the
      16MB default will also support determining a trend.)

      Note: I'm not volunteering for implementing the QEMU (let alone
      libvirtd) feature; I'm merely providing the data I managed to collect.

            Igor Mammedov
            Laszlo Ersek
            Xueqiang Wei