RHEL-7113

Different behavior when hotplugging DIMM memory into a guest depending on the access attribute defined, when an NVDIMM device is plugged


      Description of problem:
      Hotplugging DIMM memory into a guest behaves differently depending on the access attribute defined on the DIMM device, when an NVDIMM device is also plugged.

      Version-Release number of selected component (if applicable):
      libvirt-9.0.0-8.el9_2.x86_64
      qemu-kvm-7.2.0-11.el9_2.x86_64

      Guest version:
      os version: RHEL9.2
      kernel version: 5.14.0-284.el9.x86_64

      How reproducible:
      100%

      Steps to Reproduce:
      1. Create a 512M file
      truncate -s 512M /tmp/nvdimm

      2. Define and start a guest with memory, NUMA and NVDIMM configuration in the domain XML as below:
      <maxMemory slots='16' unit='KiB'>52428800</maxMemory>
      <memory unit='KiB'>2097152</memory>
      <currentMemory unit='KiB'>2097152</currentMemory>
      ...
      <numa>
        <cell id='0' cpus='0-1' memory='1048576' unit='KiB'/>
        <cell id='1' cpus='2-3' memory='1048576' unit='KiB'/>
      </numa>
      ...
      <memory model='nvdimm'>
        <source>
          <path>/tmp/nvdimm</path>
        </source>
        <target>
          <size unit='KiB'>524288</size>
          <node>1</node>
          <label>
            <size unit='KiB'>256</size>
          </label>
        </target>
      </memory>
      ...
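
      For reference, a minimal sketch of defining and starting the guest (the filename vm1.xml is assumed; the domain name vm1 matches the attach-device command used below):

      virsh define vm1.xml
      virsh start vm1
      virsh dumpxml vm1 | grep -A 10 "model='nvdimm'"   # confirm the NVDIMM is present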

      3. Check the guest memory
      [in guest]

      cat /proc/meminfo | grep MemTotal
        MemTotal: 1736156 kB

      4. Prepare a DIMM memory device config XML with the access attribute defined:

      cat memory1.xml
        <memory model='dimm' access='shared'> <!-- or access='private' -->
          <source>
            <pagesize unit='KiB'>4</pagesize>
          </source>
          <target>
            <size unit='KiB'>524288</size>
            <node>0</node>
          </target>
        </memory>

      5. Hot plug the DIMM memory device using the config XML from step 4

      virsh attach-device vm1 memory1.xml
        Device attached successfully
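
      (Optional) The guest-physical base address that QEMU assigned to the hotplugged DIMM can be inspected from the host, which helps to see the alignment issue below; a diagnostic sketch using QEMU's HMP "info memory-devices" command:

      virsh qemu-monitor-command vm1 --hmp 'info memory-devices'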

      6. Check the guest memory again; the guest memory has not increased.
      [in guest]

      cat /proc/meminfo | grep MemTotal
        MemTotal: 1736156 kB

      7. Check dmesg in the guest and find the related error
      [in guest]

      dmesg
        ...
        [ 198.482981] Block size [0x8000000] unaligned hotplug range: start 0x11ffc0000, size 0x20000000
        [ 198.483017] acpi PNP0C80:01: add_memory failed
        [ 198.485362] acpi PNP0C80:01: acpi_memory_enable_device() error
        [ 198.486377] acpi PNP0C80:01: Enumeration failure
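
      The error can be decoded with simple arithmetic: 0x8000000 is the 128 MiB memory block size, and the start address 0x11ffc0000 is not a multiple of it; the remainder shows the start sits 256 KiB short of a 128 MiB boundary, which matches the 256 KiB NVDIMM label size. A quick check (sketch):

      # remainder of the hotplug start address modulo the 128 MiB block size
      printf '0x%x\n' $(( 0x11ffc0000 % 0x8000000 ))   # -> 0x7fc0000 = 128 MiB - 256 KiB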

      8. If in step 4 the memory device is defined without the access attribute, like:

      cat memory1.xml
        <memory model='dimm'>
          <source>
            <pagesize unit='KiB'>4</pagesize>
          </source>
          .....

      Then in step 6 the guest memory increases:
      [in guest]

      cat /proc/meminfo | grep MemTotal
        MemTotal: 2260444 kB

      Actual results:
      Hotplugging a DIMM device into the guest behaves differently depending on the access attribute defined.

      Expected results:
      A DIMM device with access='shared' or access='private' defined should behave the same as a DIMM device with no access attribute defined.

      Additional info:
      Also checked other scenarios:
      Note: the NVDIMM guest area memory size is 524288 KiB - 256 KiB = 524032 KiB, which is not a multiple of 128 MiB.

      If the NVDIMM guest area memory (total size minus label size) is a multiple of 128 MiB, e.g. with the label size set to 0 (no label size defined), 128 MiB, 256 MiB or 384 MiB, then the DIMM device can be hotplugged successfully in the guest no matter how the access attribute is set.

      For a DIMM device with no access attribute defined, if the NVDIMM label size is set within [0, 2 MiB), [128 MiB, 130 MiB), [256 MiB, 258 MiB), etc., the DIMM device can be hotplugged successfully in the guest.

      So, per the info above, the behavior differs depending on which access attribute is defined.
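
      To reproduce the arithmetic above, a small sketch (128 MiB is the memory block size reported in the guest dmesg):

      # check whether the NVDIMM guest area (total size minus label size) is a
      # multiple of the 128 MiB memory block size
      total_kib=524288; label_kib=256
      guest_area_kib=$(( total_kib - label_kib ))
      if (( guest_area_kib % (128 * 1024) == 0 )); then
          echo "guest area ${guest_area_kib} KiB is 128 MiB aligned"
      else
          echo "guest area ${guest_area_kib} KiB is NOT 128 MiB aligned"
      fi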

            Comments:

            Jaroslav Suchanek added a comment -

            There is no demand for this and it has been kept open for a long time. Please reopen if needed.

            Michal Privoznik added a comment -

            Agreed. Let's close this. We can always reopen if needed.

            John Ferlan added a comment -

            mprivozn@redhat.com - I see no activity on this for an extended period of time - are you OK with closing this as WONTFIX? Additionally, if it's still important, perhaps create an upstream tracker.


            pm-rhel added a comment -

            Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.


            David Hildenbrand added a comment -

            (In reply to Michal Privoznik from comment #22)
            > (In reply to David Hildenbrand from comment #21)
            >
            > Spoiler alert: I know next to nothing about memory mgmt.
            >
            > > It's still sub-optimal, though. Hotplugging a 128 MiB DIMM first followed by
            > > a 256 MiB DIMM would unnecessarily create a hole ...
            >
            > Can you enlighten me please - why are holes bad? Is it because if a DIMM is
            > backed by a hugepage then it's wasteful?

            Because the GPA space will be fragmented. For Linux, this implies that certain operations, such as memory compaction, get more expensive, because Linux has to consider holes in memory zones and has to scan over these holes.

            Further, Linux cannot make use of that memory for larger allocations (such as gigantic pages). It's a secondary concern, though.

            > Also - how is this solved at real HW level? I mean, when I plug a DIMM into
            > a slot, it might too create a hole, couldn't it?

            I was told by Intel a while ago that real HW does not support hotplug of individual DIMMs, but only complete NUMA nodes. Holes between other nodes are less of a concern (in Linux, it's separate memory zones either way). So it's not really an issue on real HW.


            Michal Privoznik added a comment -

            (In reply to David Hildenbrand from comment #21)

            Spoiler alert: I know next to nothing about memory mgmt.

            > It's still sub-optimal, though. Hotplugging a 128 MiB DIMM first followed by
            > a 256 MiB DIMM would unnecessarily create a hole ...

            Can you enlighten me please - why are holes bad? Is it because if a DIMM is backed by a hugepage then it's wasteful?
            Also - how is this solved at real HW level? I mean, when I plug a DIMM into a slot, it might too create a hole, couldn't it?


            David Hildenbrand added a comment -

            (In reply to David Hildenbrand from comment #20)
            > (In reply to Michal Privoznik from comment #19)
            > > (In reply to David Hildenbrand from comment #17)
            > > > Getting that intended minimum alignment from the user is IMHO better than
            > > > hard-coding it in QEMU and having to deal with compat handling.
            > >
            > > But problem is whether user will know what value to put in. To sum up:
            > >
            > > QEMU knows what values are acceptable, but not which OS is running in the
            > > guest,
            > > libvirt does not know what value to pass, nor which OS is running in the
            > > guest,
            > > user does not know what value to pass, but it knows what OS is running in
            > > the guest.
            >
            > QEMU most certainly knows the least
            >
            > Again, the user already has to be aware of guest OS restrictions. While
            > hotplugging a 128 MiB DIMM to a VM running an arm64 Linux kernel with 4k
            > page size will work, it's unusable by an arm64 Linux kernel with a 64k page
            > size. Just like the minimum granularity, the alignment is guest-OS specific.
            >
            > >
            > > So I wonder whether we should:
            > > a) chose a reasonable default in QEMU, and possibly
            >
            > I'm afraid that will require compat machine changes.
            >
            > And there is no reasonable default for arm64, for example, without knowing
            > what's running inside the VM. Using an alignment of 512MiB just because the
            > guest could be running a 64k kernel fragments guest physical address space
            > when hotplugging 128 MiB DIMMs.

            BTW, I was playing with the idea of deciding the alignment based on the size.

            DIMM size is multiples of 128 MiB -> align to 128 MiB
            DIMM size is multiples of 256 MiB -> align to 256 MiB
            DIMM size is multiples of 512 MiB -> align to 512 MiB

            It's still sub-optimal, though. Hotplugging a 128 MiB DIMM first followed by a 256 MiB DIMM would unnecessarily create a hole ...
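
            For illustration only, a shell sketch of that size-based idea (a hypothetical helper, not an existing QEMU or libvirt option):

            # pick the largest of 512/256/128 MiB that evenly divides the DIMM size
            align_for_dimm_size() {
                local size=$1 a
                for a in $(( 512 << 20 )) $(( 256 << 20 )) $(( 128 << 20 )); do
                    if (( size % a == 0 )); then echo "$a"; return; fi
                done
                echo $(( 128 << 20 ))   # assumed fallback for sizes that are not a multiple of 128 MiB
            }
            align_for_dimm_size $(( 256 << 20 ))   # -> 268435456 (align to 256 MiB)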


            David Hildenbrand added a comment -

            (In reply to Michal Privoznik from comment #19)
            > (In reply to David Hildenbrand from comment #17)
            > > Getting that intended minimum alignment from the user is IMHO better than
            > > hard-coding it in QEMU and having to deal with compat handling.
            >
            > But problem is whether user will know what value to put in. To sum up:
            >
            > QEMU knows what values are acceptable, but not which OS is running in the
            > guest,
            > libvirt does not know what value to pass, nor which OS is running in the
            > guest,
            > user does not know what value to pass, but it knows what OS is running in
            > the guest.

            QEMU most certainly knows the least

            Again, the user already has to be aware of guest OS restrictions. While hotplugging a 128 MiB DIMM to a VM running an arm64 Linux kernel with 4k page size will work, it's unusable by an arm64 Linux kernel with a 64k page size. Just like the minimum granularity, the alignment is guest-OS specific.

            >
            > So I wonder whether we should:
            > a) chose a reasonable default in QEMU, and possibly

            I'm afraid that will require compat machine changes.

            And there is no reasonable default for arm64, for example, without knowing what's running inside the VM. Using an alignment of 512MiB just because the guest could be running a 64k kernel fragments guest physical address space when hotplugging 128 MiB DIMMs.


            Michal Privoznik added a comment -

            (In reply to David Hildenbrand from comment #17)
            > Getting that intended minimum alignment from the user is IMHO better than
            > hard-coding it in QEMU and having to deal with compat handling.

            But the problem is whether the user will know what value to put in. To sum up:

            QEMU knows what values are acceptable, but not which OS is running in the guest;
            libvirt does not know what value to pass, nor which OS is running in the guest;
            the user does not know what value to pass, but does know which OS is running in the guest.

            So I wonder whether we should:
            a) choose a reasonable default in QEMU, and possibly
            b) offer users a way to tweak the alignment.


            David Hildenbrand added a comment -

            (In reply to Igor Mammedov from comment #16)
            > QEMU already reserves 1G of GPA per device, so why not align every one on 1G
            > border (without adding any new options)?

            We only do that on x86 so far IIRC, and only for memory devices that require an ACPI slot (we don't know how many other devices we might have). The underlying reason IIRC, was to handle memory backends with gigantic pages that require a certain alignment in GPA. So on x86 we could eventually align only such devices (DIMMs/NVDIMMs) to 1 GiB without further changes. For everything else, we could break existing setups eventually and would require some compat handling (I recall that any such gpa layout changes might require compat handling, but at least libvirt should be able to deal with that). A user option won't require gluing that to compat machines.

            Aligning all DIMMs to 1 GiB is also not really desired IMHO. If you hotplug multiple smaller DIMMs (< 1 GiB, which apparently users do for Kata and such), you'd get quite a lot of (large) GPA holes in between, implying that PFN walkers (like compaction) inside the VM get more expensive (i.e., zones not contiguous) and that such memory can never get used for larger contiguous allocations (such as gigantic pages).

            Ideally, we don't get any holes, even when hotplugging DIMMs that are any multiples of 128 MiB (on x86), which is the common case and only doesn't work because NVDIMMs do weird stuff with the labels. But that 128 MiB alignment is both guest and arch specific.

            Getting that intended minimum alignment from the user is IMHO better than hard-coding it in QEMU and having to deal with compat handling.


              mprivozn@redhat.com Michal Privoznik
              lcong@redhat.com Liang Cong