RHEL-135143

[RHEL10.2] Guest kernel crashes due to memory error injection


    • Yes
    • Important
    • 1
    • rhel-virt-hwe-arm-1
    • 23
    • 27
    • None
    • QE ack, Dev ack
    • False
    • False
    • No
    • Split items
    • Unspecified Release Note Type - Unknown
    • Unspecified
    • Unspecified
    • Unspecified
    • aarch64
    • None
    • Merge Request passes all submitter checks, Merge Request finished CI testing, Merge Request passed CI testing, Merge Request approved by peer review

      What were you trying to do that didn't work?

      The latest RHEL 10.2 guest kernel crashes when a memory error is injected into guest memory from the host.

      What is the impact of this issue to you?

      Memory RAS is broken for the guest: an injected memory error crashes the guest kernel instead of being handled.

      Please provide the package NVR for which the bug is seen:

      host: 6.12.0-170.el10.aarch64
      guest: 6.12.0-170.el10.aarch64
      qemu: qemu-kvm-10.1.0-5.el10

      How reproducible is this bug?:

      Steps to reproduce

      1. Provision the host and the guest, both running kernel 6.12.0-170.el10.aarch64
      2. On the guest, build the 'victim' binary
        guest$ git clone git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
        guest$ cd mce-test/tools; make; make install
        guest$ cp ../bin/victim ~/victim
      3. On the host, build the 'test' binary (source code attached)
        host$ gcc test.c -o test; cp test ~/test
      4. Start the guest with 4GB of memory and one NUMA node
        /home/gavin/sandbox/qemu.rhel/build/qemu-system-aarch64 \
        -accel kvm -machine virt-rhel10.2.0,gic-version=host,nvdimm=on,ras=on \
        -cpu host -smp maxcpus=8,cpus=8,sockets=2,clusters=2,cores=2,threads=1 \
        -m 4096M,slots=16,maxmem=128G \
        -object memory-backend-ram,id=mem0,size=4096M \
        -numa node,nodeid=0,cpus=0-7,memdev=mem0 \
        -L /home/gavin/sandbox/qemu.rhel/build/pc-bios \
        -monitor none -serial mon:stdio -nographic -gdb tcp::6666 \
        -qmp tcp:localhost:5555,server,wait=off \
        -bios /home/gavin/sandbox/qemu.rhel/build/pc-bios/edk2-aarch64-code.fd \
        -boot c \
        -device pcie-root-port,bus=pcie.0,chassis=1,id=pcie.1 \
        -device pcie-root-port,bus=pcie.0,chassis=2,id=pcie.2 \
        -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=drive0 \
        -device virtio-blk-pci,id=virtblk0,bus=pcie.1,drive=drive0,num-queues=4 \
        -netdev tap,id=tap1,vhost=true,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
        -device virtio-net-pci,bus=pcie.2,netdev=tap1,mac=52:54:00:f1:26:b0
      5. On the guest, execute '~/victim -d'
        guest$ ~/victim -d
        physical address of (0xffff96a1e000) = 0x126002000
        Hit any key to trigger error: 
      6. On the host, execute '~/test 0x126002000'
        host$ ~/test 0x126002000
      7. On the guest, press the Enter key to let 'victim' continue. The guest kernel then
        crashes, and the following kernel log is found under '/var/crash/xxxx'.
        [  209.148986] Unable to handle kernel write to read-only memory at virtual address ffff800080065008
        [  209.148991] Mem abort info:
        [  209.148992]   ESR = 0x000000009600004f
        [  209.148993]   EC = 0x25: DABT (current EL), IL = 32 bits
        [  209.148995]   SET = 0, FnV = 0
        [  209.148996]   EA = 0, S1PTW = 0
        [  209.148996]   FSC = 0x0f: level 3 permission fault
        [  209.148997] Data abort info:
        [  209.148998]   ISV = 0, ISS = 0x0000004f, ISS2 = 0x00000000
        [  209.148999]   CM = 0, WnR = 1, TnD = 0, TagAccess = 0
        [  209.149000]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
        [  209.149001] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000dbcde000
        [  209.149003] [ffff800080065008] pgd=10000001001d9403, p4d=10000001001d9403, pud=10000001001da403, pmd=10000001001db403, pte=006000013c750f83
        [  209.149007] Internal error: Oops: 000000009600004f [#1]  SMP
        [  209.149010] Modules linked in: rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables vfat fat nfit libnvdimm fuse loop vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vmw_vmci vsock xfs nvme_tcp nvme_fabrics nvme_core nvme_keyring nvme_auth crct10dif_ce virtio_net ghash_ce sha2_ce sha256_arm64 net_failover sha1_ce failover virtio_blk dm_mirror dm_region_hash dm_log dm_mod nfnetlink
        [  209.149040] CPU: 3 UID: 0 PID: 1857 Comm: victim Kdump: loaded Not tainted 6.12.0-170.el10.aarch64 #1 PREEMPT(voluntary) 
        [  209.149043] Hardware name: Red Hat KVM, BIOS edk2-stable202408-prebuilt.qemu.org 08/13/2024
        [  209.149044] pstate: 604001c5 (nZCv dAIF +PAN UAO -TCO -DIT -SSBS BTYPE=-)
        [  209.149046] pc : acpi_os_write_memory+0x130/0x1a0
        [  209.149052] lr : acpi_os_write_memory+0x2c/0x1a0
        [  209.149054] sp : ffff80008866bc50
        [  209.149055] x29: ffff80008866bc50 x28: ffff0000c8cac440 x27: 00000000000000c4
        [  209.149057] x26: ffffc0d6d64d9298 x25: ffffc0d6d48a7688 x24: ffff800080695018
        [  209.149059] x23: ffff80008866bd14 x22: 0000000000000008 x21: 0000000000000040
        [  209.149061] x20: 0000000000000001 x19: 000000013c750008 x18: 0000000000000000
        [  209.149063] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
        [  209.149064] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
        [  209.149067] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffc0d6d3f94cdc
        [  209.149068] x8 : 0000000000000020 x7 : 000000013c750008 x6 : ffffc0d6d6138fc0
        [  209.149070] x5 : 000000013c751000 x4 : 0000000000000008 x3 : ffff0000c0c50960
        [  209.149072] x2 : 0000000000000040 x1 : ffff0000c8cac440 x0 : ffff800080065008
        [  209.149074] Call trace:
        [  209.149075]  acpi_os_write_memory+0x130/0x1a0 (P)
        [  209.149078]  apei_write+0xcc/0xe8
        [  209.149082]  ghes_clear_estatus.part.0+0xc8/0xe0
        [  209.149084]  ghes_in_nmi_queue_one_entry+0x1e4/0x330
        [  209.149086]  ghes_notify_sea+0x60/0x110
        [  209.149088]  apei_claim_sea+0xa4/0x310
        [  209.149090]  do_sea+0xa8/0xd0
        [  209.149093]  do_mem_abort+0x48/0xa0
        [  209.149095]  el0_da+0x48/0x160
        [  209.149099]  el0t_64_sync_handler+0xd0/0xf0
        [  209.149101]  el0t_64_sync+0x1ac/0x1b0
        [  209.149104] Code: 17ffffeb 710102bf 54000341 d50332bf (f9000014) 
        [  209.149107] SMP: stopping secondary CPUs
        [  209.149774] Starting crashdump kernel...
        [  209.149775] Bye!
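
The fault fields in the log above can be cross-checked by hand. The following is a minimal sketch, not part of the original report, that decodes the ESR value and the leaf PTE printed in the oops. The bit positions follow the Arm architecture definitions for ESR_ELx and for a 4K-granule stage 1 page descriptor; the helper names (bits, decode_esr, decode_pte) are hypothetical and chosen only for illustration.

```python
# Decode the fault information printed in the guest oops.
# Constants are copied from the log; helper names are hypothetical.

ESR = 0x9600004F              # "ESR = 0x000000009600004f"
PTE = 0x006000013C750F83      # "pte=006000013c750f83"
FAULT_VA = 0xFFFF800080065008  # faulting virtual address
X19 = 0x13C750008             # register x19 from the dump


def bits(value, hi, lo):
    """Return the inclusive bit range value[hi:lo]."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_esr(esr):
    """Split an ESR_ELx value into the fields the kernel prints."""
    return {
        "EC": bits(esr, 31, 26),   # exception class
        "IL": bits(esr, 25, 25),   # 1 = 32-bit instruction
        "ISS": bits(esr, 24, 0),   # instruction specific syndrome
        "WnR": bits(esr, 6, 6),    # 1 = fault caused by a write
        "FSC": bits(esr, 5, 0),    # fault status code
    }


def decode_pte(pte):
    """Decode a 4K-granule, 48-bit stage 1 page descriptor."""
    return {
        "valid_page": bits(pte, 1, 0) == 0b11,
        "AP1_el0_access": bits(pte, 6, 6),
        "AP2_read_only": bits(pte, 7, 7),
        "AF": bits(pte, 10, 10),
        "PXN": bits(pte, 53, 53),
        "UXN": bits(pte, 54, 54),
        "output_address": pte & 0x0000FFFFFFFFF000,  # bits [47:12]
    }


esr = decode_esr(ESR)
pte = decode_pte(PTE)
# Physical address behind the faulting VA: page frame plus page offset.
target_pa = pte["output_address"] | (FAULT_VA & 0xFFF)

print({k: hex(v) for k, v in esr.items()})
print({k: (hex(v) if k == "output_address" else v) for k, v in pte.items()})
print("target PA:", hex(target_pa), "matches x19:", target_pa == X19)
```

The decoded values agree with the log: EC = 0x25 (data abort at the current EL), WnR = 1 (a write), and FSC = 0x0f (level 3 permission fault). The PTE has AP[2] set, meaning the page is mapped read-only, which is consistent with the "write to read-only memory" message. The physical address behind the faulting VA is 0x13c750008, the same value held in x19 and x7 at the time of the fault.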

      Expected results

      The injected memory error is detected by QEMU and reported to the guest kernel without crashing the guest kernel.

      Actual results

      The injected memory error is detected by QEMU, but the guest kernel crashes while handling the reported error.

              rh-ee-gshan Guowen Shan
              rh-ee-gshan Guowen Shan
              Guowen Shan
              virt-maint virt-maint
              Julia Graham Julia Graham