RHEL-95934

Stalled KVM async page fault resolution


    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • rhel-9.6
    • None
    • No
    • Important
    • 6
    • rhel-virt-cloud
    • ssg_virtualization
    • 0
    • False
    • False
    • None
    • RHELOPC Sprint 49, RHELOPC Sprint 50, Virt-Cloud-Core Sprint 51, Virt-Cloud-Core Sprint 52, Virt-Cloud-Core Sprint 53, Virt-Cloud-Core Parking Lot
    • None
    • None
    • Unspecified
    • Unspecified
    • Unspecified
    • x86_64
    • None

      What were you trying to do that didn't work?

      In a KVM guest we are seeing user-space threads getting hung indefinitely on async page faults:

       

      Jun 03 19:45:03 kernel: INFO: task elasticsearch[e:7030 blocked for more than 1228 seconds.
      Jun 03 19:45:03 kernel:       Not tainted 5.14.0-570.19.1.el9_6.x86_64 #1
      Jun 03 19:45:03 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Jun 03 19:45:03 kernel: task:elasticsearch[e state:D stack:0     pid:7030  tgid:2707  ppid:1340   flags:0x00000002
      Jun 03 19:45:03 kernel: Call Trace:
      Jun 03 19:45:03 kernel:  <TASK>
      Jun 03 19:45:03 kernel:  __schedule+0x229/0x4a0
      Jun 03 19:45:03 kernel:  schedule+0x2e/0xb0
      Jun 03 19:45:03 kernel:  kvm_async_pf_task_wait_schedule+0xf3/0x180
      Jun 03 19:45:03 kernel:  ? __count_memcg_events+0x4f/0xb0
      Jun 03 19:45:03 kernel:  __kvm_handle_async_pf+0x53/0xb0
      Jun 03 19:45:03 kernel:  exc_page_fault+0x7d/0x150
      Jun 03 19:45:03 kernel:  asm_exc_page_fault+0x22/0x30
      Jun 03 19:45:03 kernel: RIP: 0033:0x7faaafc94fa0
      Jun 03 19:45:03 kernel: RSP: 002b:00007fa8e820f340 EFLAGS: 00010216
      Jun 03 19:45:03 kernel: RAX: 00007e3ffac80000 RBX: 0000000039a00000 RCX: 0000000000005400
      Jun 03 19:45:03 kernel: RDX: 0000000000010000 RSI: 00000003cd542e80 RDI: 00007e3ffac00000
      Jun 03 19:45:03 kernel: RBP: 00007fa8e820f340 R08: 000000000000abe8 R09: 00007e3ffac00000
      Jun 03 19:45:03 kernel: R10: 00007faaafc97a00 R11: 0000000000000000 R12: 0000000000000000
      Jun 03 19:45:03 kernel: R13: 00007faaa1000000 R14: 00000003cd542e70 R15: 00007fa868001dc0
      Jun 03 19:45:03 kernel:  </TASK>
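
When triaging many reports like the one above, it can help to pull the hung-task entries out of the journal automatically and flag those whose trace passes through `kvm_async_pf_task_wait_schedule`. The following is a minimal sketch, assuming the log lines have the same shape as the excerpt in this report; the names `HungTask` and `parse_hung_tasks` are hypothetical helpers, not part of any existing tool. Frames prefixed with `?` are skipped because the kernel marks them as unreliable.

```python
import re
from dataclasses import dataclass, field

# First line of a hung-task report, e.g.
# "INFO: task elasticsearch[e:7030 blocked for more than 1228 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# A call-trace frame such as "schedule+0x2e/0xb0", optionally prefixed by "? ".
FRAME_RE = re.compile(
    r"(?P<unreliable>\? )?(?P<sym>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)
ASYNC_PF_WAIT = "kvm_async_pf_task_wait_schedule"


@dataclass
class HungTask:
    comm: str
    pid: int
    seconds: int
    frames: list = field(default_factory=list)

    @property
    def stuck_on_async_pf(self):
        # True when a reliable frame shows the task waiting on an async page fault.
        return ASYNC_PF_WAIT in self.frames


def parse_hung_tasks(lines):
    """Collect hung-task reports and their reliable call-trace frames."""
    tasks = []
    current = None
    for line in lines:
        header = HUNG_RE.search(line)
        if header:
            current = HungTask(header["comm"], int(header["pid"]), int(header["secs"]))
            tasks.append(current)
            continue
        if current is None:
            continue
        # The register dump and the closing marker end the call trace.
        if "RIP:" in line or "</TASK>" in line:
            current = None
            continue
        frame = FRAME_RE.search(line)
        if frame and not frame["unreliable"]:
            current.frames.append(frame["sym"])
    return tasks
```

One possible use is to feed it the output of `journalctl -k` line by line and list every task for which `stuck_on_async_pf` is true, together with how long it has been blocked.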

       

      What is the impact of this issue to you?

      High impact: any user-space process in the guest can hang indefinitely in this way.

      Please provide the package NVR for which the bug is seen:

      Both the host and the guest are RHEL 9.6 with kernel-5.14.0-570.19.1.el9_6.x86_64
      Also seen in a guest with kernel-5.14.0-503.35.1.el9_5.x86_64

      How reproducible is this bug?:

      Always in the customer's environment under high load. Not reproduced locally.

      Steps to reproduce

      1.  
      2.  
      3.  

      Expected results

      No hung threads.

      Actual results

      Hung threads on async page faults.
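
To confirm the actual result on a live guest without waiting for the hung-task watchdog, one option is to look for threads in state D whose kernel stack contains `kvm_async_pf_task_wait_schedule`. The sketch below assumes that `/proc/<pid>/task/<tid>/stack` is available (it depends on the kernel configuration and normally requires root) and that the symbol appears there as it does in the trace above; `find_async_pf_waiters` and its `proc_root` parameter are hypothetical names introduced for illustration.

```python
from pathlib import Path

ASYNC_PF_WAIT = "kvm_async_pf_task_wait_schedule"


def thread_state(stat_text):
    """Return the state letter from the contents of a /proc stat file."""
    # The comm field is parenthesised and may itself contain spaces or ')',
    # so take everything after the last ')' before splitting.
    return stat_text.rsplit(")", 1)[1].split()[0]


def find_async_pf_waiters(proc_root="/proc"):
    """List (pid, tid, comm) for D-state threads waiting on an async page fault."""
    waiters = []
    for task_dir in Path(proc_root).glob("[0-9]*/task/[0-9]*"):
        try:
            if thread_state((task_dir / "stat").read_text()) != "D":
                continue
            stack = (task_dir / "stack").read_text()
            comm = (task_dir / "comm").read_text().strip()
        except (OSError, IndexError):
            # The thread exited, or the file is missing or unreadable.
            continue
        if ASYNC_PF_WAIT in stack:
            pid = int(task_dir.parts[-3])
            tid = int(task_dir.name)
            waiters.append((pid, tid, comm))
    return sorted(waiters)
```

A thread that shows up in repeated runs several minutes apart would match the stall described in this report, whereas a thread that appears only once is more likely a normal, short-lived async page fault wait.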

              mlevitsk Maxim Levitsky
              rhn-support-jortialc Juan Orti
              virt-maint virt-maint
              virt-bugs virt-bugs