Infinite loop in gup calls on PROT_NONE pages migrated due to numa balancing

      What were you trying to do that didn't work?

      I have a case where Veritas relied on FOLL_HONOR_NUMA_FAULT being set in is_valid_gup_args() to ensure that GUP calls return accessible pages. This worked up through RHEL 9.4. Commit 785a8ecf0 (upstream 7acddcc1ae) removed the setting of FOLL_HONOR_NUMA_FAULT, which resulted in an infinite loop in their vxfs module on RHEL 9.6. From them directly:

      The infinite loop is observed in the buffered write code path of VxFS. For buffered writes, VxFS pre-faults the user buffer pages using get_user_pages_remote() and then copies the user buffer pages into kernel pages using copy_from_user(). We think the issue currently occurs when NUMA balancing marks a page PROT_NONE before the call to get_user_pages(). This is the case that we think FOLL_HONOR_NUMA_FAULT affects. There is another case, in which NUMA balancing marks the page PROT_NONE after the GUP call; this case is not affected because the page is already held. We are able to reproduce the hang using pin_user_pages() as well.
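
      The retry pattern described above can be illustrated with a toy userspace model. This is only a sketch of one plausible reading of the report, not VxFS or kernel code: every name in it (toy_page, toy_gup, toy_copy, toy_buffered_write) is hypothetical, and it assumes that the copy step cannot service the NUMA hinting fault itself, which the ticket does not confirm. Under that assumption, a GUP that does not honor the hinting fault keeps returning the same PROT_NONE page, so the retry loop never makes progress.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names are kernel or VxFS APIs. */
struct toy_page {
    bool prot_none; /* true once NUMA balancing marked the mapping PROT_NONE */
};

/*
 * Models a GUP call. With honor_numa_fault set, the hinting fault is
 * resolved before the page is returned, so the page becomes accessible.
 * Without it, the page is returned as-is. The call succeeds either way.
 */
static bool toy_gup(struct toy_page *pg, bool honor_numa_fault)
{
    if (pg->prot_none && honor_numa_fault)
        pg->prot_none = false;
    return true;
}

/*
 * Models a copy that cannot service faults (an assumption of this sketch):
 * it fails whenever the page is still PROT_NONE.
 */
static bool toy_copy(const struct toy_page *pg)
{
    return !pg->prot_none;
}

/*
 * Models the pre-fault-then-copy retry loop. Returns the number of
 * iterations needed, or -1 if max_iters is reached without progress
 * (standing in for the infinite loop).
 */
static int toy_buffered_write(struct toy_page *pg, bool honor_numa_fault,
                              int max_iters)
{
    assert(max_iters > 0);
    for (int i = 1; i <= max_iters; i++) {
        toy_gup(pg, honor_numa_fault); /* pre-fault the user page */
        if (toy_copy(pg))              /* copy succeeded, write proceeds */
            return i;
        /* copy failed: retry from the pre-fault step */
    }
    return -1;
}
```

      Calling toy_buffered_write() on a page with prot_none set finishes on the first iteration when honor_numa_fault is true and exhausts max_iters when it is false. The iteration cap exists only so the model terminates.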

      This matter has already been investigated to some degree in an email chain with the maintainers and in the case with the end customer; however, its complexity warrants a full Jira ticket. The purpose of this ticket is not to troubleshoot third-party software but to facilitate communication between the end customer and engineering, to clarify the expected behavior of the more technically challenging parts of the Linux kernel, and to field potentially fruitful questions. Should the nature of this ticket change as the matter is better understood, updates will be made accordingly.

      What is the impact of this issue to you?

      production outage when the loop is hit

      Please provide the package NVR for which the bug is seen:

      I can work towards getting this once I forward the Jira ticket to the end customer

      How reproducible is this bug?:

      intermittent

      Steps to reproduce

      1. See the details in the customer's description quoted above

      Expected results

      the CPU throws a page fault for the OS to handle and refault it in

      Actual results

      potentially an infinite loop

              kernel-ft-plumbers Kernel FT Plumbers Scrum Group
              rhn-support-chaithco Charles Haithcock
              Kernel FT Plumbers Scrum Group Kernel FT Plumbers Scrum Group
              Kernel FT Plumbers QE group Kernel FT Plumbers QE group