Bug
Resolution: Done
Normal
rhel-9.6.z
Moderate
rhel-kernel-ft-plumbers-1
What were you trying to do that didn't work?
I have a case where Veritas relied on FOLL_HONOR_NUMA_FAULT being set in is_valid_gup_args() to ensure that GUP calls return accessible pages. This worked up through 9.4. With 785a8ecf0 (upstream 7acddcc1ae), the setting of FOLL_HONOR_NUMA_FAULT was removed, which resulted in an infinite loop in their vxfs module on 9.6. From them directly:
The infinite loop issue is observed in buffered write code path of VxFS. In VxFS, for buffered writes, user buffer pages are pre-faulted using get_user_pages_remote() and after that it copies user buffers pages into kernel pages using copy_from_user(). We think, currently issue is observing when numa balancing and marking page PROT_NONE is happening before call to get_user_pages(). This is the case where we think FOLL_HONOR_NUMA_FAULT is affecting. While there is another case, when numa balancing and marking page PROT_NONE happens after GUP call, this will not affect as the page will be held. We can able to repro hang issue using pin_user_pages() as well.
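The retry pattern the customer describes can be illustrated with a small userspace toy model. This is only a sketch under stated assumptions, not kernel or VxFS code: every `toy_*` name is hypothetical, the `honor_numa_fault` parameter merely stands in for FOLL_HONOR_NUMA_FAULT being set, and the model assumes the copy step cannot resolve the PROT_NONE state on its own (for example because it runs with page faults disabled), which the ticket does not state.

```c
#include <stdbool.h>

/* Toy stand-in for one user page; NOT a kernel structure. */
struct toy_page {
    bool prot_none; /* set by the (simulated) NUMA balancing scan */
};

/* Simulated NUMA hinting fault: makes the page accessible again. */
static void toy_handle_numa_fault(struct toy_page *p)
{
    p->prot_none = false;
}

/*
 * Simulated GUP pre-fault. With honor_numa_fault set, a PROT_NONE page
 * is faulted back in before it is returned. Without it, the page is
 * returned as-is and stays PROT_NONE.
 */
static void toy_gup_prefault(struct toy_page *p, bool honor_numa_fault)
{
    if (p->prot_none && honor_numa_fault)
        toy_handle_numa_fault(p);
}

/*
 * Simulated copy from the user page. ASSUMPTION: the copy cannot fault
 * the page in by itself, so it fails while the page is PROT_NONE.
 */
static bool toy_copy_from_user(const struct toy_page *p)
{
    return !p->prot_none;
}

/*
 * Hypothetical buffered-write retry loop: pre-fault, then copy, and
 * retry on failure. Returns the attempt number that succeeded, or -1
 * if max_attempts is exhausted (standing in for the infinite loop).
 */
int toy_buffered_write(bool honor_numa_fault, int max_attempts)
{
    /* NUMA balancing marked the page PROT_NONE before the GUP call. */
    struct toy_page page = { .prot_none = true };

    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        toy_gup_prefault(&page, honor_numa_fault);
        if (toy_copy_from_user(&page))
            return attempt;
    }
    return -1;
}
```

In this model, `toy_buffered_write(true, 1000)` succeeds on the first attempt, while `toy_buffered_write(false, 1000)` never makes progress and hits the attempt limit. Whether the real VxFS path behaves this way depends on details of their copy step that are not given in this ticket.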
This matter has already been investigated to some degree in an email chain with maintainers and in the case with the end customer; however, its complexity necessitates a full Jira ticket. The purpose of this ticket is not to troubleshoot third-party software but to facilitate communication between the end customer and engineering, to clarify the expected behavior of the more technically challenging parts of the Linux kernel, and to field potentially fruitful questions. Should the nature of this Jira ticket change as the matter becomes better understood, updates will be made accordingly.
What is the impact of this issue to you?
production outage when loop is hit
Please provide the package NVR for which the bug is seen:
I can work towards getting this once I forward the Jira ticket to the end customer
How reproducible is this bug?:
intermittent
Steps to reproduce
- details are in the customer quotation above
Expected results
the CPU raises a page fault, which the OS handles by faulting the page back in
Actual results
potentially an infinite loop