Type: Bug
Resolution: Done
Priority: Normal
Fix Version/s: None
Affects Version/s: rhel-9.6.z
Component/s: kernel / Core / Memory Management
Labels:
None

Regression:
No
Severity:
Moderate
AssignedTeam:
rhel-kernel-ft-plumbers-1

Story Points:
0
Blocked:
False
Ready:
False
Blocked Reason:
None
Product Documentation Required:
None
Sprint:
None

Preliminary Testing:
None
Test Coverage:
None

ProdDocsReview-CCS:
Unspecified
ProdDocsReview-Dev:
Unspecified
ProdDocsReview-QE:
Unspecified

Experience:
Architecture:

All

PX Impact Score:
SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Planning:
None

What were you trying to do that didn't work?

I have a case where Veritas relied on FOLL_HONOR_NUMA_FAULT being set in is_valid_gup_args() to ensure that GUP calls return accessible pages. This worked up through 9.4. Commit 785a8ecf0 (upstream 7acddcc1ae) removed the setting of FOLL_HONOR_NUMA_FAULT, which resulted in an infinite loop in their vxfs module on 9.6. From them directly:

The infinite loop is observed in the buffered write code path of VxFS. For buffered writes, VxFS pre-faults the user buffer pages using get_user_pages_remote() and then copies the user buffer pages into kernel pages using copy_from_user(). We think the issue currently occurs when NUMA balancing marks a page PROT_NONE before the call to get_user_pages(). This is the case we think FOLL_HONOR_NUMA_FAULT affects. In the other case, when NUMA balancing marks the page PROT_NONE after the GUP call, there is no effect, as the page will be held. We are able to reproduce the hang using pin_user_pages() as well.
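The retry behavior described above can be sketched as a small userspace model. This is not kernel or VxFS code: every toy_* name is hypothetical, and the model assumes that the copy step cannot service the page fault itself (for example, because it runs with page faults disabled) and that the caller simply retries the pre-fault on failure. Neither assumption is confirmed in this ticket.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: one user page and the state of its PTE. */
struct toy_page {
    bool prot_none; /* PTE marked PROT_NONE by NUMA balancing */
};

/*
 * Models a GUP-style pre-fault. When honor_numa_fault is set, a PROT_NONE
 * hinting PTE is treated as inaccessible, so the fault is taken and the
 * PTE becomes accessible again. When it is clear, the page is returned
 * as-is and the PTE stays PROT_NONE.
 */
static void toy_gup(struct toy_page *pg, bool honor_numa_fault)
{
    if (pg->prot_none && honor_numa_fault)
        pg->prot_none = false; /* hinting fault resolved */
}

/*
 * Models a user copy that cannot service faults (an assumption):
 * it fails whenever the PTE is still PROT_NONE.
 */
static bool toy_copy(const struct toy_page *pg)
{
    return !pg->prot_none;
}

/*
 * Pre-fault, then copy, retrying on failure. Returns the number of
 * attempts used, or -1 if max_tries is exhausted. The bound exists only
 * so the model terminates; an unbounded loop would spin forever.
 */
static int toy_buffered_write(struct toy_page *pg, bool honor_numa_fault,
                              int max_tries)
{
    assert(max_tries > 0);
    for (int attempt = 1; attempt <= max_tries; attempt++) {
        toy_gup(pg, honor_numa_fault);
        if (toy_copy(pg))
            return attempt;
    }
    return -1;
}
```

Under these assumptions, a page that starts out PROT_NONE completes on the first attempt when the flag is honored, and never completes when it is not, because nothing in the loop ever makes the PTE accessible again.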

This matter has already been investigated to some degree in an email chain with maintainers and in the case with the end customer; however, its complexity necessitates a full Jira ticket. The purpose of this ticket is not to troubleshoot third-party software but to facilitate communication between the end customer and engineering, clarify the expected behavior of the more technically challenging parts of the Linux kernel, and field potentially fruitful questions. Should the nature of this ticket change as the matter is better understood, updates will be made accordingly.

What is the impact of this issue to you?

Production outage when the loop is hit.

Please provide the package NVR for which the bug is seen:

I can work towards getting this once I forward the Jira ticket to the end customer.

How reproducible is this bug?:

intermittent

Steps to reproduce

Details are in the customer quotation above.

Expected results

The CPU raises a page fault, which the OS handles by faulting the page back in.

Actual results

potentially an infinite loop

Assignee: Kernel FT Plumbers Scrum Group

Reporter: Charles Haithcock

Developer: Kernel FT Plumbers Scrum Group

QA Contact: Kernel FT Plumbers QE group

Votes: 0

Watchers: 7

Created: 2025/08/27 6:00 PM

Updated: 2025/10/07 8:20 PM

Resolved: 2025/10/07 8:20 PM
