- Bug
- Resolution: Unresolved
- Major
- rhel-9.4.z
- Moderate
- ZStream
- rhel-sst-filesystems
- ssg_filesystems_storage_and_HA
- Red Hat Enterprise Linux
- x86_64
Let me explain a bit more. In the network trace uploaded by Olga (forNetapp.tgz), filter to the TCP session after the LIF migrate (Wireshark filter: "tcp.stream eq 2"). In this stream, 10.6.54.219 is the NFS client and 10.6.54.90 is the NFS server.
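For anyone repeating this analysis programmatically, here is a minimal sketch using pyshark (a Python wrapper around tshark) that applies the same display filter and prints the advertised receive window per packet. It assumes forNetapp.tgz has been extracted first; the capture filename trace.pcap is illustrative.

    # Minimal sketch: walk TCP stream 2 and print each packet's advertised
    # receive window. Assumes forNetapp.tgz has been extracted and the
    # capture is named trace.pcap (filename is illustrative).
    import pyshark

    CLIENT = "10.6.54.219"  # NFS client; 10.6.54.90 is the NFS server

    cap = pyshark.FileCapture("trace.pcap", display_filter="tcp.stream eq 2")
    for pkt in cap:
        direction = "client->server" if pkt.ip.src == CLIENT else "server->client"
        # tcp.window_size_value is the raw (unscaled) advertised window.
        print(f"{pkt.number:>6} {direction} win={pkt.tcp.window_size_value}")
    cap.close()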
All is well till:
14659 2024-10-14 17:07:13.390148 10.6.54.219 10.6.54.90 NFS 15994 V3 WRITE Call (Reply In 14938), FH: 0x348996c4 Offset: 1159528448 Len: 65536 UNSTABLE
Here the client's receive window is down to size 1. By the next write call it is full:
14666 2024-10-14 17:07:13.390226 10.6.54.219 10.6.54.90 NFS 15994 [TCP ZeroWindow] V3 WRITE Call (Reply In 14938), FH: 0x348996c4 Offset: 1200750592 Len: 65536 UNSTABLE
14674 2024-10-14 17:07:13.390302 10.6.54.219 10.6.54.90 NFS 15994 [TCP ZeroWindow] V3 WRITE Call (Reply In 14938), FH: 0x348996c4 Offset: 1159659520 Len: 65536 UNSTABLE
14679 2024-10-14 17:07:13.390367 10.6.54.219 10.6.54.90 NFS 17442 [TCP ZeroWindow] V3 WRITE Call (Reply In 14938), FH: 0x348996c4 Offset: 1200816128 Len: 65536 UNSTABLE
and it remains full, which means the server cannot respond to the write requests.
Client continues to pile on the writes:
14684 2024-10-14 17:07:13.390442 10.6.54.219 10.6.54.90 NFS 33370 [TCP ZeroWindow] V3 WRITE Call (Reply In 14938), FH: 0x348996c4 Offset: 1159725056 Len: 65536 UNSTABLE
...
14749 2024-10-14 17:07:13.391172 10.6.54.219 10.6.54.90 NFS 17442 [TCP ZeroWindow] V3 WRITE Call (Reply In 14938), FH: 0x348996c4 Offset: 1159921664 Len: 65536 UNSTABLE
...
14754 2024-10-14 17:07:13.391240 10.6.54.219 10.6.54.90 NFS 15994 [TCP ZeroWindow] V3 WRITE Call, FH: 0x348996c4 Offset: 1201209344 Len: 65536 UNSTABLE (the server has not responded to this or any subsequent writes; the client's TCP window remains 0)
...
14841 2024-10-14 17:07:13.392312 10.6.54.219 10.6.54.90 NFS 17442 [TCP ZeroWindow] V3 WRITE Call, FH: 0x348996c4 Offset: 1158610944 Len: 65536 UNSTABLE
14850 2024-10-14 17:07:13.392378 10.6.54.219 10.6.54.90 NFS 15994 [TCP ZeroWindow] V3 WRITE Call, FH: 0x348996c4 Offset: 1201733632 Len: 65536 UNSTABLE
... and more NFS writes
14917 2024-10-14 17:07:13.393254 10.6.54.219 10.6.54.90 NFS 33370 [TCP ZeroWindow] V3 WRITE Call, FH: 0x348996c4 Offset: 1202126848 Len: 65536 UNSTABLE
Till finally the client updates the recv window:
14929 2024-10-14 17:07:13.393323 10.6.54.219 10.6.54.90 TCP 66 [TCP Window Update] 891 → 2049 [ACK] Seq=33363876 Ack=85465 Win=14336 Len=0 TSval=2379135881 TSecr=1657201722
The server replies to many requests once the client's receive window is no longer 0 (NOTE: the server's own window size is not 0 at this point):
14938 2024-10-14 17:07:13.393405 10.6.54.90 10.6.54.219 NFS 3646 V3 WRITE Reply (Call In 14614) Len: 65536 FILE_SYNC
The client continues to write with a very small receive window:
14940 2024-10-14 17:07:13.393437 10.6.54.219 10.6.54.90 NFS 15994 V3 WRITE Call, FH: 0x348996c4 Offset: 1202192384 Len: 65536 UNSTABLE (win size 11)
...
and the client's window is again full by:
15119 2024-10-14 17:07:13.395705 10.6.54.219 10.6.54.90 NFS 25274 [TCP Window Full] V3 WRITE Call, FH: 0x348996c4 Offset: 1161822208 Len: 65536 UNSTABLE
Client updates the window again:
15124 2024-10-14 17:07:13.396079 10.6.54.219 10.6.54.90 TCP 66 [TCP Window Update] 891 → 2049 [ACK] Seq=35494476 Ack=89045 Win=27904 Len=0 TSval=2379135884 TSecr=1657201724
But now the server's window is full, and it is unable to respond as it did earlier when its window was not yet full:
15136 2024-10-14 17:07:13.641350 10.6.54.90 10.6.54.219 TCP 66 [TCP ZeroWindow] 2049 → 891 [ACK] Seq=89045 Ack=35494988 Win=0 Len=0 TSval=1657201970 TSecr=2379136089
This behavior drove the server's TCP stack into extreme flow control, causing it to stop responding even after the client's window size increased. This is what we are fixing.
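To make the mechanism concrete (the generic zero-window flow control, not the server bug itself), here is a self-contained sketch: a receiver with a tiny SO_RCVBUF stops reading, the sender fills the advertised window and stalls, and a window update after the receiver drains the socket lets the transfer resume. All names and sizes are illustrative.

    # Illustration of TCP zero-window flow control (generic mechanism only).
    import socket
    import threading
    import time

    def receiver(ready, port_holder):
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Shrink the receive buffer so the advertised window fills quickly.
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)
        srv.bind(("127.0.0.1", 0))
        srv.listen(1)
        port_holder.append(srv.getsockname()[1])
        ready.set()
        conn, _ = srv.accept()
        time.sleep(2)            # don't read: window drops to 0, sender stalls
        total = 0
        while True:              # now drain: window updates unblock the sender
            data = conn.recv(65536)
            if not data:
                break
            total += len(data)
        print(f"receiver drained {total} bytes")
        conn.close()
        srv.close()

    ready = threading.Event()
    port_holder = []
    t = threading.Thread(target=receiver, args=(ready, port_holder))
    t.start()
    ready.wait()

    cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Small send buffer too, so the sender cannot hide the stall locally.
    cli.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)
    cli.connect(("127.0.0.1", port_holder[0]))
    start = time.time()
    for _ in range(64):            # ~4 MiB, far more than the receiver's window
        cli.sendall(b"x" * 65536)  # blocks while the peer advertises win=0
    cli.close()
    print(f"sender finished after {time.time() - start:.1f}s (stalled on zero window)")
    t.join()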
The question I have is: why is the client's TCP window so small that it drops to 0, preventing the server from responding? What is preventing the client from reading the responses, causing the server's replies to pile up?
Even if we fix the server code, performance will still be very poor if the client does not read responses promptly.
Internally, in our testing we more often hit a different manifestation, where we do not see a SYN from the client after a RST from the server caused by a LIF move. However, we have managed to reproduce the scenario Olga is seeing (after a LIF move the client reconnects to the server, but after some amount of writes the server stops accepting writes and advertises a TCP zero window that never grows, so no further traffic flows from client to server), and we are actively triaging it; see the sketch below.
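As part of that triage, a small sketch (same assumptions as above: pyshark, and an illustrative trace.pcap extracted from the tarball) that measures how long each zero-window episode lasts per direction, to compare the client-side stalls that recover against a server-side one that never does:

    # Triage sketch: find zero-window episodes per direction in stream 2 and
    # report how long each one lasted. Capture filename is illustrative.
    import pyshark

    cap = pyshark.FileCapture("trace.pcap", display_filter="tcp.stream eq 2")
    zero_since = {}  # (src, dst) -> timestamp when the window first hit 0
    for pkt in cap:
        direction = (pkt.ip.src, pkt.ip.dst)
        win = int(pkt.tcp.window_size_value)
        ts = float(pkt.sniff_timestamp)
        if win == 0:
            zero_since.setdefault(direction, ts)
        elif direction in zero_since:
            dur = ts - zero_since.pop(direction)
            print(f"{direction[0]}: zero window for {dur * 1000:.1f} ms")
    # Any direction still present here never recovered (window stayed 0).
    for (src, _), ts in zero_since.items():
        print(f"{src}: zero window from {ts} until end of capture")
    cap.close()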
- clones: RHEL-60028 NFS client TLS sock_close hang (New)