
Poor throughput over high latency TCP connection when recv_buf_size is configured

    • Type: Bug
    • Resolution: Done
    • Priority: Minor
    • Fix Version/s: 5.1
    • Affects Version/s: 5.0.0.Final
    • Component/s: None

      I included a test program based on the SimpleChat JGroups example. (I am not a Java developer, so please excuse any idiosyncrasies in the code...)

      • Create two physically distant Linux servers. I used two newly built CentOS 8 Linodes, one in Fremont, CA, and the other in Newark, NJ. Ping time between the servers is ~65 milliseconds.
      • Configure net.core.rmem_max and net.core.wmem_max to something large, such as 32 MB:
        sudo sysctl -w net.core.rmem_max=33554432 net.core.wmem_max=33554432
      • Copy the following files to both servers:
        • SpeedTest.class
        • jgroups-5.0.0.Final.jar
        • tcp-5.0.0.xml (a copy of tcp.xml in the jgroups-5.0.0.Final.jar)
      • Configure send_buf_size and recv_buf_size in tcp-5.0.0.xml:
         <TCP...
            send_buf_size="33554432"
            recv_buf_size="33554432"/>
      • Run SpeedTest on both machines and wait for them to connect:
        java -Djgroups.tcpping.initial_hosts=jgroups-west[7800],jgroups-east[7800] -cp jgroups-5.0.0.Final.jar:. SpeedTest
      • On either machine, enter the command "send" or "recv" to have that machine send (or receive) 16 MB and output the estimated throughput in bytes/sec. One of these directions will be significantly slower than the other, and it corresponds to data sent from the client (connect side) to the server (listen side).
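
      For reference, the attached SpeedTest.java is the real test program; a rough, simplified sketch of this kind of SimpleChat-style test (class name and chunk size are illustrative, not the attached code) looks like:

      import org.jgroups.BytesMessage;
      import org.jgroups.JChannel;
      import org.jgroups.Message;
      import org.jgroups.Receiver;

      // Illustrative sketch only; the attached SpeedTest.java is the program used in the runs below.
      public class SpeedTestSketch implements Receiver {
          static final int TOTAL = 16 * 1024 * 1024;   // 16 MB, as in the example runs
          static final int CHUNK = 64 * 1024;          // arbitrary chunk size
          long received;

          public void receive(Message msg) {
              received += msg.getLength();             // count payload bytes as they arrive
          }

          public static void main(String[] args) throws Exception {
              SpeedTestSketch test = new SpeedTestSketch();
              JChannel ch = new JChannel("tcp-5.0.0.xml");  // the edited TCP config
              ch.setReceiver(test);
              ch.connect("SpeedTestCluster");

              long start = System.nanoTime();
              byte[] chunk = new byte[CHUNK];
              for (int sent = 0; sent < TOTAL; sent += CHUNK)
                  ch.send(new BytesMessage(null, chunk)); // null destination = all members
              double secs = (System.nanoTime() - start) / 1e9;
              System.out.printf("Sent %d bytes at %.0f bytes/sec%n", TOTAL, TOTAL / secs);
              ch.close();
          }
      }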

      Example:

      [jgroups@jgroups-west ~]$ java -Djgroups.tcpping.initial_hosts=jgroups-west[7800],jgroups-east[7800] -cp jgroups-5.0.0.Final.jar:. SpeedTest
      Sep 29, 2020 6:16:59 PM org.jgroups.JChannel setAddress
      INFO: local_addr: d1845247-6de6-d80a-14cd-78524a0925fe, name: jgroups-west-24449
      
      -------------------------------------------------------------------
      GMS: address=jgroups-west-24449, cluster=SpeedTestCluster, physical address=45.79.68.10:7800
      -------------------------------------------------------------------
      ** view: [jgroups-east-34095|1] (2) [jgroups-east-34095, jgroups-west-24449]
      
      === NOTE - This instance is currently the connect() side ===
      
      > send
      Sending...
      > Sent 16777216 bytes at 2699498 bytes/sec
      recv
      Receiving...
      > Received 16777216 bytes at 15127927 bytes/sec
      
      === NOTE - Stopped and restarted the remote side ===
      
      > ** view: [jgroups-west-24449|2] (1) [jgroups-west-24449]
      ** view: [jgroups-west-24449|3] (2) [jgroups-west-24449, jgroups-east-47558]
      
      === NOTE - This instance is now the listen() side ===
      
      > send
      Sending...
      > Sent 16777216 bytes at 14863557 bytes/sec
      recv
      Receiving...
      > Received 16777216 bytes at 2626508 bytes/sec
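
      Working backward from the numbers above, and treating the ~65 ms ping time as the round-trip time, the slow direction is consistent with a small effective receive window (a rough bandwidth-delay estimate, not a measurement):

      // Effective window ~= throughput * RTT (bytes in flight per round trip).
      public class WindowEstimate {
          public static void main(String[] args) {
              double rttSec = 0.065;                   // ~65 ms ping time between the servers
              double slow = 2_699_498;                 // bytes/sec, connect side -> listen side
              double fast = 15_127_927;                // bytes/sec, listen side -> connect side
              System.out.printf("slow direction: ~%.0f KB in flight%n", slow * rttSec / 1024);
              System.out.printf("fast direction: ~%.0f KB in flight%n", fast * rttSec / 1024);
              // ~171 KB vs ~960 KB: the slow direction is capped near a default-sized
              // receive buffer instead of the configured 32 MB.
          }
      }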

      I recently finished troubleshooting a unidirectional throughput bottleneck involving a JGroups application (Infinispan) communicating over a high-latency (~45 milliseconds) TCP connection.

      The root cause was JGroups configuring the receive buffer on the listening socket too late. According to the tcp(7) man page:

      On individual connections, the socket buffer size must be set prior to
      the listen(2) or connect(2) calls in order to have it take effect.
      

      However, JGroups does not set the buffer size on the listening side until after accept().

      The result is poor throughput when sending data from the client (connecting side) to the server (listening side). Because the effective TCP receive window stays too small, throughput is ultimately latency-bound.
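
      For illustration, the ordering the man page describes looks like this with a plain ServerSocket (a sketch of the pattern only, not the JGroups TcpServer code; the port and buffer size are the ones from the test above):

      import java.net.InetSocketAddress;
      import java.net.ServerSocket;
      import java.net.Socket;

      public class ListenSideBuffers {
          public static void main(String[] args) throws Exception {
              int bufSize = 32 * 1024 * 1024;

              ServerSocket srv = new ServerSocket();   // create unbound
              srv.setReceiveBufferSize(bufSize);       // must happen BEFORE bind()/listen(), per tcp(7)
              srv.bind(new InetSocketAddress(7800));

              Socket conn = srv.accept();
              // Accepted sockets inherit SO_RCVBUF from the listening socket.
              // SO_SNDBUF, by contrast, can still be set here; setting it after
              // accept() is honored (at least on Linux, per the discussion below).
              conn.setSendBufferSize(bufSize);
              // ... exchange data over conn ...
              conn.close();
              srv.close();
          }
      }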

        1. SpeedTest.java
          4 kB
        2. rcvbuf.png
          72 kB
        3. delay-ip.sh
          0.6 kB
        4. bla7.java
          2 kB
        5. bla6.java
          2 kB
        6. bla5.java
          1 kB

            [JGRP-2504] Poor throughput over high latency TCP connection when recv_buf_size is configured

            Bela Ban added a comment -

            OK, so setting SO_RCVBUF works now, good to know. JGroups also sets SO_SNDBUF on sockets, but always before calling connect() (client side). Also, setting SO_SNDBUF on a socket received as a result of calling accept() apparently works. Besides, as discussed, there is no way of setting this on a ServerSocket in Java...
            Thanks for your detailed analysis; always great to work with experts in the field!
            Note to self: I should make myself familiar with the Linux networking code, all the more since I have kernel guys in my company that I can ask for advice!
            Cheers,


            Andrew Skalski (Inactive) added a comment -

            net.core.rmem_max is the limit on how large an SO_RCVBUF an application may request. If the application requests a larger buffer, it is clamped to the value specified in net.core.rmem_max.
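
            A quick way to see the clamping from Java (class name and requested size are illustrative; the getter returns what the kernel actually granted):

            import java.net.Socket;

            public class RcvbufClamp {
                public static void main(String[] args) throws Exception {
                    try (Socket s = new Socket()) {          // unconnected socket is enough
                        s.setReceiveBufferSize(32 * 1024 * 1024);
                        // If 32 MB exceeds net.core.rmem_max, the reported size is the
                        // clamped value (Linux also doubles the request for bookkeeping).
                        System.out.println("effective SO_RCVBUF: " + s.getReceiveBufferSize());
                    }
                }
            }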

            Buffer size and window size are related but distinct.  The buffer size is how much kernel memory is allocated to the socket; the receive window (a number advertised in the TCP header, indicating how much more data it is willing to receive at a given moment in time) depends on the buffer size and other factors.

            Setting SO_SNDBUF on a listening socket works without issue in C.  Here is strace output of the "iperf3" utility setting up its listening socket with a requested buffer size of 123456.  The setsockopt calls all return 0 (success):

            socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
            setsockopt(3, SOL_SOCKET, SO_RCVBUF, [123456], 4) = 0
            setsockopt(3, SOL_SOCKET, SO_SNDBUF, [123456], 4) = 0
            setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
            setsockopt(3, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
            bind(3, {sa_family=AF_INET6, sin6_port=htons(7800), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
            listen(3, 5)                            = 0
            

            But based on what I've been able to piece together, it's completely OK – at least with Linux – to wait until after accepting a connection to configure SO_SNDBUF.  Considering that Java's ServerSocket lacks a setSendBufferSize method, I think it's safe to assume that it's OK on other platforms as well.


            Bela Ban added a comment -

            OK, so this means setting SO_RCVBUF replaces net.core.rmem_max? And the initial buffer size is the default size set in sysctl.conf?

            Speaking of buffers: I assume buffer size == TCP recv/send window size?

            Last point: setting SO_SNDBUF on a server socket in Java is not supported:

            srv_sock.setOption(StandardSocketOptions.SO_SNDBUF, send_buf_size) throws an exception. I haven't been in C land for quite a while, but do you know if this is supported in C?
            Cheers,
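
            (For reference, the options the JDK exposes on a server socket can be listed directly on JDK 9+; SO_SNDBUF is not among them, which is why the send buffer can only be set on the connecting socket or on the socket returned by accept(). Class name is illustrative:)

            import java.net.ServerSocket;
            import java.net.Socket;

            public class SupportedOptions {
                public static void main(String[] args) throws Exception {
                    try (ServerSocket srv = new ServerSocket(); Socket sock = new Socket()) {
                        System.out.println("ServerSocket: " + srv.supportedOptions()); // no SO_SNDBUF here
                        System.out.println("Socket:       " + sock.supportedOptions()); // SO_SNDBUF present
                    }
                }
            }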


            Andrew Skalski (Inactive) added a comment -

            I tested TCP and TCP_NIO2, and the throughput issue is now resolved. Thank you for looking into this!

            OK, so I read your comment. I understand socket options are inherited, but what sense does SO_SNDBUF make here? This option cannot be set on a ServerSocket anyway (perhaps this can be done in C?), only SO_RCVBUF, so does this mean the send-buffer cannot be set on a socket returned by accept()?

            That part of the documentation didn't make sense to me either.  Reading through the Linux sources and verifying empirically, the SO_SNDBUF option is definitely being honored, even when it is configured on the accepted socket.

            Also:

            If the application explicitly configures SO_RCVBUF, this automatic management is disabled

            Does this mean, the value of SO_RCVBUF is the max value a receive window can have?

            Yes.  SO_RCVBUF controls the buffer size, which in turn limits the receive window.  (The receive window grows/shrinks over time in response to quantity of data received, congestion and packet loss, the rate at which the receiving application consumes the data, etc.)  To illustrate, I configured both sites with SO_RCVBUF=200000 and graphed the window size over time during a send/receive test.


            Bela Ban added a comment -

            More Info: http://diag.ddns.net/reports/TCPWindows.html

            Bela Ban added a comment -

            OK, so I read your comment. I understand socket options are inherited, but what sense does SO_SNDBUF make here? This option cannot be set on a ServerSocket anyway (perhaps this can be done in C?), only SO_RCVBUF, so does this mean the send-buffer cannot be set on a socket returned by accept()?

            Also:

            If the application explicitly configures SO_RCVBUF, this automatic management is disabled

            Does this mean, the value of SO_RCVBUF is the max value a receive window can have?


            Bela Ban added a comment -

            I also changed the Jira URL; this will show up in the next update of the website.


            Bela Ban added a comment -

            OK, it should work now, try it out and let me know.

            Cheers,


            Bela Ban added a comment -

            Taking me a little longer; I'm taking the opportunity for some refactoring.


            Bela Ban added a comment -

            I'll reply in detail later, but I know what the problem is: I thought I was using TCP.recv_buf_size (which is set correctly), but instead I used TCP.TcpServer.recv_buf_size which is set after bind(). I'll fix this today.
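
            For the NIO-based transport, the same ordering can be expressed with a ServerSocketChannel; this is a sketch of the pattern only (not the actual TcpServer/NioServer code), using the buffer size from the report:

            import java.net.InetSocketAddress;
            import java.net.StandardSocketOptions;
            import java.nio.channels.ServerSocketChannel;
            import java.nio.channels.SocketChannel;

            public class NioListenSideBuffers {
                public static void main(String[] args) throws Exception {
                    ServerSocketChannel srv = ServerSocketChannel.open();
                    srv.setOption(StandardSocketOptions.SO_RCVBUF, 32 * 1024 * 1024); // before bind()
                    srv.bind(new InetSocketAddress(7800));

                    SocketChannel conn = srv.accept();  // inherits the large SO_RCVBUF
                    conn.setOption(StandardSocketOptions.SO_SNDBUF, 32 * 1024 * 1024); // fine after accept()
                    // ... exchange data over conn ...
                    conn.close();
                    srv.close();
                }
            }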


              Assignee: rhn-engineering-bban Bela Ban
              Reporter: g-41394b97-fafa-4748-a281-6f88e12c80fa Andrew Skalski (Inactive)
              Votes: 0
              Watchers: 2

                Created:
                Updated:
                Resolved: