JGroups / JGRP-1609

Poor performance of TCP/TCPGOSSIP when a node fails


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix version: 3.3
    • Affects versions: 3.1, 3.3
    • Component: None

      1. Update tcp.xml to use TCPGOSSIP instead of TCPPING and MERGE3 instead of MERGE2.
      2. Run MPerf with 3 nodes.
      3. In the middle of the test, shut down the network interface on C with "ifdown eth0".
      4. Observe the view change; the new view contains only A and B.
      5. Shortly afterwards, performance becomes very poor.
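Step 1 amounts to a protocol swap along these lines (a sketch only; attribute values such as the GossipRouter host and the merge intervals are placeholders, not the reporter's actual settings):

```xml
<!-- tcp.xml excerpt: TCPPING replaced by TCPGOSSIP, MERGE2 by MERGE3 -->
<TCP bind_port="7800"/>
<!-- initial_hosts points at the GossipRouter; host below is a placeholder -->
<TCPGOSSIP initial_hosts="gossiprouter-host[12001]"/>
<MERGE3 min_interval="10000" max_interval="30000"/>
```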


      I have a TCP transport with TCPGOSSIP for discovery. The configuration is identical to tcp.xml except that it uses TCPGOSSIP instead of TCPPING and MERGE3 instead of MERGE2.
      When I run MPerf with this stack on nodes A, B, and C and, in the middle of the test, shut down the network interface of a node, say C, then C is removed from the view after the FD timeout, but subsequent performance is very poor: tens of KB/sec rather than tens to hundreds of MB/sec.

      What I observed is that the TransferQueueBundler in the TP keeps trying to connect to node C on every multicast and times out.

      When I disable bundling (with 3.1), the MPerf sender thread runs into the same condition on every multicast.

      Logically, once the view has changed, nodes A and B should continue to perform at the same rate as before C's network interface was shut down.
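The stall can be sketched as follows (a hypothetical minimal model, not JGroups code): a multicast over TCP fans out into per-member unicasts, and a blocking connect to the dead member delays the whole loop, so every multicast pays the connect timeout until the dead member is forgotten. The address used below is an unroutable TEST-NET-3 address chosen to demonstrate the bounded stall.

```java
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;

public class BundlerSketch {
    // Sketch of a bundler-style fan-out: members are contacted
    // sequentially, so one unreachable member stalls every send.
    static void sendToAll(List<InetSocketAddress> members, byte[] payload) {
        for (InetSocketAddress member : members) {
            try (Socket sock = new Socket()) {
                // Without an explicit timeout, connect() to an unreachable
                // host can block for the OS SYN-retry interval (often far
                // longer than 1 second); here the attempt is bounded.
                sock.connect(member, 1000);
                sock.getOutputStream().write(payload);
            }
            catch (Exception e) {
                // A dead member costs the full connect timeout on EVERY
                // send unless the failure is cached by the connection map.
                System.err.println("send to " + member + " failed: " + e);
            }
        }
    }

    public static void main(String[] args) {
        // 203.0.113.1 (TEST-NET-3) is reserved and unroutable.
        long start = System.nanoTime();
        sendToAll(List.of(new InetSocketAddress("203.0.113.1", 7800)),
                  new byte[] {1});
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println("send loop took ~" + ms + " ms");
    }
}
```

This is why the view change alone does not help: the member list used by the send path still contains C's physical address, and each attempt pays the connect cost again.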

      The following stack trace shows where the connect to node C happens with bundling enabled.

      "TransferQueueBundler,mperf,lnx1-60691" prio=10 tid=0x00002aaab4024000 nid=0x6f77 runnable [0x00000000420f7000]
         java.lang.Thread.State: RUNNABLE
              at java.net.PlainSocketImpl.socketConnect(Native Method)
              at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
              - locked <0x00000000f223f0f0> (a java.net.SocksSocketImpl)
              at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
              at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
              at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
              at java.net.Socket.connect(Socket.java:529)
              at org.jgroups.util.Util.connect(Util.java:354)
              at org.jgroups.blocks.TCPConnectionMap$TCPConnection.connect(TCPConnectionMap.java:408)
              at org.jgroups.blocks.TCPConnectionMap$Mapper.getConnection(TCPConnectionMap.java:775)
              at org.jgroups.blocks.TCPConnectionMap.send(TCPConnectionMap.java:184)
              at org.jgroups.protocols.TCP.send(TCP.java:56)
              at org.jgroups.protocols.BasicTCP.sendUnicast(BasicTCP.java:99)
              at org.jgroups.protocols.TP.sendToAllPhysicalAddresses(TP.java:1611)
              at org.jgroups.protocols.BasicTCP.sendMulticast(BasicTCP.java:94)
              at org.jgroups.protocols.TP.doSend(TP.java:1560)
              at org.jgroups.protocols.TP$TransferQueueBundler.sendBundledMessages(TP.java:2329)
              at org.jgroups.protocols.TP$TransferQueueBundler.sendMessages(TP.java:2261)
              at org.jgroups.protocols.TP$TransferQueueBundler.run(TP.java:2246)
              at java.lang.Thread.run(Thread.java:662)

      netstat also shows the repeated connection attempts to node C (192.20.20.33):

      [root@lnx1 ~]# netstat -an| grep 7800
      tcp 0 0 192.20.20.233:7800 0.0.0.0:* LISTEN
      tcp 0 1 192.20.20.233:53070 192.20.20.33:7800 SYN_SENT
      tcp 0 0 192.20.20.233:34237 192.20.20.133:7800 ESTABLISHED
      [root@lnx1 ~]# netstat -an| grep 7800
      tcp 0 0 192.20.20.233:7800 0.0.0.0:* LISTEN
      tcp 0 1 192.20.20.233:36345 192.20.20.33:7800 SYN_SENT
      tcp 0 0 192.20.20.233:34237 192.20.20.133:7800 ESTABLISHED
      [root@lnx1 ~]# netstat -an| grep 7800
      tcp 0 0 192.20.20.233:7800 0.0.0.0:* LISTEN
      tcp 0 1 192.20.20.233:51724 192.20.20.33:7800 SYN_SENT
      tcp 0 113287 192.20.20.233:34237 192.20.20.133:7800 ESTABLISHED
      [root@lnx1 ~]# netstat -an| grep 7800
      tcp 0 0 192.20.20.233:7800 0.0.0.0:* LISTEN
      tcp 0 1 192.20.20.233:51724 192.20.20.33:7800 SYN_SENT
      tcp 0 0 192.20.20.233:34237 192.20.20.133:7800 ESTABLISHED
      [root@lnx1 ~]# netstat -an| grep 7800
      tcp 0 0 192.20.20.233:7800 0.0.0.0:* LISTEN
      tcp 0 1 192.20.20.233:35389 192.20.20.33:7800 SYN_SENT
      tcp 0 0 192.20.20.233:34237 192.20.20.133:7800 ESTABLISHED

      With bundling disabled (I had to use version 3.1 for this), the following stack trace shows where the sender thread keeps trying to connect to node C.

      "sender-0" prio=10 tid=0x000000004cc1e800 nid=0x429 runnable [0x00002afe67973000]
         java.lang.Thread.State: RUNNABLE
              at java.net.PlainSocketImpl.socketConnect(Native Method)
              at java.net.PlainSocketImpl.doConnect(Unknown Source)
              - locked <0x00000000ec6046a0> (a java.net.SocksSocketImpl)
              at java.net.PlainSocketImpl.connectToAddress(Unknown Source)
              at java.net.PlainSocketImpl.connect(Unknown Source)
              at java.net.SocksSocketImpl.connect(Unknown Source)
              at java.net.Socket.connect(Unknown Source)
              at org.jgroups.util.Util.connect(Util.java:305)
              at org.jgroups.blocks.TCPConnectionMap$TCPConnection.<init>(TCPConnectionMap.java:388)
              at org.jgroups.blocks.TCPConnectionMap$Mapper.getConnection(TCPConnectionMap.java:785)
              at org.jgroups.blocks.TCPConnectionMap.send(TCPConnectionMap.java:174)
              at org.jgroups.protocols.TCP.send(TCP.java:56)
              at org.jgroups.protocols.BasicTCP.sendUnicast(BasicTCP.java:86)
              at org.jgroups.protocols.TP.sendToAllPhysicalAddresses(TP.java:1348)
              at org.jgroups.protocols.BasicTCP.sendMulticast(BasicTCP.java:81)
              at org.jgroups.protocols.TP.doSend(TP.java:1296)
              at org.jgroups.protocols.TP.send(TP.java:1285)
              at org.jgroups.protocols.TP.down(TP.java:1143)
              at org.jgroups.protocols.Discovery.down(Discovery.java:573)
              at org.jgroups.protocols.MERGE3.down(MERGE3.java:242)
              at org.jgroups.protocols.FD.down(FD.java:308)
              at org.jgroups.protocols.VERIFY_SUSPECT.down(VERIFY_SUSPECT.java:80)
              at org.jgroups.protocols.pbcast.NAKACK.send(NAKACK.java:667)
              at org.jgroups.protocols.pbcast.NAKACK.send(NAKACK.java:678)
              at org.jgroups.protocols.pbcast.NAKACK.down(NAKACK.java:459)
              at org.jgroups.protocols.UNICAST2.down(UNICAST2.java:531)
              at org.jgroups.protocols.pbcast.STABLE.down(STABLE.java:328)
              at org.jgroups.protocols.pbcast.GMS.down(GMS.java:968)
              at org.jgroups.protocols.FlowControl.down(FlowControl.java:351)
              at org.jgroups.protocols.MFC.handleDownMessage(MFC.java:116)
              at org.jgroups.protocols.FlowControl.down(FlowControl.java:341)
              at org.jgroups.protocols.FRAG2.down(FRAG2.java:147)
              at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:1025)
              at org.jgroups.JChannel.down(JChannel.java:729)
              at org.jgroups.JChannel.send(JChannel.java:445)
              at org.jgroups.tests.perf.MPerf$Sender.run(MPerf.java:564)
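Until the connection map stops retrying dead members, one workaround is to bound each connect attempt via the TCP transport's sock_conn_timeout property (milliseconds). This caps the per-send stall rather than eliminating the retries; the value below is illustrative, not a recommendation:

```xml
<!-- tcp.xml excerpt: cap the blocking connect so a dead member
     stalls the sender for at most 300 ms per attempt (illustrative value) -->
<TCP bind_port="7800"
     sock_conn_timeout="300"/>
```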

              Assignee: Bela Ban (rhn-engineering-bban)
              Reporter: Ramky Kandula (ramky74_jira, Inactive)
