Uploaded image for project: 'JGroups'
  1. JGroups
  2. JGRP-2727

TCP_NIO2: Discovery fails due to NoConnectionPendingException when invoking SocketChannel.finishConnect() in NioBaseServer.Acceptor

    XMLWordPrintable

Details

    • Bug
    • Resolution: Done
    • Major
    • 5.3, 5.2.19, 4.2.24
    • 5.1.9, 5.2.16
    • None
    • False
    • None
    • False
    • Moderate

    Description

      Operating System:
      NAME="SLES"
      VERSION="12-SP3"
      VERSION_ID="12.3"
      PRETTY_NAME="SUSE Linux Enterprise Server 12 SP3"
      ID="sles"
      ANSI_COLOR="0;32"
      CPE_NAME="cpe:/o:suse:sles:12:sp3"
       
       
      JAVA Version:
      openjdk version "11.0.17" 2022-10-18
      OpenJDK Runtime Environment (build 11.0.17+0-suse-3.49.2-x8664)
      OpenJDK 64-Bit Server VM (build 11.0.17+0-suse-3.49.2-x8664, mixed mode)
       
       
      JGroups Version:
      5.2.16.final
       
      Protocol Stack:
      https://github.com/belaban/JGroups/blob/jgroups-5.2.16.Final/conf/tcp-nio.xml
       
      Issue & Background:
      Recently we were testing our application by integrating JGroups 5.2.16.final. The protocol stack used is the default tcp-nio.xml available under above path.
       
      Initial hosts used is as below
      -Djgroups.tcpping.initial_hosts=NodeA[7800],NodeB[7800]
       
      The Node start Order is NodeA, NodeB
       
      The issue is that when NodeB starts it is not able to discover NodeA  and hence forms its own cluster. The TRACE  logs reported ClosedChannelException on NodeA and NodeB during discovery.
       
      TRACE 2023-08-25T05:02:03,849 [jgroups-5,FmCluster,bss-esa-0011-54011 (flags=32767)] org.jgroups.protocols.TCPPING - bss-esa-0011-54011 (flags=32767): sending discovery request to 192.168.1.12:7800
      TRACE 2023-08-25T05:02:03,850 [jgroups-5,FmCluster,bss-esa-0011-54011 (flags=32767)] org.jgroups.protocols.TCP_NIO2 - bss-esa-0011-54011 (flags=32767): sending msg to 192.168.1.12:7800, src=bss-esa-0011-54011 (flags=32767), size=177, headers are TCPPING: [GET_MBRS_REQ cluster=FmCluster initial_discovery=true], TP: [cluster=FmCluster]
      TRACE 2023-08-25T05:02:03,868 [TQ-Bundler-4,FmCluster,bss-esa-0011-54011 (flags=32767)] org.jgroups.protocols.TCP_NIO2 - 192.168.1.11:7800: connecting to 192.168.1.12:7800
      TRACE 2023-08-25T05:02:03,870 [TQ-Bundler-4,FmCluster,bss-esa-0011-54011 (flags=32767)] org.jgroups.protocols.TCP_NIO2 - 192.168.1.11:7800: failed connecting to 192.168.1.12:7800: java.nio.channels.ClosedChannelException
      java.nio.channels.ClosedChannelException: null
      at sun.nio.ch.SocketChannelImpl.ensureOpen(SocketChannelImpl.java:160) ~[?:?]
      at sun.nio.ch.SocketChannelImpl.beginConnect(SocketChannelImpl.java:629) ~[?:?]
      at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:690) ~[?:?]
      at org.jgroups.util.Util.connect(Util.java:444) ~[jgroups.jar:5.2.16.Final]
      at org.jgroups.blocks.cs.NioConnection.connect(NioConnection.java:153) ~[jgroups.jar:5.2.16.Final]
      at org.jgroups.blocks.cs.NioConnection.connect(NioConnection.java:144) ~[jgroups.jar:5.2.16.Final]
      at org.jgroups.blocks.cs.BaseServer.getConnection(BaseServer.java:332) ~[jgroups.jar:5.2.16.Final]
      at org.jgroups.blocks.cs.BaseServer.getConnection(BaseServer.java:298) ~[jgroups.jar:5.2.16.Final]
      at org.jgroups.blocks.cs.BaseServer.send(BaseServer.java:238) ~[jgroups.jar:5.2.16.Final]
      at org.jgroups.protocols.TCP_NIO2.send(TCP_NIO2.java:92) ~[jgroups.jar:5.2.16.Final]
      at org.jgroups.protocols.BasicTCP.sendUnicast(BasicTCP.java:155) ~[jgroups.jar:5.2.16.Final]
      at org.jgroups.protocols.TP.sendTo(TP.java:1359) ~[jgroups.jar:5.2.16.Final]
      at org.jgroups.protocols.TP.doSend(TP.java:1339) ~[jgroups.jar:5.2.16.Final]
      at org.jgroups.protocols.BaseBundler.sendSingleMessage(BaseBundler.java:127) ~[jgroups.jar:5.2.16.Final]
      at org.jgroups.protocols.BaseBundler.sendBundledMessages(BaseBundler.java:108) ~[jgroups.jar:5.2.16.Final]
      at org.jgroups.protocols.TransferQueueBundler.run(TransferQueueBundler.java:118) ~[jgroups.jar:5.2.16.Final]
      at java.lang.Thread.run(Thread.java:829) ~[?:?]
      TRACE 2023-08-25T05:02:03,891 [TQ-Bundler-4,FmCluster,bss-esa-0011-54011 (flags=32767)] org.jgroups.protocols.TCP_NIO2 - 192.168.1.11:7800: removed connection to 192.168.1.12:7800
       INFO 2023-08-25T05:02:05,845 [FmAgent] org.jgroups.protocols.pbcast.GMS - bss-esa-0011-54011 (flags=32767): no members discovered after 2001 ms: creating cluster as coordinator
       
      Further, I attempted to run the org.jgroups.demos.Draw example with the above said configuration and the same issue is observed.
       
      I tried to add few more logs to org.jgroups.blocks.cs.NioBaseServer to see why the channel moves to closed state. It looks like the Acceptor thread in NioBaseServer closes the channel after hitting a NoConnectionPendingException on trying to invoke channel.finishConnect(ln: 139).
       
      As I see, the invocation of channel.connnect() is done by TQ-Bundler thread and channel.finishConnect() is done by Acceptor thread. Hence at race conditions, after Acceptor thread selects a channel it invokes the finishConnect() method even before the TQ-bundler thread invokes the channel.connect leading to NoConnectionPendingException.
       
      Where it works:
         <= 5.1.8, <= 4.2.15
      Where it does not work:
         > 5.1.8,  > 4.2.15
         
      Running Draw Demo with additional logs: (Issue Reproduced)
       
      bss-esa-0011:/home # java -Djava.net.preferIPv4Stack=true -Djgroups.bind_addr=192.168.1.11 -Djgroups.tcpping.initial_hosts=192.168.1.11[7800],192.168.1.12[7800] -cp jgroups.jar org.jgroups.demos.Draw -props /tmp/tcp-nio-j5.xml 
      Aug 25, 2023 5:44:26 AM org.jgroups.JChannel setAddress
      INFO: local_addr: cb25d405-8849-4bcc-b6ae-55fc0e1cc544, name: bss-esa-0011-32953
       
      -------------------------------------------------------------------
      GMS: address=bss-esa-0011-32953, cluster=draw, physical address=192.168.1.11:7800
      -------------------------------------------------------------------
      Aug 25, 2023 5:44:27 AM org.jgroups.blocks.cs.NioBaseServer$Acceptor run
      SEVERE: Closing Connection. Hit an Exception
      java.nio.channels.NoConnectionPendingException
      at java.base/sun.nio.ch.SocketChannelImpl.beginFinishConnect(SocketChannelImpl.java:726)
      at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:770)
      at org.jgroups.blocks.cs.NioBaseServer$Acceptor.run(NioBaseServer.java:153)
      at java.base/java.lang.Thread.run(Thread.java:829)
       
      Aug 25, 2023 5:44:29 AM org.jgroups.protocols.pbcast.ClientGmsImpl joinInternal
      INFO: bss-esa-0011-32953: no members discovered after 2005 ms: creating cluster as coordinator
      ** View=[bss-esa-0011-32953|0] (1) [bss-esa-0011-32953]
      ** MergeView::[bss-esa-0011-32953|1] (2) [bss-esa-0011-32953, bss-esa-0012-39102], 2 subgroups: [bss-esa-0012-39102|0] (1) [bss-esa-0012-39102], [bss-esa-0011-32953|0] (1) [bss-esa-0011-32953]
       
      bss-esa-0012:/home # java -Djava.net.preferIPv4Stack=true -Djgroups.bind_addr=192.168.1.12 -Djgroups.tcpping.initial_hosts=192.168.1.11[7800],192.168.1.12[7800] -cp jgroups.jar org.jgroups.demos.Draw -props /tmp/tcp-nio-j5.xml 
      Aug 25, 2023 5:44:27 AM org.jgroups.JChannel setAddress
      INFO: local_addr: e7185603-104b-4cff-8947-412ae1dcf6a2, name: bss-esa-0012-39102
       
      -------------------------------------------------------------------
      GMS: address=bss-esa-0012-39102, cluster=draw, physical address=192.168.1.12:7800
      -------------------------------------------------------------------
      Aug 25, 2023 5:44:30 AM org.jgroups.protocols.pbcast.ClientGmsImpl joinInternal
      INFO: bss-esa-0012-39102: no members discovered after 2006 ms: creating cluster as coordinator
      ** View=[bss-esa-0012-39102|0] (1) [bss-esa-0012-39102]
      ** MergeView::[bss-esa-0011-32953|1] (2) [bss-esa-0011-32953, bss-esa-0012-39102], 2 subgroups: [bss-esa-0012-39102|0] (1) [bss-esa-0012-39102], [bss-esa-0011-32953|0] (1) [bss-esa-0011-32953]
       
      Additional Note:
      The issue is observed only when using TCP_NIO2 as transport alongside any discovery protocol. It is not observed when using with TCP/UDP transport.

       

      Possible Fix:

      One possible fix will be to include an additional check to see if the connection is pending before invoking ch.finishConnect()

      if ((ch.isConnectionPending() && ch.finishConnect()) || ch.isConnected())

      {     conn.clearSelectionKey(SelectionKey.OP_CONNECT);     conn.connected(true); }

      Attachments

        Issue Links

          Activity

            People

              rhn-engineering-bban Bela Ban
              arunkarthik.s Arunkarthik Selvaraj (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: