Operating System:
NAME="SLES"
VERSION="12-SP3"
VERSION_ID="12.3"
PRETTY_NAME="SUSE Linux Enterprise Server 12 SP3"
ID="sles"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:12:sp3"
JAVA Version:
openjdk version "11.0.17" 2022-10-18
OpenJDK Runtime Environment (build 11.0.17+0-suse-3.49.2-x8664)
OpenJDK 64-Bit Server VM (build 11.0.17+0-suse-3.49.2-x8664, mixed mode)
JGroups Version:
5.2.16.final
Protocol Stack:
https://github.com/belaban/JGroups/blob/jgroups-5.2.16.Final/conf/tcp-nio.xml
Issue & Background:
Recently we were testing our application by integrating JGroups 5.2.16.final. The protocol stack used is the default tcp-nio.xml available under above path.
Initial hosts used is as below
-Djgroups.tcpping.initial_hosts=NodeA[7800],NodeB[7800]
The Node start Order is NodeA, NodeB
The issue is that when NodeB starts it is not able to discover NodeA and hence forms its own cluster. The TRACE logs reported ClosedChannelException on NodeA and NodeB during discovery.
TRACE 2023-08-25T05:02:03,849 [jgroups-5,FmCluster,bss-esa-0011-54011 (flags=32767)] org.jgroups.protocols.TCPPING - bss-esa-0011-54011 (flags=32767): sending discovery request to 192.168.1.12:7800
TRACE 2023-08-25T05:02:03,850 [jgroups-5,FmCluster,bss-esa-0011-54011 (flags=32767)] org.jgroups.protocols.TCP_NIO2 - bss-esa-0011-54011 (flags=32767): sending msg to 192.168.1.12:7800, src=bss-esa-0011-54011 (flags=32767), size=177, headers are TCPPING: [GET_MBRS_REQ cluster=FmCluster initial_discovery=true], TP: [cluster=FmCluster]
TRACE 2023-08-25T05:02:03,868 [TQ-Bundler-4,FmCluster,bss-esa-0011-54011 (flags=32767)] org.jgroups.protocols.TCP_NIO2 - 192.168.1.11:7800: connecting to 192.168.1.12:7800
TRACE 2023-08-25T05:02:03,870 [TQ-Bundler-4,FmCluster,bss-esa-0011-54011 (flags=32767)] org.jgroups.protocols.TCP_NIO2 - 192.168.1.11:7800: failed connecting to 192.168.1.12:7800: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException: null
at sun.nio.ch.SocketChannelImpl.ensureOpen(SocketChannelImpl.java:160) ~[?:?]
at sun.nio.ch.SocketChannelImpl.beginConnect(SocketChannelImpl.java:629) ~[?:?]
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:690) ~[?:?]
at org.jgroups.util.Util.connect(Util.java:444) ~[jgroups.jar:5.2.16.Final]
at org.jgroups.blocks.cs.NioConnection.connect(NioConnection.java:153) ~[jgroups.jar:5.2.16.Final]
at org.jgroups.blocks.cs.NioConnection.connect(NioConnection.java:144) ~[jgroups.jar:5.2.16.Final]
at org.jgroups.blocks.cs.BaseServer.getConnection(BaseServer.java:332) ~[jgroups.jar:5.2.16.Final]
at org.jgroups.blocks.cs.BaseServer.getConnection(BaseServer.java:298) ~[jgroups.jar:5.2.16.Final]
at org.jgroups.blocks.cs.BaseServer.send(BaseServer.java:238) ~[jgroups.jar:5.2.16.Final]
at org.jgroups.protocols.TCP_NIO2.send(TCP_NIO2.java:92) ~[jgroups.jar:5.2.16.Final]
at org.jgroups.protocols.BasicTCP.sendUnicast(BasicTCP.java:155) ~[jgroups.jar:5.2.16.Final]
at org.jgroups.protocols.TP.sendTo(TP.java:1359) ~[jgroups.jar:5.2.16.Final]
at org.jgroups.protocols.TP.doSend(TP.java:1339) ~[jgroups.jar:5.2.16.Final]
at org.jgroups.protocols.BaseBundler.sendSingleMessage(BaseBundler.java:127) ~[jgroups.jar:5.2.16.Final]
at org.jgroups.protocols.BaseBundler.sendBundledMessages(BaseBundler.java:108) ~[jgroups.jar:5.2.16.Final]
at org.jgroups.protocols.TransferQueueBundler.run(TransferQueueBundler.java:118) ~[jgroups.jar:5.2.16.Final]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
TRACE 2023-08-25T05:02:03,891 [TQ-Bundler-4,FmCluster,bss-esa-0011-54011 (flags=32767)] org.jgroups.protocols.TCP_NIO2 - 192.168.1.11:7800: removed connection to 192.168.1.12:7800
INFO 2023-08-25T05:02:05,845 [FmAgent] org.jgroups.protocols.pbcast.GMS - bss-esa-0011-54011 (flags=32767): no members discovered after 2001 ms: creating cluster as coordinator
Further, I attempted to run the org.jgroups.demos.Draw example with the above said configuration and the same issue is observed.
I tried to add few more logs to org.jgroups.blocks.cs.NioBaseServer to see why the channel moves to closed state. It looks like the Acceptor thread in NioBaseServer closes the channel after hitting a NoConnectionPendingException on trying to invoke channel.finishConnect(ln: 139).
As I see, the invocation of channel.connnect() is done by TQ-Bundler thread and channel.finishConnect() is done by Acceptor thread. Hence at race conditions, after Acceptor thread selects a channel it invokes the finishConnect() method even before the TQ-bundler thread invokes the channel.connect leading to NoConnectionPendingException.
Where it works:
<= 5.1.8, <= 4.2.15
Where it does not work:
> 5.1.8, > 4.2.15
Running Draw Demo with additional logs: (Issue Reproduced)
bss-esa-0011:/home # java -Djava.net.preferIPv4Stack=true -Djgroups.bind_addr=192.168.1.11 -Djgroups.tcpping.initial_hosts=192.168.1.11[7800],192.168.1.12[7800] -cp jgroups.jar org.jgroups.demos.Draw -props /tmp/tcp-nio-j5.xml
Aug 25, 2023 5:44:26 AM org.jgroups.JChannel setAddress
INFO: local_addr: cb25d405-8849-4bcc-b6ae-55fc0e1cc544, name: bss-esa-0011-32953
-------------------------------------------------------------------
GMS: address=bss-esa-0011-32953, cluster=draw, physical address=192.168.1.11:7800
-------------------------------------------------------------------
Aug 25, 2023 5:44:27 AM org.jgroups.blocks.cs.NioBaseServer$Acceptor run
SEVERE: Closing Connection. Hit an Exception
java.nio.channels.NoConnectionPendingException
at java.base/sun.nio.ch.SocketChannelImpl.beginFinishConnect(SocketChannelImpl.java:726)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:770)
at org.jgroups.blocks.cs.NioBaseServer$Acceptor.run(NioBaseServer.java:153)
at java.base/java.lang.Thread.run(Thread.java:829)
Aug 25, 2023 5:44:29 AM org.jgroups.protocols.pbcast.ClientGmsImpl joinInternal
INFO: bss-esa-0011-32953: no members discovered after 2005 ms: creating cluster as coordinator
** View=[bss-esa-0011-32953|0] (1) [bss-esa-0011-32953]
** MergeView::[bss-esa-0011-32953|1] (2) [bss-esa-0011-32953, bss-esa-0012-39102], 2 subgroups: [bss-esa-0012-39102|0] (1) [bss-esa-0012-39102], [bss-esa-0011-32953|0] (1) [bss-esa-0011-32953]
bss-esa-0012:/home # java -Djava.net.preferIPv4Stack=true -Djgroups.bind_addr=192.168.1.12 -Djgroups.tcpping.initial_hosts=192.168.1.11[7800],192.168.1.12[7800] -cp jgroups.jar org.jgroups.demos.Draw -props /tmp/tcp-nio-j5.xml
Aug 25, 2023 5:44:27 AM org.jgroups.JChannel setAddress
INFO: local_addr: e7185603-104b-4cff-8947-412ae1dcf6a2, name: bss-esa-0012-39102
-------------------------------------------------------------------
GMS: address=bss-esa-0012-39102, cluster=draw, physical address=192.168.1.12:7800
-------------------------------------------------------------------
Aug 25, 2023 5:44:30 AM org.jgroups.protocols.pbcast.ClientGmsImpl joinInternal
INFO: bss-esa-0012-39102: no members discovered after 2006 ms: creating cluster as coordinator
** View=[bss-esa-0012-39102|0] (1) [bss-esa-0012-39102]
** MergeView::[bss-esa-0011-32953|1] (2) [bss-esa-0011-32953, bss-esa-0012-39102], 2 subgroups: [bss-esa-0012-39102|0] (1) [bss-esa-0012-39102], [bss-esa-0011-32953|0] (1) [bss-esa-0011-32953]
Additional Note:
The issue is observed only when using TCP_NIO2 as transport alongside any discovery protocol. It is not observed when using with TCP/UDP transport.
Possible Fix:
One possible fix will be to include an additional check to see if the connection is pending before invoking ch.finishConnect()
if ((ch.isConnectionPending() && ch.finishConnect()) || ch.isConnected())
{ conn.clearSelectionKey(SelectionKey.OP_CONNECT); conn.connected(true); }- is caused by
-
JGRP-2531 failed handling message java.nio.channels.NotYetConnectedException: null after upgrade from jgroups-5.0.4 to jgroups-5.1.5
- Resolved