Brian Stansberry said:
One of the EJB3/AS integration testsuite runs hit a JGroups problem last
night. It looks like some issue between the two AS nodes in the test
(127.0.0.1 and 127.0.0.2) caused the channels on one node to be suspected
and then to reconnect. The reconnect didn't go well, and that's the subject
of this thread.
I have the server logs from both, which record at INFO or above; if
anyone wants them, ping me. Here are the highlights:
1) Both nodes connect 4 channels as part of AS startup: 3 using a shared
transport UDP stack (for JBC, HAPartition and JBM) and 1 using a separate
TCP stack for JBM.
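For anyone not familiar with the shared transport, here's a minimal sketch
of the idea (illustrative only; the stack file name and cluster names below
are assumptions, not the actual AS configuration): in JGroups 2.6+ the
transport element carries a singleton_name attribute, and every channel
created from a stack with the same singleton_name reuses a single transport
instance.

import org.jgroups.JChannel;

public class SharedTransportSketch {
    public static void main(String[] args) throws Exception {
        // "shared-udp.xml" is a placeholder for a stack whose UDP element sets
        // singleton_name; all channels built from it share one UDP transport
        // (one socket, one set of transport threads).
        JChannel haPartition = new JChannel("shared-udp.xml");
        JChannel jbcCache    = new JChannel("shared-udp.xml");
        JChannel jbmControl  = new JChannel("shared-udp.xml");

        haPartition.connect("DefaultPartition");   // three different cluster names,
        jbcCache.connect("SFSBCache");             // one underlying transport
        jbmControl.connect("MessagingPostOffice");
    }
}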
2) A bunch of test cases then execute; each deploys an EJB3 jar on each
node in turn, which causes another JBC channel to be created and connected
at test start and then disconnected and closed at test end. This channel
uses the same shared transport as the other 3, so during each test there
are four channels using that shared transport.
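Roughly, the per-test lifecycle of that extra channel looks like the
following (just the shape of it, with made-up names; not the actual JBC
deployer code):

import org.jgroups.JChannel;

public class PerTestChannelCycle {
    // Illustrative only: connect at deployment, disconnect/close at test end.
    static void runTest(String sharedStackFile, String testName) throws Exception {
        JChannel cacheChannel = new JChannel(sharedStackFile); // same shared transport as the other 3
        cacheChannel.connect("SFSBCache-" + testName);         // connect at test start
        try {
            // ... test body runs against the deployed EJB3 jar ...
        }
        finally {
            cacheChannel.disconnect();                         // disconnect at test end
            cacheChannel.close();                              // close at undeploy
        }
    }
}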
3) This create/close cycle continues for a while, until on one node some
issue results in one of the 4 UDP channels suspecting its peer and all 4
UDP channels getting a new view (a generic listener-side sketch follows the
log excerpt):
13:23:58,150 INFO [DefaultPartition] Suspected member: 127.0.0.1:40089
13:23:58,257 INFO [GroupMember]
org.jboss.messaging.core.impl.postoffice.GroupMember$ControlMembershipListener@4cfcb4de
got new view [127.0.0.2:42088|2] [127.0.0.2:42088], old view is
[127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088]
13:23:58,257 INFO [GroupMember] I am (127.0.0.2:42088)
13:23:58,265 INFO [DefaultPartition] New cluster view for partition
DefaultPartition (id: 2, delta: -1) : [127.0.0.2:1099]
13:23:58,275 INFO [MessagingPostOffice] JBoss Messaging is failing over
for failed node 0. If there are many messages to reload this may take
some time...
13:23:58,442 INFO [DefaultPartition] I am (127.0.0.2:1099) received
membershipChanged event:
13:23:58,442 INFO [DefaultPartition] Dead members: 1 ([127.0.0.1:1099])
13:23:58,442 INFO [DefaultPartition] New Members : 0 ([])
13:23:58,442 INFO [DefaultPartition] All Members : 1 ([127.0.0.2:1099])
13:23:58,581 INFO [RPCManagerImpl] Received new cluster view:
[127.0.0.2:42088|2] [127.0.0.2:42088]
13:23:58,620 INFO [RPCManagerImpl] Received new cluster view:
[127.0.0.2:42088|2] [127.0.0.2:42088]
13:24:00,260 WARN [GMS] 127.0.0.2:42088 failed to collect all ACKs (1)
for mcasted view [127.0.0.2:42088|2] [127.0.0.2:42088] after 2000ms,
missing ACKs from [127.0.0.2:42088] (received=[]),
local_addr=127.0.0.2:42088
13:24:00,661 INFO [MessagingPostOffice] JBoss Messaging failover completed
13:24:00,662 INFO [GroupMember] Dead members: 1 ([127.0.0.1:40089])
13:24:00,662 INFO [GroupMember] All Members : 1 ([127.0.0.2:42088])
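For reference, the INFO lines above are each service reacting to its own
channel's view change; generically that's just a registered listener getting
viewAccepted(), as in this plain-JGroups sketch (not the AS code):

import org.jgroups.JChannel;
import org.jgroups.ReceiverAdapter;
import org.jgroups.View;

public class ViewLogger extends ReceiverAdapter {
    public void viewAccepted(View newView) {
        System.out.println("got new view " + newView);
    }

    public static void main(String[] args) throws Exception {
        JChannel ch = new JChannel();       // default stack, just for the sketch
        ch.setReceiver(new ViewLogger());
        ch.connect("DefaultPartition");
    }
}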
4) At some point the suspected node realizes it has been suspected, and the
4 UDP channels decide to reconnect: two at about the same time and the
other two within 55 ms of the first two (a note on auto-reconnect follows
the excerpt):
13:24:39,294 WARN [FD] I was suspected by 127.0.0.2:42088; ignoring the
SUSPECT message and sending back a HEARTBEAT_ACK
13:24:39,295 WARN [FD] I was suspected by 127.0.0.2:42088; ignoring the
SUSPECT message and sending back a HEARTBEAT_ACK
13:24:39,315 ERROR [ConnectionTable] failed sending data to
127.0.0.2:7900: java.net.SocketException: Broken pipe
13:24:39,325 WARN [GMS] I (127.0.0.1:40089) am not a member of view
[127.0.0.2:42088|2] [127.0.0.2:42088], shunning myself and leaving the
group (prev_members are [127.0.0.1:40089, 127.0.0.2:42088], current view
is [127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088])
13:24:39,325 WARN [GMS] I (127.0.0.1:40089) am not a member of view
[127.0.0.2:42088|2] [127.0.0.2:42088], shunning myself and leaving the
group (prev_members are [127.0.0.1:40089, 127.0.0.2:42088], current view
is [127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088])
13:24:39,339 WARN [GMS] I (127.0.0.1:40089) am not a member of view
[127.0.0.2:42088|2] [127.0.0.2:42088], shunning myself and leaving the
group (prev_members are [127.0.0.1:40089, 127.0.0.2:42088], current view
is [127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088])
13:24:39,359 WARN [FD] I was suspected by 127.0.0.2:42088; ignoring the
SUSPECT message and sending back a HEARTBEAT_ACK
13:24:39,360 WARN [FD] I was suspected by 127.0.0.2:42088; ignoring the
SUSPECT message and sending back a HEARTBEAT_ACK
13:24:39,380 WARN [GMS] I (127.0.0.1:40089) am not a member of view
[127.0.0.2:42088|2] [127.0.0.2:42088], shunning myself and leaving the
group (prev_members are [127.0.0.1:40089, 127.0.0.2:42088], current view
is [127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088])
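The shunned channels try to reconnect at all presumably because they run
with auto-reconnect enabled, which is what later hands the work to the
CloserThread. In generic JGroups 2.x terms that's roughly (not the exact AS
wiring):

import org.jgroups.Channel;
import org.jgroups.JChannel;

public class AutoReconnectSketch {
    public static void main(String[] args) throws Exception {
        JChannel ch = new JChannel();                     // default stack, just for the sketch
        ch.setOpt(Channel.AUTO_RECONNECT, Boolean.TRUE);  // on shun: close, then reconnect (CloserThread)
        ch.setOpt(Channel.AUTO_GETSTATE, Boolean.TRUE);   // optionally re-fetch state after reconnect
        ch.connect("DefaultPartition");
    }
}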
5) This is where things break down. It looks like one of the 4 channels is
able to get a local_addr, but the others get null. Perhaps a concurrency
bug in Configurator.startProtocolStack? (Note the first 2 lines below
relate to the independent TCP channel used by JBM; I include them for
completeness. A guess at a reproducer shape follows the excerpt.)
13:24:39,568 ERROR [ConnectionTable] failed sending data to
127.0.0.2:7900: java.net.SocketException: Broken pipe
13:24:39,959 ERROR [ConnectionTable] failed sending data to
127.0.0.2:7900: java.net.SocketException: Broken pipe
13:24:40,833 WARN [GMS]
13:24:40,834 WARN [GMS]
13:24:41,362 INFO [STDOUT]
---------------------------------------------------------
GMS: address is 127.0.0.1:56000 (cluster=DefaultPartition-SFSBCache)
---------------------------------------------------------
13:24:41,362 FATAL [JChannel] local_addr is null; cannot connect
13:24:41,363 FATAL [JChannel] local_addr is null; cannot connect
13:24:41,365 ERROR [JChannel] failure reconnecting to channel, retrying
org.jgroups.ChannelException: local_addr is null
at org.jgroups.JChannel.startStack(JChannel.java:1631)
at org.jgroups.JChannel.connect(JChannel.java:366)
at org.jgroups.JChannel$CloserThread.run(JChannel.java:2046)
13:24:41,365 ERROR [JChannel] failure reconnecting to channel, retrying
org.jgroups.ChannelException: local_addr is null
at org.jgroups.JChannel.startStack(JChannel.java:1631)
at org.jgroups.JChannel.connect(JChannel.java:366)
at org.jgroups.JChannel$CloserThread.run(JChannel.java:2046)
13:24:41,488 INFO [RPCManagerImpl] Received new cluster view:
[127.0.0.2:42088|3] [127.0.0.2:42088, 127.0.0.1:56000]
13:24:41,564 FATAL [JChannel] local_addr is null; cannot connect
13:24:41,564 ERROR [JChannel] failure reconnecting to channel, retrying
org.jgroups.ChannelException: local_addr is null
at org.jgroups.JChannel.startStack(JChannel.java:1631)
at org.jgroups.JChannel.connect(JChannel.java:366)
at org.jgroups.JChannel$CloserThread.run(JChannel.java:2046)
13:24:41,927 INFO [LegacyStateTransferIntegrator] Using version 4096
13:24:45,941 WARN [GMS] Merge aborted. Merge leader did not get
MergeData from all subgroup coordinators [127.0.0.2:42088, 127.0.0.1:40089]
13:24:45,993 WARN [GMS] Merge aborted. Merge leader did not get
MergeData from all subgroup coordinators [127.0.0.2:7900, 127.0.0.1:7900]
13:24:45,995 WARN [GMS] merge was supposed to be cancelled at merge
participant 127.0.0.1:7900 (merge_id=[127.0.0.1:7900|1246037080987]),
but it is not since merge ids do not match
13:24:46,029 WARN [GMS] Merge aborted. Merge leader did not get
MergeData from all subgroup coordinators [127.0.0.2:42088, 127.0.0.1:40089]
13:24:46,029 WARN [GMS] resume([127.0.0.1:40089|1246037080833]) does
not match null
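If it is a concurrency issue in Configurator.startProtocolStack, my guess
at the trigger is several channels over the same shared transport connecting
at nearly the same time, as the 4 channels do in #4. Something like the
following (a pure guess at a reproducer; stack file and cluster names are
placeholders):

import org.jgroups.JChannel;

public class ConcurrentConnectGuess {
    public static void main(String[] args) throws Exception {
        Thread[] starters = new Thread[4];
        for (int i = 0; i < 4; i++) {
            final String clusterName = "cluster-" + i;
            starters[i] = new Thread(new Runnable() {
                public void run() {
                    try {
                        JChannel ch = new JChannel("shared-udp.xml"); // same shared transport
                        ch.connect(clusterName); // does only the first one get SET_LOCAL_ADDRESS?
                    }
                    catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        for (Thread t : starters) t.start();
        for (Thread t : starters) t.join();
    }
}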
6) Thereafter, there is continual looping of the CloserThread as it
tries to connect a channel, fails due to local_addr == null, waits 1 sec
and tries again, which goes on for hours:
13:24:41,363 FATAL [JChannel] local_addr is null; cannot connect
13:24:41,365 ERROR [JChannel] failure reconnecting to channel, retrying
org.jgroups.ChannelException: local_addr is null
at org.jgroups.JChannel.startStack(JChannel.java:1631)
at org.jgroups.JChannel.connect(JChannel.java:366)
at org.jgroups.JChannel$CloserThread.run(JChannel.java:2046)
As I said, I suspect a concurrency issue that results in one channel
getting a local_addr while the other channels on the same shared transport
get null.
Hmm, just noticed: the looping in #6 above repeats every 30 secs, not
every second. That matches the LOCAL_ADDR_TIMEOUT used in
JChannel.startStack():
/* Wait LOCAL_ADDR_TIMEOUT milliseconds for local_addr to have a
   non-null value (set by SET_LOCAL_ADDRESS) */
local_addr=local_addr_promise.getResult(LOCAL_ADDR_TIMEOUT);
if(local_addr == null) {
    log.fatal("local_addr is null; cannot connect");
    throw new ChannelException("local_addr is null");
}
So it seems the recurring problem is that local_addr_promise isn't getting
set at all, rather than getting set with a null value. However, the first
set of FATAL logging occurs just a couple of seconds after GMS logs that it
is going to reconnect, so that first local_addr == null failure is not due
to the timeout.
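For what it's worth, the 30-sec spacing is exactly what you'd see from a
Promise that nobody ever calls setResult() on: getResult(timeout) just
blocks for the timeout and then returns null. Quick demo against the (raw,
2.x-style) org.jgroups.util.Promise, with a shortened timeout:

import org.jgroups.util.Promise;

public class PromiseTimeoutDemo {
    public static void main(String[] args) {
        Promise addrPromise = new Promise();

        long start = System.currentTimeMillis();
        Object result = addrPromise.getResult(2000);   // nobody calls setResult()
        System.out.println("result=" + result + " after "
                + (System.currentTimeMillis() - start) + " ms"); // null after ~2000 ms

        addrPromise.setResult("127.0.0.1:56000");        // once a result is set...
        System.out.println(addrPromise.getResult(2000)); // ...it comes back immediately
    }
}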