Uploaded image for project: 'JGroups'
  1. JGroups
  2. JGRP-1006

Shared transport: shunned multiple channels can not reconnect

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Done
    • Icon: Major Major
    • 2.6.11, 2.8
    • 2.6
    • None

      Brian Stansberry said:

      One of the EJB3/AS integration testsuite runs hit a JGroups problem last
      night. Looks like some issue between the two AS nodes (127.0.0.1 and
      127.0.0.2) in the test caused the channels in one node to be
      suspected/reconnect. Which didn't go well – which is the subject of
      this thread.

      I have the server logs from both, which record at INFO or above; if
      anyone wants them, ping me. Here are the highlights:

      1) Both nodes connect 4 channels as part of AS startup: 3 using a shared
      transport UDP stack (for JBC, HAPartition, JBM) and one TCP one for JBM.

      2) A bunch of test cases then execute, each deploys an EJB3 jar on each
      node in turn, which in turn causes another JBC channel to be created and
      connect at test start and disconnect and close at test end. This channel
      uses the same shared transport as the other 3, so now there are four
      channels using that shared transport.

      3) This create/close cycle continues a while, until on one node there is
      some issue that results in one of the 4 UDP channels suspecting its peer
      and all 4 UDP channels getting a new view:

      13:23:58,150 INFO [DefaultPartition] Suspected member: 127.0.0.1:40089
      13:23:58,257 INFO [GroupMember]
      org.jboss.messaging.core.impl.postoffice.GroupMember$ControlMembershipListener@4cfcb4de
      got new view [127.0.0.2:42088|2] [127.0.0.2:42088], old view is
      [127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088]
      13:23:58,257 INFO [GroupMember] I am (127.0.0.2:42088)
      13:23:58,265 INFO [DefaultPartition] New cluster view for partition
      DefaultPartition (id: 2, delta: -1) : [127.0.0.2:1099]
      13:23:58,275 INFO [MessagingPostOffice] JBoss Messaging is failing over
      for failed node 0. If there are many messages to reload this may take
      some time...
      13:23:58,442 INFO [DefaultPartition] I am (127.0.0.2:1099) received
      membershipChanged event:
      13:23:58,442 INFO [DefaultPartition] Dead members: 1 ([127.0.0.1:1099])
      13:23:58,442 INFO [DefaultPartition] New Members : 0 ([])
      13:23:58,442 INFO [DefaultPartition] All Members : 1 ([127.0.0.2:1099])
      13:23:58,581 INFO [RPCManagerImpl] Received new cluster view:
      [127.0.0.2:42088|2] [127.0.0.2:42088]
      13:23:58,620 INFO [RPCManagerImpl] Received new cluster view:
      [127.0.0.2:42088|2] [127.0.0.2:42088]
      13:24:00,260 WARN [GMS] 127.0.0.2:42088 failed to collect all ACKs (1)
      for mcasted view [127.0.0.2:42088|2] [127.0.0.2:42088] after 2000ms,
      missing ACKs from [127.0.0.2:42088] (received=[]),
      local_addr=127.0.0.2:42088
      13:24:00,661 INFO [MessagingPostOffice] JBoss Messaging failover completed
      13:24:00,662 INFO [GroupMember] Dead members: 1 ([127.0.0.1:40089])
      13:24:00,662 INFO [GroupMember] All Members : 1 ([127.0.0.2:42088])

      4) At some point the suspected node realizes it is suspected and the 4
      UDP channels decide to reconnect, two at about the same time and the
      other two w/in 55 ms of the first two:

      13:24:39,294 WARN [FD] I was suspected by 127.0.0.2:42088; ignoring the
      SUSPECT message and sending back a HEARTBEAT_ACK
      13:24:39,295 WARN [FD] I was suspected by 127.0.0.2:42088; ignoring the
      SUSPECT message and sending back a HEARTBEAT_ACK
      13:24:39,315 ERROR [ConnectionTable] failed sending data to
      127.0.0.2:7900: java.net.SocketException: Broken pipe
      13:24:39,325 WARN [GMS] I (127.0.0.1:40089) am not a member of view
      [127.0.0.2:42088|2] [127.0.0.2:42088], shunning myself and leaving the
      group (prev_members are [127.0.0.1:40089, 127.0.0.2:42088], current view
      is [127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088])
      13:24:39,325 WARN [GMS] I (127.0.0.1:40089) am not a member of view
      [127.0.0.2:42088|2] [127.0.0.2:42088], shunning myself and leaving the
      group (prev_members are [127.0.0.1:40089, 127.0.0.2:42088], current view
      is [127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088])
      13:24:39,339 WARN [GMS] I (127.0.0.1:40089) am not a member of view
      [127.0.0.2:42088|2] [127.0.0.2:42088], shunning myself and leaving the
      group (prev_members are [127.0.0.1:40089, 127.0.0.2:42088], current view
      is [127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088])
      13:24:39,359 WARN [FD] I was suspected by 127.0.0.2:42088; ignoring the
      SUSPECT message and sending back a HEARTBEAT_ACK
      13:24:39,360 WARN [FD] I was suspected by 127.0.0.2:42088; ignoring the
      SUSPECT message and sending back a HEARTBEAT_ACK
      13:24:39,380 WARN [GMS] I (127.0.0.1:40089) am not a member of view
      [127.0.0.2:42088|2] [127.0.0.2:42088], shunning myself and leaving the
      group (prev_members are [127.0.0.1:40089, 127.0.0.2:42088], current view
      is [127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088])

      5) This is where things break down. It looks like one of the 4 channels
      is able to get a local_addr, but the others get null. Perhaps a
      concurrency bug in Configurator.startProtocolStack? (Note the first 2
      lines below have to do with the independent TCP channel used by JBM; I
      post them for completeness.)

      13:24:39,568 ERROR [ConnectionTable] failed sending data to
      127.0.0.2:7900: java.net.SocketException: Broken pipe
      13:24:39,959 ERROR [ConnectionTable] failed sending data to
      127.0.0.2:7900: java.net.SocketException: Broken pipe
      13:24:40,833 WARN [GMS]
      13:24:40,834 WARN [GMS]
      13:24:41,362 INFO [STDOUT]
      ---------------------------------------------------------
      GMS: address is 127.0.0.1:56000 (cluster=DefaultPartition-SFSBCache)
      ---------------------------------------------------------
      13:24:41,362 FATAL [JChannel] local_addr is null; cannot connect
      13:24:41,363 FATAL [JChannel] local_addr is null; cannot connect
      13:24:41,365 ERROR [JChannel] failure reconnecting to channel, retrying
      org.jgroups.ChannelException: local_addr is null
      at org.jgroups.JChannel.startStack(JChannel.java:1631)
      at org.jgroups.JChannel.connect(JChannel.java:366)
      at org.jgroups.JChannel$CloserThread.run(JChannel.java:2046)
      13:24:41,365 ERROR [JChannel] failure reconnecting to channel, retrying
      org.jgroups.ChannelException: local_addr is null
      at org.jgroups.JChannel.startStack(JChannel.java:1631)
      at org.jgroups.JChannel.connect(JChannel.java:366)
      at org.jgroups.JChannel$CloserThread.run(JChannel.java:2046)
      13:24:41,488 INFO [RPCManagerImpl] Received new cluster view:
      [127.0.0.2:42088|3] [127.0.0.2:42088, 127.0.0.1:56000]
      13:24:41,564 FATAL [JChannel] local_addr is null; cannot connect
      13:24:41,564 ERROR [JChannel] failure reconnecting to channel, retrying
      org.jgroups.ChannelException: local_addr is null
      at org.jgroups.JChannel.startStack(JChannel.java:1631)
      at org.jgroups.JChannel.connect(JChannel.java:366)
      at org.jgroups.JChannel$CloserThread.run(JChannel.java:2046)
      13:24:41,927 INFO [LegacyStateTransferIntegrator] Using version 4096
      13:24:45,941 WARN [GMS] Merge aborted. Merge leader did not get
      MergeData from all subgroup coordinators [127.0.0.2:42088, 127.0.0.1:40089]
      13:24:45,993 WARN [GMS] Merge aborted. Merge leader did not get
      MergeData from all subgroup coordinators [127.0.0.2:7900, 127.0.0.1:7900]
      13:24:45,995 WARN [GMS] merge was supposed to be cancelled at merge
      participant 127.0.0.1:7900 (merge_id=[127.0.0.1:7900|1246037080987]),
      but it is not since merge ids do not match
      13:24:46,029 WARN [GMS] Merge aborted. Merge leader did not get
      MergeData from all subgroup coordinators [127.0.0.2:42088, 127.0.0.1:40089]
      13:24:46,029 WARN [GMS] resume([127.0.0.1:40089|1246037080833]) does
      not match null

      6) Thereafter, there is continual looping of the CloserThread as it
      tries to connect a channel, fails due to local_addr == null, waits 1 sec
      and tries again, which goes on for hours:

      13:24:41,363 FATAL [JChannel] local_addr is null; cannot connect
      13:24:41,365 ERROR [JChannel] failure reconnecting to channel, retrying
      org.jgroups.ChannelException: local_addr is null
      at org.jgroups.JChannel.startStack(JChannel.java:1631)
      at org.jgroups.JChannel.connect(JChannel.java:366)
      at org.jgroups.JChannel$CloserThread.run(JChannel.java:2046)

      As I said, I suspect a concurrency issue that's resulting in one channel
      getting a local_addr but thereafter others with the same shared
      transport get null.

      Hmm, just noticed: the looping in #6 above repeats every 30 secs, not
      every second. Which is the timeout in JChannel.startStack() :

      /* Wait LOCAL_ADDR_TIMEOUT milliseconds for local_addr to have a
      non-null value (set by SET_LOCAL_ADDRESS) */
      local_addr=local_addr_promise.getResult(LOCAL_ADDR_TIMEOUT);
      if(local_addr == null) {
      log.fatal("local_addr is null; cannot connect");
      throw new ChannelException("local_addr is null");
      }

      So, seems the recurring problem is local_addr_promise isn't getting set,
      not that it is getting set, but with a null value. However the first
      set of FATAL logging occurs just a couple seconds after GMS logs it is
      going to reconnect. So that first issue where local_addr == null is not
      due to a timeout.

              vblagoje Vladimir Blagojevic (Inactive)
              vblagoje Vladimir Blagojevic (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

                Created:
                Updated:
                Resolved: