  JGroups / JGRP-1671

TCPConnectionMap does not seem to recover after a network failure


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 3.3.4
    • Fix Version/s: None
    • Labels:
      None

      Description

      I got the following exception on one node (call it node 1).

      WARN [ConnectionMap.Acceptor [xxx.xxx.xxx.xxx:34383],null,null] org.jgroups.protocols.TCP [JGRP00006] failed accepting connection from peer: %s
      java.net.SocketTimeoutException: Read timed out
      at java.net.SocketInputStream.socketRead0(Native Method) ~[na:1.7.0_17]
      at java.net.SocketInputStream.read(SocketInputStream.java:150) ~[na:1.7.0_17]
      at java.net.SocketInputStream.read(SocketInputStream.java:121) ~[na:1.7.0_17]
      at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) ~[na:1.7.0_17]
      at java.io.BufferedInputStream.read1(BufferedInputStream.java:275) ~[na:1.7.0_17]
      at java.io.BufferedInputStream.read(BufferedInputStream.java:334) ~[na:1.7.0_17]
      at java.io.DataInputStream.readFully(DataInputStream.java:195) ~[na:1.7.0_17]
      at org.jgroups.blocks.TCPConnectionMap$TCPConnection.readPeerAddress(TCPConnectionMap.java:495)
      at org.jgroups.blocks.TCPConnectionMap$TCPConnection.<init>(TCPConnectionMap.java:377)
      at org.jgroups.blocks.TCPConnectionMap$Acceptor.handleAccept(TCPConnectionMap.java:299)
      at org.jgroups.blocks.TCPConnectionMap$Acceptor.run(TCPConnectionMap.java:283)
      at java.lang.Thread.run(Thread.java:722) [na:1.7.0_17]

      After that, the two nodes behave as follows:

      node 1 sends discovery requests every 3 seconds:
      [2013-08-05 21:02:00,585] TRACE [TransferQueueBundler,global,_index-subscriber-node01] org.jgroups.protocols.TCPPING _index-subscriber-node01: sending discovery request to xxx.xxx.xxx.xxx:34383

      node 2 receives the request and sends a response:
      [2013-08-05 21:02:03,791] TRACE [OOB-2,global,_index-subscriber-node02] org.jgroups.protocols.TCPPING _index-subscriber-node02: received GET_MBRS_REQ from _index-subscriber-node01, sending response [PING: type=GET_MBRS_RSP, arg=_index-subscriber-node02, view_id=[_index-subscriber-node03|230], is_server=true, is_coord=false, logical_name=_index-subscriber-node02, physical_addrs=xxx.xxx.xxx.xxx:34383]

      But node 1 never receives any response and keeps sending a discovery request every 3 seconds.

      The node has to be restarted to restore functionality.

      Interestingly, I see many more similar exceptions, and in most cases functionality is restored automatically. Only a few of them break a node.
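
      For reference, the accept-side read timeout in the stack trace above can be illustrated in isolation with plain java.net sockets. This is only a minimal sketch, not JGroups code: the class name AcceptTimeoutSketch, the method acceptTimesOut, the 4-byte header and the 200 ms timeout are all assumptions made for illustration. It shows a peer that connects but sends nothing, so readFully on the accepted socket fails with SocketTimeoutException.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical stand-alone illustration; none of these names come from JGroups.
public class AcceptTimeoutSketch {

    // Accepts one connection and tries to read a 4-byte header from the peer
    // (a stand-in for the peer address read during accept).
    // Returns true if the read timed out, false if the header arrived.
    public static boolean acceptTimesOut(ServerSocket server, int readTimeoutMs) throws IOException {
        try (Socket client = server.accept()) {
            client.setSoTimeout(readTimeoutMs);   // enables "Read timed out"
            DataInputStream in = new DataInputStream(client.getInputStream());
            byte[] header = new byte[4];
            in.readFully(header);                 // blocks until 4 bytes arrive or the timeout fires
            return false;
        } catch (SocketTimeoutException e) {
            // The accepted socket is closed by try-with-resources,
            // so the acceptor can go on to the next connection.
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // The peer connects over loopback but never writes anything.
        try (ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
             Socket silentPeer = new Socket(server.getInetAddress(), server.getLocalPort())) {
            System.out.println("timed out: " + acceptTimesOut(server, 200));
        }
    }
}
```

      The sketch only reproduces the exception itself. It does not explain why, in a few cases, node 1 stops receiving discovery responses afterwards.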

            People

            Assignee:
            belaban Bela Ban
            Reporter:
            igormazur Igor Mazur (Inactive)
            Votes:
            0
            Watchers:
            3