[JGRP-1168] Gossip Router's multiple socket connections with the same TCPGossip cause an invalid node list

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • 2.10
    • 2.8, 2.9
    • None

      While testing the fix for JGRP-1164, we noticed there are still some cases where the Gossip Router (GR) may publish a wrong node list, causing node isolation (a join cannot happen if the coordinator is missing). Here is a scenario in which GR may publish a wrong node list:

      1) Node A (coordinator) is connected to the Gossip Router
      2) Node A times out while asking GR for members
      3) RouterStub.getMembers(..) throws an exception (Read Timed Out), which changes the state to DISCONNECTED
      4) connectionStateChanged(...) calls TCPGossip.connectionStatusChange(..), which calls RouterStub.destroy(...)
      5) RouterStub.destroy(..) sends the CLOSE message to the Gossip Router and then closes the socket connection
      6) TCPGossip starts the reconnector to make a new socket connection to the Gossip Router

      The problem is at step 5, as seen in the attached GR log (we added some custom tracing to the Gossip Router code to find the problem). The CLOSE message reaches GR after the reconnect has happened: in the attached trace, the handler-14 thread (a ConnectionHandler on GR) is the one that is supposed to be closed, but the handler-15 thread starts before handler-14 is stopped. As a result, the entry for Node A is removed when handler-14 receives the CLOSE, even though the socket connection served by handler-15 is still open, so the Gossip Router publishes a wrong node list (missing Node A).
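
The ordering problem can be replayed in isolation. The following is only a minimal sketch, assuming the router keeps one entry per node address and removes it on CLOSE without checking which handler received the message. `StaleCloseDemo`, `onConnect`, `onClose` and the string labels are hypothetical names for illustration, not the actual GossipRouter code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the router's node list; NOT the real GossipRouter.
final class StaleCloseDemo {
    // Assumption: one entry per node address, value = handler serving it.
    private final Map<String, String> handlerByAddress = new ConcurrentHashMap<>();

    void onConnect(String address, String handler) {
        handlerByAddress.put(address, handler);
    }

    // Removes by address only; the handler that received CLOSE is ignored.
    void onClose(String address, String handler) {
        handlerByAddress.remove(address);
    }

    List<String> members() {
        return new ArrayList<>(handlerByAddress.keySet());
    }

    // Replays the order seen in the trace: reconnect first, old CLOSE second.
    static List<String> replay() {
        StaleCloseDemo router = new StaleCloseDemo();
        router.onConnect("NodeA", "handler-14"); // original connection
        router.onConnect("NodeA", "handler-15"); // reconnect arrives early
        router.onClose("NodeA", "handler-14");   // delayed CLOSE of the old one
        return router.members();
    }

    public static void main(String[] args) {
        System.out.println("members after stale CLOSE: " + replay());
    }
}
```

In this replay the node list ends up empty even though the connection labelled handler-15 was never closed, which matches the symptom described above.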

      Note: We used WANem between Gossip Router and Node A to create random disconnects every 2-3 min. The disconnects would last for 30-45 seconds. There was also 10% packet loss.

      A Few Proposed Solutions
      -------------
      1) The Gossip Router shouldn't accept a new connection if a connection from that IP address already exists, or else it should remove the old connection and then create the new one. This guarantees a one-to-one relationship between a node and the Gossip Router.

      2) Instead of using the IP address, the Gossip Router can use some sort of ID for each connection handler in the node list map. This way we won't delete entries based on IP address, but on ID (such as a UUID).

      3) TCPGossip should wait for the acknowledgement of the CLOSE message (just like the handshake for CONNECT). Only once the CLOSE has either failed or succeeded should we start the reconnector. This can be done in conjunction with solution 1.
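
Solutions 1 and 2 can be combined in a small sketch: a new connection replaces the old entry for the same address, and a CLOSE removes the entry only if it still belongs to the connection that sent it. This is an illustration under assumptions, not the attached patch. `ConnectionTable`, `register` and `close` are hypothetical names, and the per-connection ID is assumed to be a `java.util.UUID`.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical node list keyed by address, guarded by a per-connection ID.
final class ConnectionTable {
    private final Map<String, UUID> connectionByAddress = new ConcurrentHashMap<>();

    // Solution 1 flavour: the new connection replaces any older one.
    // Returns the replaced ID (or null) so the caller can close that handler.
    UUID register(String address, UUID connectionId) {
        return connectionByAddress.put(address, connectionId);
    }

    // Solution 2 flavour: remove the entry only if this connection still
    // owns it. ConcurrentHashMap.remove(key, value) does this atomically.
    boolean close(String address, UUID connectionId) {
        return connectionByAddress.remove(address, connectionId);
    }

    Set<String> members() {
        return new TreeSet<>(connectionByAddress.keySet());
    }

    // Same ordering as in the trace: reconnect first, stale CLOSE second.
    static Set<String> replay() {
        ConnectionTable table = new ConnectionTable();
        UUID oldConnection = UUID.randomUUID();
        UUID newConnection = UUID.randomUUID();
        table.register("NodeA", oldConnection);
        table.register("NodeA", newConnection);
        table.close("NodeA", oldConnection); // stale CLOSE, ignored
        return table.members();
    }

    public static void main(String[] args) {
        System.out.println("members after stale CLOSE: " + replay());
    }
}
```

With the ID check in place, the late CLOSE from the old connection no longer removes Node A. Solution 3 (waiting for a CLOSE acknowledgement) would be a client-side change and is not covered by this sketch.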

        1. tcpgossip_trace.txt
          3 kB
        2. GR_trace.txt
          4 kB
        3. GR_Patch-1168.txt
          11 kB

              vblagoje Vladimir Blagojevic (Inactive)
              vivash vivek v (Inactive)