JGroups / JGRP-1182

GET_MBRS_RSP messages are not all processed; Discovery step ends prematurely.

    Details

      Description

      I launch 5 nodes (A, B, C, D, E) in quick succession, nearly simultaneously, on 5 hosts. All use the same protocol stack and a single channel to communicate with each other.

      UDP(mcast_addr=231.8.8.8;mcast_port=45578):PING(num_initial_members=5;timeout=800):MERGE2:FD:VERIFY_SUSPECT:pbcast.NAKACK:pbcast.STABLE:FRAG2:pbcast.GMS:pbcast.FLUSH

      Discovery sends up to n GET_MBRS_REQ messages to discover the members. Each GET_MBRS_REQ triggers a round of GET_MBRS_RSP messages, which increase the initial-member count in the Promise that blocks discovery, up to its limit (num_initial_members). One round of GET_MBRS_RSP may not be enough to discover all the members, in which case the second round completes the Promise's count. Depending on the order in which the responses are received, however, the Promise condition may be signalled before all of them have been processed, and the unprocessed responses may come from a coordinator elected between the two requests. That leads to trouble.

      Example:
      A B C D E are launched
      ...
      D sends GET_MBRS_REQ
      D receives 4 GET_MBRS_RSP from D A B C
      A becomes coordinator
      D sends a second GET_MBRS_REQ, 400 ms after the first
      D receives GET_MBRS_RSP from B
      D receives GET_MBRS_RSP from E and reaches the initial_members count; discovery ends after 428 ms
      D receives GET_MBRS_RSP from A, which is now coordinator, but it is too late: the response is not counted in the set of responses
      D becomes coordinator.

      We have two coordinators.

      This can also happen if E is quicker and is part of the first round of responses.
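
      The sequence above can be replayed against a small model of a count-based response collector. This is only a sketch: `CountingCollector` and `DiscoveryRaceDemo` are hypothetical names, not JGroups classes, and the model assumes responses are keyed by sender (so the second response from B does not raise the count) and that anything arriving after the count is reached is discarded.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the response set behind the discovery Promise.
// It closes as soon as `expected` distinct senders have answered.
final class CountingCollector {
    private final int expected;
    // sender -> "claims to be coordinator" flag from its latest counted response
    private final Map<String, Boolean> responses = new LinkedHashMap<>();
    private boolean done;

    CountingCollector(int expected) {
        this.expected = expected;
    }

    // Returns true if the response was counted, false if it arrived too late.
    synchronized boolean add(String sender, boolean isCoord) {
        if (done) {
            return false; // discovery already ended: the response is dropped
        }
        responses.put(sender, isCoord); // a repeat sender replaces its old entry
        if (responses.size() >= expected) {
            done = true; // the point where the Promise would be signalled
        }
        return true;
    }

    synchronized boolean isDone() {
        return done;
    }

    synchronized boolean sawCoordinator() {
        return responses.containsValue(Boolean.TRUE);
    }
}

final class DiscoveryRaceDemo {
    // Replays the responses D receives in the example, in the same order.
    static CountingCollector replay() {
        CountingCollector d = new CountingCollector(5); // num_initial_members=5
        // First round: nobody is coordinator yet.
        for (String sender : new String[] {"D", "A", "B", "C"}) {
            d.add(sender, false);
        }
        // A becomes coordinator here; D sends its second GET_MBRS_REQ.
        d.add("B", false); // repeat sender, count stays at 4
        d.add("E", false); // fifth distinct sender, discovery ends
        d.add("A", true);  // A's coordinator response arrives too late
        return d;
    }

    public static void main(String[] args) {
        CountingCollector d = replay();
        System.out.println("done=" + d.isDone()
                + ", sawCoordinator=" + d.sawCoordinator());
    }
}
```

      In this model D finishes discovery without ever having seen a response flagged as coordinator, which matches the situation in which D goes on to become coordinator itself.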

      I am not yet sure how to solve this problem. Clearly, D should have been told that A was becoming coordinator, or at least trying to.
      Perhaps if all GET_MBRS traffic were multicast, each new member could listen in on it and work out from the various REQ and RSP messages who is doing what.

      I could see discovery being split into two phases: a first phase in which a new member "silently" listens to the network, then a second in which it actively tries to discover the other members with several GET_MBRS_REQ messages.
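
      As a rough illustration of the two-phase idea, the sketch below keeps what a new member overhears separately from what it learns through its own requests. Everything here is an assumption made for illustration: `TwoPhaseDiscovery` and its methods are hypothetical, not part of JGroups, and the sketch presumes GET_MBRS traffic is multicast so that it can be overheard at all. It does not claim to close the race described above.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of two-phase discovery: listen passively, then ask.
final class TwoPhaseDiscovery {
    // Members seen in overheard GET_MBRS_REQ/GET_MBRS_RSP traffic (phase 1).
    private final Set<String> overheardMembers = new LinkedHashSet<>();
    // Last overheard sender that claimed to be coordinator, or null if none.
    private String overheardCoordinator;

    // Phase 1: called for every GET_MBRS message seen while listening silently.
    void onOverheard(String sender, boolean claimsCoordinator) {
        overheardMembers.add(sender);
        if (claimsCoordinator) {
            overheardCoordinator = sender;
        }
    }

    // Phase 2: combine the result of the member's own GET_MBRS_REQ rounds
    // (null if no response named a coordinator) with what was overheard.
    // Returns the coordinator to join, or null if none is known at all.
    String resolveCoordinator(String activeCoordinator) {
        return activeCoordinator != null ? activeCoordinator : overheardCoordinator;
    }

    Set<String> overheardMembers() {
        return new LinkedHashSet<>(overheardMembers);
    }
}
```

      With this shape, a member whose active rounds name no coordinator can still fall back on a coordinator claim it overheard, rather than deciding only from the counted responses.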

              People

              • Assignee: Bela Ban (belaban)
              • Reporter: Renaud Devarieux (rddx)