I launch 5 nodes A B C D E in quick succession (nearly simultaneously) on 5 hosts, using the same protocol stack and one channel for them to communicate with each other.
UDP(mcast_addr=231.8.8.8;mcast_port=45578):PING(num_initial_members=5;timeout=800):MERGE2:FD:VERIFY_SUSPECT:pbcast.NAKACK:pbcast.STABLE:FRAG2:pbcast.GMS:pbcast.FLUSH
Discovery sends up to n GET_MBRS_REQ messages to discover the members. Each GET_MBRS_REQ triggers a round of GET_MBRS_RSP responses, which increases the initial_members count up to the limit set in the Promise blocking the discovery. One GET_MBRS_RSP round may not be sufficient to discover all the members; a second RSP round then completes the Promise's count. However, depending on the order in which the RSPs are received, the Promise condition may be signalled before all RSPs are processed, and the unprocessed RSPs may belong to a coordinator elected between the two REQs => trouble.
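To make the race concrete, here is a minimal sketch of the counting behaviour as I understand it, using a plain wait/notify latch in place of the Promise; the class and method names are mine, not the actual JGroups Discovery/Promise code:

import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the discovery response collection (hypothetical names).
class DiscoverySketch {
    private final List<String> rsps = new ArrayList<>();   // senders of collected GET_MBRS_RSP
    private final int numInitialMembers;

    DiscoverySketch(int numInitialMembers) { this.numInitialMembers = numInitialMembers; }

    // Called for every GET_MBRS_RSP received, across both REQ rounds.
    synchronized void onRsp(String sender, boolean isCoord) {
        rsps.add(sender);
        // The waiter is released as soon as the count is reached, regardless of
        // whether the elected coordinator (A in the example) has responded yet.
        if (rsps.size() >= numInitialMembers)
            notifyAll();
    }

    // The discovery thread blocks here until enough RSPs arrived or the timeout expires.
    synchronized List<String> waitForRsps(long timeout) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeout;
        while (rsps.size() < numInitialMembers) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0)
                break;
            wait(remaining);
        }
        return new ArrayList<>(rsps);   // snapshot; any RSP arriving after this point
                                        // (A's coordinator response in the example) is never looked at
    }
}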
Example:
A B C D E are launched
...
D sends GET_MBRS_REQ
D receives 4 GET_MBRS_RSP from D A B C
A becomes coordinator
D sends GET_MBRS_REQ 400ms after the first
D receives B's GET_MBRS_RSP
D receives E's GET_MBRS_RSP and meets the discovery initial_members count. Discovery ends in 428ms.
D receives A's GET_MBRS_RSP (A is coordinator), but it's too late: it won't be counted in the set of responses.
D becomes coordinator.
We have two coordinators.
It may also happen if E is quicker and is part of the first RSP round.
I am not yet sure how to solve this problem. Obviously D should have been warned that A was becoming coordinator, or at least that A was trying to.
Perhaps if all the GET_MBRS traffic were multicast, each new member could spy on it and, based on the different REQ and RSP messages, try to figure out who is doing what.
I could see discovery split into two phases: one phase where a new member would "silently" listen to the network, followed by a phase where it actively tries to discover the other members with several GET_MBRS_REQ.
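Roughly something like this sketch; all names and durations are hypothetical and only meant to illustrate the two phases, not actual JGroups code:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Hypothetical two-phase discovery: listen silently first, then probe actively.
class TwoPhaseDiscoverySketch {
    private final Set<String> observed = ConcurrentHashMap.newKeySet(); // members seen on the wire
    private volatile boolean coordinatorSeen;

    // Phase 1 feed: called for every multicast GET_MBRS_REQ/RSP we overhear.
    void onTraffic(String sender, boolean claimsCoordinator) {
        observed.add(sender);
        if (claimsCoordinator)
            coordinatorSeen = true;
    }

    // Phase 1 then phase 2: stay silent for listenMillis, then send our own REQs.
    Set<String> discover(long listenMillis, int rounds) throws InterruptedException {
        TimeUnit.MILLISECONDS.sleep(listenMillis);     // phase 1: silent listening
        if (coordinatorSeen)
            return observed;                           // someone is already in charge: just join it
        for (int i = 0; i < rounds; i++) {             // phase 2: active discovery
            sendGetMembersRequest();
            TimeUnit.MILLISECONDS.sleep(400);          // spacing similar to the 400ms in the trace
        }
        return observed;
    }

    private void sendGetMembersRequest() { /* multicast a GET_MBRS_REQ; left abstract */ }
}

The silent phase would of course only help if the GET_MBRS traffic is actually multicast, which is the point above.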
You're right, I had overlooked UNICAST. I wasn't aware that part of the service traffic was unicast rather than all multicast. I should put it back.
I have thought about it a bit and came up with a few ideas.
1. We could wipe the set of GET_MBRS_RSP prior to sending a new GET_MBRS_REQ. It would improve things a lot, but not completely solve the problem.
2. GET_MBRS_RSP could carry the number of GET_MBRS_REQ its sender has already sent; in the example, D would then know that A, B and C have already sent GET_MBRS_REQ and are perhaps trying to elect themselves, so D should wait (sketched after this list).
3. Increase the frequency of GET_MBRS_REQ so that it acts as a heartbeat with period p for the duration of the discovery. Every node waits up to 2p before starting to send its own GET_MBRS_REQ, but responds to incoming GET_MBRS_REQ. Obviously, if you reply to one, you won't send any yourself and will wait for the other node to do the job and install the view.
A kind of priority must be assigned to each node to break ties in case, following the same logic, several nodes start to speak at the same time after the 2p silence (see the sketches after this list).
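For idea 2, a tiny sketch of the extra information a response could carry; this is a made-up class, not an existing JGroups header:

import java.io.Serializable;

// Hypothetical extra payload for GET_MBRS_RSP (idea 2).
class MbrsRspInfoSketch implements Serializable {
    final int reqsAlreadySent;      // how many GET_MBRS_REQ rounds the responder has issued
    final boolean isCoordinator;    // the responder already claims coordinatorship

    MbrsRspInfoSketch(int reqsAlreadySent, boolean isCoordinator) {
        this.reqsAlreadySent = reqsAlreadySent;
        this.isCoordinator = isCoordinator;
    }

    // The receiving node (D in the example) keeps waiting instead of installing its
    // own view when other nodes are ahead of it in discovery or already coordinator.
    static boolean shouldWait(Iterable<MbrsRspInfoSketch> rsps, int myReqsSent) {
        for (MbrsRspInfoSketch info : rsps)
            if (info.isCoordinator || info.reqsAlreadySent > myReqsSent)
                return true;
        return false;
    }
}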
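And for idea 3, a possible shape of the heartbeat logic with the 2p silence window and the priority tie-break; every name here, and the choice of a random priority, are assumptions of mine, not existing code:

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of idea 3: GET_MBRS_REQ as a discovery heartbeat.
class HeartbeatDiscoverySketch {
    private final long p;                                  // heartbeat period in ms
    private final long priority;                           // lower value wins ties
    private volatile long lastForeignReq;                  // last time we heard someone else's REQ
    private volatile long lowestForeignPriority = Long.MAX_VALUE;

    HeartbeatDiscoverySketch(long periodMillis) {
        this.p = periodMillis;
        this.priority = ThreadLocalRandom.current().nextLong(Long.MAX_VALUE);
        this.lastForeignReq = System.currentTimeMillis(); // treat startup as "just heard someone",
                                                          // so we stay quiet for the first 2p
    }

    // Called when another node's GET_MBRS_REQ heartbeat arrives: always answer it.
    void onForeignReq(long senderPriority) {
        lastForeignReq = System.currentTimeMillis();
        lowestForeignPriority = Math.min(lowestForeignPriority, senderPriority);
        sendGetMembersResponse();
    }

    // Discovery loop (in a real implementation it would stop once a view is installed).
    void run() throws InterruptedException {
        boolean speaking = false;
        while (true) {
            TimeUnit.MILLISECONDS.sleep(p);
            boolean heardSomeone = System.currentTimeMillis() - lastForeignReq < 2 * p;
            if (!speaking) {
                if (heardSomeone)
                    continue;              // someone else is driving discovery: stay quiet
                speaking = true;           // 2p of silence: our turn to heartbeat
            } else if (heardSomeone && lowestForeignPriority < priority) {
                speaking = false;          // simultaneous start and the other node wins the tie
                continue;
            }
            sendGetMembersRequest();       // our heartbeat REQ, carrying our priority
        }
    }

    private void sendGetMembersRequest()  { /* multicast GET_MBRS_REQ with our priority */ }
    private void sendGetMembersResponse() { /* reply with a GET_MBRS_RSP */ }
}

The random priority is just one way to break ties; the node's address or startup timestamp would work as well.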
Thanks for the advice, I agree it may help because of break_on_coord_rsp. I'll reopen the issue if I manage to come up with some Java code for idea 3.