Loading...

Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: 4.1.8
Affects Version/s: 4.1.6
Labels:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

We have a test that starts 2 nodes in parallel (ConcurrentStartTest and it is randomly failing since we started using LOCAL_PING.

01:02:11,930 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: discovery took 3 ms, members: 1 rsps (0 coords) [done]
01:02:11,930 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-43694: discovery took 3 ms, members: 1 rsps (0 coords) [done]
01:02:11,931 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-43694: could not determine coordinator from rsps 1 rsps (0 coords) [done]
01:02:11,931 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: could not determine coordinator from rsps 1 rsps (0 coords) [done]
01:02:11,931 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: nodes to choose new coord from are: [Test-NodeB-43694, Test-NodeA-29550]
01:02:11,931 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-43694: nodes to choose new coord from are: [Test-NodeB-43694, Test-NodeA-29550]
01:02:11,931 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-43694: I (Test-NodeB-43694) am the first of the nodes, will become coordinator
01:02:11,931 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: I (Test-NodeA-29550) am not the first of the nodes, waiting for another client to become coordinator
01:02:11,932 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: discovery took 0 ms, members: 1 rsps (0 coords) [done]
01:02:11,932 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: could not determine coordinator from rsps 1 rsps (0 coords) [done]
01:02:11,932 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: nodes to choose new coord from are: [Test-NodeB-43694, Test-NodeA-29550]
... 
01:02:11,941 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: could not determine coordinator from rsps 1 rsps (0 coords) [done]
01:02:11,941 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: nodes to choose new coord from are: [Test-NodeB-43694, Test-NodeA-29550]
01:02:11,941 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: I (Test-NodeA-29550) am not the first of the nodes, waiting for another client to become coordinator
01:02:11,942 WARN  (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: too many JOIN attempts (10): becoming singleton
01:02:11,942 DEBUG (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: installing view [Test-NodeA-29550|0] (1) [Test-NodeA-29550]
01:02:11,977 DEBUG (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-43694: created cluster (first member). My view is [Test-NodeB-43694|0], impl is org.jgroups.protocols.pbcast.CoordGmsImpl
01:02:11,977 DEBUG (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: created cluster (first member). My view is [Test-NodeA-29550|0], impl is org.jgroups.protocols.pbcast.CoordGmsImpl

The problem seems to be that it takes longer for the coordinator to install the initial view and update LOCAL_PING's PingData then it takes the other node to retry the discovery process 10 times.

In some cases there is no retry, because one node starts slightly faster, but it's not yet coordinator when the 2nd node does its discovery, and both nodes decide they should be coordinator:

01:13:44,460 INFO  (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-5386: no members discovered after 3 ms: creating cluster as first member
01:13:44,463 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-51165: discovery took 1 ms, members: 1 rsps (0 coords) [done]
01:13:44,465 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-51165: could not determine coordinator from rsps 1 rsps (0 coords) [done]
01:13:44,465 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-51165: nodes to choose new coord from are: [Test-NodeB-51165, Test-NodeA-5386]
01:13:44,466 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-51165: I (Test-NodeB-51165) am the first of the nodes, will become coordinator
01:13:44,466 DEBUG (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-51165: installing view [Test-NodeB-51165|0] (1) [Test-NodeB-51165]
01:13:44,466 DEBUG (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-5386: installing view [Test-NodeA-5386|0] (1) [Test-NodeA-5386]

This second failure mode seems to go away if I move the discovery map access inside the synchronized block both in findMembers() and in down().

is related to

JGRP-2381 LOCAL_PING should update PingData.isCoord on view changes

Resolved

Details

Description

Attachments

Issue Links

Easy Agile Planning Poker

Activity

People

Dates