-
Bug
-
Resolution: Won't Do
-
Major
-
None
-
4.0.10
-
None
Hi,
We have a 15-node cluster. After upgrading JGroups from 3.4.3 to 4.0.10, the system is unstable and we keep getting the logs below:
2020-09-03 11:47:30.317 WARN org.jgroups.protocols.pbcast.GMS - vmc0198-27827: not member of view [vmc0208-48939|123]; discarding it
2020-09-03 11:47:32.316 WARN org.jgroups.protocols.pbcast.GMS - vmc0198-27827: failed to create view from delta-view; dropping view: java.lang.IllegalStateException: the view-id of the delta view ([vmc0208-48939|123]) doesn't match the current view-id ([vmc0208-48939|122]); discarding delta view [vmc0208-48939|124], ref-view=[vmc0208-48939|123], joined=[vmc0198-5504]
2020-09-03 11:47:32.323 WARN org.jgroups.protocols.pbcast.GMS - vmc0198-27827: not member of view [vmc0208-48939|124]; discarding it.
2020-09-03 11:49:07.160 WARN org.jgroups.protocols.pbcast.NAKACK2 - JGRP000011: vmc0198-63871: dropped message batch from non-member vmc0201-28703 (view=MergeView::[vmc0208-48939|140] (24) [ ***REMOVING MACHINE NAME AND PORT ***] ])
2020-09-03 11:49:07.160 WARN org.jgroups.protocols.pbcast.NAKACK2 - JGRP000011: vmc0198-23411: dropped message batch from non-member vmc0201-28703 (view=[***REMOVING MACHINE NAME AND PORT FOR CLEAR VIEW ***] .])
2020-09-05 16:16:07.380 DEBUG org.jgroups.protocols.FD_ALL - haven't received a heartbeat from vmc0201-55458 for 12541 ms, adding it to suspect list
2020-09-05 16:16:07.535 DEBUG org.jgroups.protocols.FD_SOCK - vmc0198-24881: failed connecting to vmc0204-45403: connect timed out
2020-09-05 16:16:07.536 DEBUG org.jgroups.protocols.FD_SOCK - vmc0198-24881: broadcasting suspect(vmc0204-45403)
2020-09-05 16:16:07.536 DEBUG org.jgroups.protocols.FD_SOCK - vmc0198-24881: pingable_mbrs=[***REMOVING MACHINE NAME AND PORT ***], ping_dest=vmc0204-54485
2020-09-05 16:16:08.513 DEBUG org.jgroups.protocols.pbcast.GMS - vmc0198-52842: installing view [ ***REMOVING MACHINE NAME AND PORT FOR CLEAR VIEW *** ]
2020-09-05 16:16:08.513 DEBUG org.jgroups.protocols.pbcast.GMS - vmc0198-24881: installing view [vmc0200-30543|2672] (184) [ ***REMOVING MACHINE NAME AND PORT FOR CLEAR VIEW *** ]
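The delta-view warning in the logs above says that a delta view was dropped because the view-id it refers to ([vmc0208-48939|123]) is not the view currently installed on that member ([vmc0208-48939|122]). A minimal sketch of that comparison, useful when scanning logs for such mismatches, is below. `LoggedViewId` and its methods are hypothetical helpers written for this report, not part of the JGroups API, and the pattern assumes view-ids are printed as `[creator|id]` as in the lines above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical log helper (not a JGroups class): a view-id as printed in the logs, e.g. "[vmc0208-48939|123]". */
final class LoggedViewId {
    // Matches "[creator|id]": creator is any text without brackets or '|', id is a number.
    private static final Pattern VIEW_ID = Pattern.compile("\\[([^\\[\\]|]+)\\|(\\d+)\\]");

    final String creator;
    final long id;

    LoggedViewId(String creator, long id) {
        this.creator = creator;
        this.id = id;
    }

    /** Returns the first view-id found in the text, or null if there is none. */
    static LoggedViewId parseFirst(String text) {
        Matcher m = VIEW_ID.matcher(text);
        return m.find() ? new LoggedViewId(m.group(1), Long.parseLong(m.group(2))) : null;
    }

    /** Assumption taken from the warning text: a delta view applies only if its ref-view equals the current view. */
    static boolean deltaApplies(LoggedViewId refView, LoggedViewId currentView) {
        return refView.creator.equals(currentView.creator) && refView.id == currentView.id;
    }

    public static void main(String[] args) {
        LoggedViewId ref = parseFirst("ref-view=[vmc0208-48939|123]");
        LoggedViewId current = parseFirst("current view-id ([vmc0208-48939|122])");
        System.out.println("delta applies: " + deltaApplies(ref, current)); // prints "delta applies: false"
    }
}
```

With the values from the warning, the check fails because the member is still on view 122 while the delta view was built on top of view 123.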
===================================
To isolate the issue, we created a small program for both JGroups 3.4.3 and JGroups 4.0.10.
Both applications take IP addresses and the number of channels as arguments. We have run both applications in the following matrix and collected view data and timings.
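The timing part of such a test can be sketched as follows. This is not the original program: `ViewConvergenceTimer` and its method names are hypothetical, and the sketch assumes each channel's receiver forwards the size of every installed view (for example `View.size()` from `ReceiverAdapter.viewAccepted(View)`) together with the current time.

```java
/** Hypothetical helper: measures how long a member needs to see a full view, and whether the view changes afterwards. */
final class ViewConvergenceTimer {
    private final int expectedMembers;
    private final long startMillis;
    private long millisToFullView = -1; // stays -1 until a full view has been seen
    private int changesAfterFullView;   // views installed after the first full view

    ViewConvergenceTimer(int expectedMembers, long startMillis) {
        this.expectedMembers = expectedMembers;
        this.startMillis = startMillis;
    }

    /** Call once per installed view with the view's size and the current time in milliseconds. */
    synchronized void onView(int viewSize, long nowMillis) {
        if (millisToFullView >= 0) {
            changesAfterFullView++; // any view after the full one counts as fluctuation
        } else if (viewSize == expectedMembers) {
            millisToFullView = nowMillis - startMillis;
        }
    }

    synchronized long millisToFullView() { return millisToFullView; }

    synchronized int changesAfterFullView() { return changesAfterFullView; }

    public static void main(String[] args) {
        // Simulated sequence: partial view, full view of 225 members, then one more view change.
        ViewConvergenceTimer timer = new ViewConvergenceTimer(225, 0L);
        timer.onView(120, 5_000L);
        timer.onView(225, 25_000L);
        timer.onView(224, 40_000L);
        System.out.println(timer.millisToFullView() + " ms, later changes: " + timer.changesAfterFullView());
        // prints "25000 ms, later changes: 1"
    }
}
```

Counting view changes after the first full view separates a slow but stable join from views that keep fluctuating.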
Below are the stats:
Number of members (number of nodes x number of channels)            | JGroups 3.4.3      | JGroups 4.0.10
225 (15x15) Simultaneous start                                      | 25 - 30 seconds*   | 15 minutes**
225 (15x15) Rolling start (view after 15th node start)              | 20 seconds*        | 10 minutes**
196 (14x14) Simultaneous start                                      | 25 seconds*        | 4 minutes**
169 (13x13) Simultaneous start                                      | 30 - 31 seconds*   | 7 minutes**
144 (12x12) Simultaneous start                                      | 27 seconds*        | 5 minutes**
121 (11x11) Simultaneous start                                      | 22 seconds*        | 2 minutes**
100 (10x10) Simultaneous start                                      | 20 seconds*        | 5 minutes**
...
...
9 to 49 channels, (3x3) to (7x7)                                    | almost immediate*  | almost immediate*
Note: Even after 15 minutes, the views are not stable; they keep fluctuating.
=======
Below are the protocols I use, with their properties:
Protocol[] protocolStack=
;
We also tried changing the following property values, but with no luck:
thread_pool_max_threads = 200 in UDP()
Default values of FD_ALL()