- Bug
- Resolution: Duplicate
- Major
- 4.2.21
In OpenShift, I'm simulating a GossipRouter crash by running oc delete <pod>.
What I've noticed is that the relay connection drops and never comes back. I think the following events are happening:
- A VERIFY_SUSPECT message is forwarded to the crashed GossipRouter.
- A new view is installed containing cluster1 only:
  installing view [_my-cluster-0-24934:pruivo1|2] (1) [_my-cluster-0-24934:pruivo1] (_my-cluster-0-50778:pruivo3 left)
- On cluster2, the view from cluster1 is discarded:
  [org.jgroups.protocols.pbcast.GMS] _my-cluster-0-50778:pruivo3: not member of view [_my-cluster-0-24934:pruivo1|2]; discarding it
Eventually, cluster2 installs a view containing only itself, but the relay connection never recovers:
  installing view [_my-cluster-0-50778:pruivo3|3] (1) [_my-cluster-0-50778:pruivo3] (_my-cluster-0-24934:pruivo1 left)
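To make the discard step concrete, here is a minimal sketch of the rule that the GMS log line suggests: a node drops any view that does not list it as a member. This is a simplified model and not the JGroups implementation; the class ViewMembershipSketch and the method accepts are hypothetical names introduced only for illustration, and the addresses are copied from the log lines above.

```java
import java.util.List;

// Simplified model for illustration only. This is NOT JGroups code;
// the class and method names are hypothetical.
final class ViewMembershipSketch {

    // Assumed rule, based on the "not member of view ...; discarding it" log line:
    // a node accepts a view only if the view lists the node itself as a member.
    static boolean accepts(String localAddress, List<String> viewMembers) {
        return viewMembers.contains(localAddress);
    }

    public static void main(String[] args) {
        String cluster1 = "_my-cluster-0-24934:pruivo1";
        String cluster2 = "_my-cluster-0-50778:pruivo3";

        // View |2 installed by cluster1 after it suspected cluster2.
        List<String> viewFromCluster1 = List.of(cluster1);
        // View |3 that cluster2 eventually installs for itself.
        List<String> viewFromCluster2 = List.of(cluster2);

        // cluster2 is not in cluster1's view, so it discards that view.
        System.out.println(accepts(cluster2, viewFromCluster1)); // false
        // cluster2 is in its own singleton view, so it installs it.
        System.out.println(accepts(cluster2, viewFromCluster2)); // true
    }
}
```

Under this model each site master ends up in its own singleton view, which matches the two "installing view" lines in the logs. Reconnecting the two views would then need a merge.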
The weird part is that MERGE3 never fixes the views.
After the GossipRouter is back online, it logs both site masters, but no merge ever happens and both clusters remain isolated:
  added _my-cluster-0-24934 (10.129.0.31:34348) to group xsite
  added _my-cluster-0-50778 (10.129.0.25:45384) to group xsite
- duplicates JGRP-2634 Add heartbeating to TUNNEL / GossipRouter (Resolved)