Bug
Resolution: Obsolete
Major
2.4.3
None
We have been having problems with our JGroups cluster. We have 3 nodes, A (172.20.177.13:36441), B (172.20.177.14:55150) and C (172.20.177.15:47943), with A as the coordinator. B begins to suspect C because the response to its are-you-alive message is not properly received (even though C has sent it).
These are B's traces:
2012-05-17 13:56:59,243 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 172.20.177.15:47943 (own address=172.20.177.14:55150)
2012-05-17 13:56:59,243 TRACE [org.jgroups.protocols.UDP] sending msg to 172.20.177.15:47943 (src=172.20.177.14:55150), headers are
2012-05-17 13:56:59,243 DEBUG [org.jgroups.protocols.FD] [172.20.177.14:55150]: received no heartbeat ack from 172.20.177.15:47943 for 17 times (340000 milliseconds), suspecting it
2012-05-17 13:56:59,243 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[172.20.177.15:47943]] to group
2012-05-17 13:56:59,243 TRACE [org.jgroups.protocols.UDP] sending msg to null (src=172.20.177.14:55150), headers are
2012-05-17 13:56:59,243 DEBUG [org.jgroups.protocols.FD] task done
2012-05-17 13:56:59,243 TRACE [org.jgroups.protocols.UDP] received (mcast) 137 bytes from 172.20.177.14:38864
2012-05-17 13:56:59,244 TRACE [org.jgroups.protocols.UDP] received (ucast) 105 bytes from 172.20.177.15:47943
2012-05-17 13:56:59,244 TRACE [org.jgroups.protocols.UDP] received (ucast) 105 bytes from 172.20.177.15:47943
You can see UDP messages (105 bytes) arriving from node C one millisecond after B sent its are-you-alive, yet the FD protocol reports that no heartbeat ack was received.
And these are C's traces:
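To make this check repeatable on longer traces, here is a minimal sketch that scans log lines like the ones above. It is not part of JGroups; the class and method names (FdTraceCheck, datagramsFromSuspectAfterSuspicion, millisPerTry) are hypothetical, and it assumes the log line wording shown in B's traces. It counts the datagrams logged as received from the suspected member after the FD suspicion line, and derives the interval per missed heartbeat from the figures FD prints.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a JGroups class: it only parses log text.
class FdTraceCheck {
    // Matches the FD suspicion line as worded in the traces above.
    private static final Pattern SUSPECT = Pattern.compile(
        "received no heartbeat ack from (\\S+) for (\\d+) times \\((\\d+) milliseconds\\)");
    // Matches the UDP receive line as worded in the traces above.
    private static final Pattern RECEIVED = Pattern.compile(
        "received \\((ucast|mcast)\\) (\\d+) bytes from (\\S+)");

    // Counts datagrams logged as coming from the suspected member
    // on lines that follow the FD suspicion line.
    static int datagramsFromSuspectAfterSuspicion(List<String> lines) {
        String suspected = null;
        int count = 0;
        for (String line : lines) {
            Matcher s = SUSPECT.matcher(line);
            if (s.find()) {
                suspected = s.group(1);
                continue;
            }
            Matcher r = RECEIVED.matcher(line);
            if (suspected != null && r.find() && r.group(3).equals(suspected)) {
                count++;
            }
        }
        return count;
    }

    // Divides the total silence reported by FD by the number of missed acks.
    static long millisPerTry(String suspicionLine) {
        Matcher s = SUSPECT.matcher(suspicionLine);
        if (!s.find()) {
            throw new IllegalArgumentException("not an FD suspicion line");
        }
        return Long.parseLong(s.group(3)) / Long.parseLong(s.group(2));
    }
}
```

Applied to B's traces, the suspicion line reports 17 missed acks over 340000 milliseconds, which works out to 20000 milliseconds per try, and two unicast datagrams from C are logged after the suspicion.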
2012-05-17 13:56:59,248 TRACE [org.jgroups.protocols.FD] received are-you-alive from 172.20.177.14:55150, sending response
2012-05-17 13:56:59,248 TRACE [org.jgroups.protocols.UDP] sending msg to 172.20.177.14:55150 (src=172.20.177.15:47943), headers are
2012-05-17 13:56:59,248 TRACE [org.jgroups.protocols.UDP] message is [dst: 224.1.2.3:45566, src: 172.20.177.14:55150 (2 headers), size = 0 bytes], headers are
{UDP=[channel_name=AxisPartition], FD=[FD: SUSPECT (suspected_mbrs=[172.20.177.15:47943], from=172.20.177.14:55150)]}
2012-05-17 13:56:59,248 TRACE [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[172.20.177.15:47943], from=172.20.177.14:55150)]
2012-05-17 13:56:59,248 WARN [org.jgroups.protocols.FD] I was suspected by 172.20.177.14:55150; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
2012-05-17 13:56:59,248 TRACE [org.jgroups.protocols.UDP] sending msg to 172.20.177.14:55150 (src=172.20.177.15:47943), headers are
2012-05-17 13:56:59,249 TRACE [org.jgroups.protocols.UDP] received (ucast) 116 bytes from 172.20.177.13:36441
2012-05-17 13:56:59,249 TRACE [org.jgroups.protocols.UDP] message is [dst: 172.20.177.15:47943, src: 172.20.177.13:36441 (2 headers), size = 0 bytes], headers are
2012-05-17 13:56:59,249 TRACE [org.jgroups.protocols.FD] received msg from 172.20.177.13:36441 (counts as ack)
2012-05-17 13:56:59,249 TRACE [org.jgroups.protocols.UDP] sending msg to 172.20.177.13:36441 (src=172.20.177.15:47943), headers are
You can see that C ignores the SUSPECT message and sends back a HEARTBEAT_ACK, but B does not process that ack.
Last, these are A's traces:
2012-05-17 13:56:59,244 TRACE [org.jgroups.protocols.UDP] received (mcast) 137 bytes from 172.20.177.14:38864
2012-05-17 13:56:59,244 TRACE [org.jgroups.protocols.UDP] message is [dst: 224.1.2.3:45566, src: 172.20.177.14:55150 (2 headers), size = 0 bytes], headers are
2012-05-17 13:56:59,244 TRACE [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[172.20.177.15:47943], from=172.20.177.14:55150)]
2012-05-17 13:56:59,244 TRACE [org.jgroups.protocols.VERIFY_SUSPECT] verifying that 172.20.177.15:47943 is dead
2012-05-17 13:56:59,244 TRACE [org.jgroups.protocols.UDP] sending msg to 172.20.177.15:47943 (src=172.20.177.13:36441), headers are
2012-05-17 13:56:59,245 TRACE [org.jgroups.protocols.UDP] received (ucast) 116 bytes from 172.20.177.15:47943
2012-05-17 13:56:59,245 TRACE [org.jgroups.protocols.UDP] message is [dst: 172.20.177.13:36441, src: 172.20.177.15:47943 (2 headers), size = 0 bytes], headers are
2012-05-17 13:56:59,245 TRACE [org.jgroups.protocols.VERIFY_SUSPECT] member 172.20.177.15:47943 is not dead !
2012-05-17 13:56:59,245 DEBUG [org.jgroups.protocols.FD] member is 172.20.177.15:47943
2012-05-17 13:56:59,245 DEBUG [org.jgroups.protocols.FD_SOCK] member is 172.20.177.15:47943
A detected that the suspicion was wrong and, consequently, the cluster keeps all three members. But B is not working properly: no message that the other nodes send to it is received. As a result, the cluster is losing messages and users are being affected.
Furthermore, because the cluster looks OK to the coordinator, there is no way to know that B is not working. I have reviewed every MBean related to the cluster, and all of them show the cluster as OK, with three members.
Is this a known issue?
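Since the membership view alone does not reveal the problem, one possible workaround is an application-level liveness check. The following is only a sketch under assumptions: PeerLivenessTracker, recordMessage and silentPeers are hypothetical names, not JGroups API, and the application would have to call recordMessage from its own receive path and pick a silence threshold that suits its traffic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical application-level helper, independent of JGroups.
class PeerLivenessTracker {
    // Last time (in milliseconds) an application message arrived from each peer.
    private final Map<String, Long> lastHeard = new ConcurrentHashMap<String, Long>();
    private final long maxSilenceMillis;

    PeerLivenessTracker(long maxSilenceMillis) {
        this.maxSilenceMillis = maxSilenceMillis;
    }

    // To be called by the application whenever it receives a message from a peer.
    void recordMessage(String peer, long nowMillis) {
        lastHeard.put(peer, nowMillis);
    }

    // Returns, in sorted order, the peers that have been silent
    // for longer than the configured threshold.
    List<String> silentPeers(long nowMillis) {
        List<String> silent = new ArrayList<String>();
        for (Map.Entry<String, Long> e : lastHeard.entrySet()) {
            if (nowMillis - e.getValue() > maxSilenceMillis) {
                silent.add(e.getKey());
            }
        }
        Collections.sort(silent);
        return silent;
    }
}
```

If every node periodically sends a small application message to the others, a node in B's state would see all of its peers listed by silentPeers even though the view still shows three members. Timestamps are passed in explicitly so the check can be tested without waiting on a real clock.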
Is there any way to detect that B is not working at all?
Thanks in advance.
Best Regards,
Pablo.