JGroups / JGRP-1467

Synchronization between FD and UDP protocols

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Out of Date
    • Affects Version/s: 2.4.3
    • Fix Version/s: 3.1
    • Labels:
      None

      Description

      We've been suffering from problems with our JGroups cluster. We have 3 nodes, A (172.20.177.13:36441), B (172.20.177.14:55150) and C (172.20.177.15:47943), with A as coordinator. B begins to suspect C because, according to B, the ack for its are-you-alive message is never received (but C has sent it!).

      These are B's traces:

      2012-05-17 13:56:59,243 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 172.20.177.15:47943 (own address=172.20.177.14:55150)
      2012-05-17 13:56:59,243 TRACE [org.jgroups.protocols.UDP] sending msg to 172.20.177.15:47943 (src=172.20.177.14:55150), headers are

      {FD=[FD: heartbeat], UDP=[channel_name=AxisPartition]}

      2012-05-17 13:56:59,243 DEBUG [org.jgroups.protocols.FD] [172.20.177.14:55150]: received no heartbeat ack from 172.20.177.15:47943 for 17 times (340000 milliseconds), suspecting it
      2012-05-17 13:56:59,243 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[172.20.177.15:47943]] to group
      2012-05-17 13:56:59,243 TRACE [org.jgroups.protocols.UDP] sending msg to null (src=172.20.177.14:55150), headers are

      {FD=[FD: SUSPECT (suspected_mbrs=[172.20.177.15:47943], from=172.20.177.14:55150)], UDP=[channel_name=AxisPartition]}

      2012-05-17 13:56:59,243 DEBUG [org.jgroups.protocols.FD] task done
      2012-05-17 13:56:59,243 TRACE [org.jgroups.protocols.UDP] received (mcast) 137 bytes from 172.20.177.14:38864
      2012-05-17 13:56:59,244 TRACE [org.jgroups.protocols.UDP] received (ucast) 105 bytes from 172.20.177.15:47943
      2012-05-17 13:56:59,244 TRACE [org.jgroups.protocols.UDP] received (ucast) 105 bytes from 172.20.177.15:47943

      You can see UDP messages (105 bytes) arriving from node C one millisecond after B sent its are-you-alive message, yet the FD protocol reports that no heartbeat ack was received.

      And these are C's traces:
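B's trace logs the are-you-alive ping and the suspicion in the same millisecond. That would be consistent with a detector that decides on accumulated misses rather than on the reply to the ping it has just sent. The following is a minimal sketch of such a counter. It is a simplified model written for illustration, not the JGroups FD source. The class and field names are hypothetical, and the 20000 ms interval is only inferred from the log line (340000 ms / 17 tries).

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified heartbeat failure detector model (hypothetical, NOT JGroups FD). */
class HeartbeatMonitorSketch {
    private final long timeoutMs;  // assumed interval between checks
    private final int maxTries;    // assumed number of misses before suspecting
    private long lastAckMs;        // time the last ack was processed
    private int missed;            // consecutive intervals without an ack
    final List<String> events = new ArrayList<>();

    HeartbeatMonitorSketch(long timeoutMs, int maxTries, long startMs) {
        this.timeoutMs = timeoutMs;
        this.maxTries = maxTries;
        this.lastAckMs = startMs;
    }

    /** Called when an ack from the monitored member is actually processed. */
    void onAck(long nowMs) {
        lastAckMs = nowMs;
        missed = 0;
    }

    /**
     * Called once per interval: records a ping, then judges the PAST interval.
     * Returns true when the member is suspected. Without an ack, later ticks
     * keep returning true.
     */
    boolean tick(long nowMs) {
        events.add("ping@" + nowMs);
        if (nowMs - lastAckMs < timeoutMs) {
            return false;              // an ack was seen recently enough
        }
        missed++;
        if (missed >= maxTries) {
            events.add("suspect@" + nowMs);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        HeartbeatMonitorSketch fd = new HeartbeatMonitorSketch(20000L, 17, 0L);
        for (int i = 1; i <= 17; i++) {
            boolean suspected = fd.tick(i * 20000L);
            if (suspected) {
                System.out.println("suspected after " + (i * 20000L) + " ms");
            }
        }
        System.out.println(fd.events.subList(fd.events.size() - 2, fd.events.size()));
    }
}
```

In this model the ping and the suspicion share a timestamp, as in B's log. Acks that reach the transport but are never handed to `onAck` do not reset the counter.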

      2012-05-17 13:56:59,248 TRACE [org.jgroups.protocols.FD] received are-you-alive from 172.20.177.14:55150, sending response
      2012-05-17 13:56:59,248 TRACE [org.jgroups.protocols.UDP] sending msg to 172.20.177.14:55150 (src=172.20.177.15:47943), headers are

      {FD=[FD: heartbeat ack], UDP=[channel_name=AxisPartition]}

      2012-05-17 13:56:59,248 TRACE [org.jgroups.protocols.UDP] message is [dst: 224.1.2.3:45566, src: 172.20.177.14:55150 (2 headers), size = 0 bytes], headers are

      {UDP=[channel_name=AxisPartition], FD=[FD: SUSPECT (suspected_mbrs=[172.20.177.15:47943], from=172.20.177.14:55150)]}

      2012-05-17 13:56:59,248 TRACE [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[172.20.177.15:47943], from=172.20.177.14:55150)]
      2012-05-17 13:56:59,248 WARN [org.jgroups.protocols.FD] I was suspected by 172.20.177.14:55150; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
      2012-05-17 13:56:59,248 TRACE [org.jgroups.protocols.UDP] sending msg to 172.20.177.14:55150 (src=172.20.177.15:47943), headers are

      {FD=[FD: heartbeat ack], UDP=[channel_name=AxisPartition]}

      2012-05-17 13:56:59,249 TRACE [org.jgroups.protocols.UDP] received (ucast) 116 bytes from 172.20.177.13:36441
      2012-05-17 13:56:59,249 TRACE [org.jgroups.protocols.UDP] message is [dst: 172.20.177.15:47943, src: 172.20.177.13:36441 (2 headers), size = 0 bytes], headers are

      {UDP=[channel_name=AxisPartition], VERIFY_SUSPECT=[VERIFY_SUSPECT: ARE_YOU_DEAD]}

      2012-05-17 13:56:59,249 TRACE [org.jgroups.protocols.FD] received msg from 172.20.177.13:36441 (counts as ack)
      2012-05-17 13:56:59,249 TRACE [org.jgroups.protocols.UDP] sending msg to 172.20.177.13:36441 (src=172.20.177.15:47943), headers are

      {VERIFY_SUSPECT=[VERIFY_SUSPECT: I_AM_NOT_DEAD], UDP=[channel_name=AxisPartition]}

      You can see that C ignores the SUSPECT message and sends back a heartbeat ack, but B never processes that ack.

      Finally, these are A's traces:

      2012-05-17 13:56:59,244 TRACE [org.jgroups.protocols.UDP] received (mcast) 137 bytes from 172.20.177.14:38864
      2012-05-17 13:56:59,244 TRACE [org.jgroups.protocols.UDP] message is [dst: 224.1.2.3:45566, src: 172.20.177.14:55150 (2 headers), size = 0 bytes], headers are

      {UDP=[channel_name=AxisPartition], FD=[FD: SUSPECT (suspected_mbrs=[172.20.177.15:47943], from=172.20.177.14:55150)]}

      2012-05-17 13:56:59,244 TRACE [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[172.20.177.15:47943], from=172.20.177.14:55150)]
      2012-05-17 13:56:59,244 TRACE [org.jgroups.protocols.VERIFY_SUSPECT] verifying that 172.20.177.15:47943 is dead
      2012-05-17 13:56:59,244 TRACE [org.jgroups.protocols.UDP] sending msg to 172.20.177.15:47943 (src=172.20.177.13:36441), headers are

      {VERIFY_SUSPECT=[VERIFY_SUSPECT: ARE_YOU_DEAD], UDP=[channel_name=AxisPartition]}

      2012-05-17 13:56:59,245 TRACE [org.jgroups.protocols.UDP] received (ucast) 116 bytes from 172.20.177.15:47943
      2012-05-17 13:56:59,245 TRACE [org.jgroups.protocols.UDP] message is [dst: 172.20.177.13:36441, src: 172.20.177.15:47943 (2 headers), size = 0 bytes], headers are

      {UDP=[channel_name=AxisPartition], VERIFY_SUSPECT=[VERIFY_SUSPECT: I_AM_NOT_DEAD]}

      2012-05-17 13:56:59,245 TRACE [org.jgroups.protocols.VERIFY_SUSPECT] member 172.20.177.15:47943 is not dead !
      2012-05-17 13:56:59,245 DEBUG [org.jgroups.protocols.FD] member is 172.20.177.15:47943
      2012-05-17 13:56:59,245 DEBUG [org.jgroups.protocols.FD_SOCK] member is 172.20.177.15:47943

      A detected that the suspicion was wrong, and consequently the cluster keeps all three members. But B is not working properly: none of the messages that the other nodes send to it are received. As a result, the cluster is losing messages and users are being affected.

      Furthermore, because the coordinator considers the cluster healthy, there is no way to know that B is not working. I have reviewed every MBean related to the cluster, and all of them report a healthy cluster with three members.

      Is this a known issue?
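Since the membership view and the MBeans all look healthy, one possible workaround is an application-level check on each node that tracks when a message from each peer was last delivered to the application. The sketch below is only an illustration under that assumption. The class name, method names and the silence limit are hypothetical, and the wiring into the application's JGroups receive callback is not shown.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-node watchdog: flags a node that stops receiving from all peers. */
class InboundWatchdogSketch {
    private final long silenceLimitMs;                   // assumed tolerated silence
    private final Map<String, Long> lastSeen = new HashMap<>();

    InboundWatchdogSketch(long silenceLimitMs) {
        this.silenceLimitMs = silenceLimitMs;
    }

    /** Register a peer when it appears in the membership view. */
    void addPeer(String peer, long nowMs) {
        lastSeen.put(peer, nowMs);
    }

    /** Call from the application's receive path for every delivered message. */
    void onMessage(String peer, long nowMs) {
        lastSeen.put(peer, nowMs);
    }

    /** True when every known peer has been silent for longer than the limit. */
    boolean looksDeaf(long nowMs) {
        if (lastSeen.isEmpty()) {
            return false;                                // nobody to hear from
        }
        for (long seen : lastSeen.values()) {
            if (nowMs - seen <= silenceLimitMs) {
                return false;                            // at least one peer is audible
            }
        }
        return true;
    }

    public static void main(String[] args) {
        InboundWatchdogSketch b = new InboundWatchdogSketch(30000L);
        b.addPeer("A", 0L);
        b.addPeer("C", 0L);
        b.onMessage("A", 20000L);
        System.out.println("deaf at 45 s? " + b.looksDeaf(45000L));
        System.out.println("deaf at 60 s? " + b.looksDeaf(60000L));
    }
}
```

This only helps if the peers are expected to send application traffic regularly. In an otherwise idle cluster a periodic application-level message would be needed as well.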
      Is there any way to detect that B is not working at all?

      Thanks in advance.
      Best Regards,
      Pablo.

            People

            Assignee:
            belaban Bela Ban
            Reporter:
            vgedelivery Pablo Estebanez (Inactive)