JGroups / JGRP-1243

FD_SOCK: reduce number of messages sent on a suspicion

    • Type: Feature Request
    • Resolution: Done
    • Priority: Major
    • 2.11
    • None
    • None

      • When B suspects C, B multicasts a SUSPECT(C) message
      • Everyone receives the SUSPECT(C) message and passes it up and down the stack as a SUSPECT(C) event
      • VERIFY_SUSPECT on every member sends one (or more) ARE_YOU_DEAD messages to C
      • C replies to each sender with an I_AM_NOT_DEAD message, or does not reply at all if it has crashed
      • However, only the coordinator (or the next-in-line) actually processes the SUSPECT(C) event in GMS!
        --> The VERIFY_SUSPECT processing on every member other than the coord or next-in-line is superfluous!

      The number of messages used for a false suspicion in a cluster of N members is at least (1 SUSPECT mcast) + ((N-1) ARE_YOU_DEAD unicasts) + ((N-1) I_AM_NOT_DEAD unicasts), i.e. 2N-1 messages!
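To make the cost above concrete, here is a minimal sketch that evaluates the formula for a few cluster sizes. `SuspicionCost`, `current` and `restricted` are hypothetical names used only for illustration and are not part of JGroups. The sketch assumes exactly one ARE_YOU_DEAD per verifying member (the description allows more) and that the SUSPECT message is still a single multicast.

```java
// Hypothetical helper, not part of JGroups: counts the messages caused by
// one false suspicion, assuming one ARE_YOU_DEAD per verifying member.
final class SuspicionCost {

    // Current behaviour: 1 SUSPECT multicast, then every member except the
    // suspect sends one ARE_YOU_DEAD and receives one I_AM_NOT_DEAD.
    static int current(int n) {
        if (n < 2) throw new IllegalArgumentException("need at least 2 members");
        return 1 + (n - 1) + (n - 1);
    }

    // Assumed variant: after the multicast, only `verifiers` members
    // (for example the coord and the next-in-line) run VERIFY_SUSPECT.
    static int restricted(int n, int verifiers) {
        if (n < 2) throw new IllegalArgumentException("need at least 2 members");
        if (verifiers < 0) throw new IllegalArgumentException("verifiers must be >= 0");
        int v = Math.min(verifiers, n - 1); // cannot exceed the members left
        return 1 + v + v;
    }

    public static void main(String[] args) {
        for (int n : new int[] {5, 10, 100}) {
            System.out.printf("N=%d: current=%d, restricted(2)=%d%n",
                              n, current(n), restricted(n, 2));
        }
    }
}
```

Under these assumptions the current scheme grows linearly with N (9 messages for N=5, 199 for N=100), while limiting verification to two members keeps the count at 5 regardless of cluster size.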

      SOLUTION:

      • The SUSPECT(C) message could be sent as a unicast only to the coordinator and the next-in-line member. Maybe we could use a max_rank=2 for this, similar to the suggested solution for FD_ALL? This would be good for non-multicast-based transports, e.g. TCP
      • Alternatively, the SUSPECT(C) message is still multicast to everyone, but only the coord and next-in-line start the VERIFY_SUSPECT processing
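The second option amounts to a rank check on the receiving side. The following is a sketch only: `VerifyGate` and `shouldVerify` are hypothetical names, members are plain strings rather than JGroups addresses, and the rank is assumed to be the position in the view after skipping members that are already suspected.

```java
import java.util.List;
import java.util.Set;

// Hypothetical receiver-side check, not the JGroups implementation.
final class VerifyGate {

    // Returns true if `self` is among the first `maxRank` members of the view
    // that are not suspected (rank 1 = coord, rank 2 = next-in-line).
    static boolean shouldVerify(List<String> view, Set<String> suspected,
                                String self, int maxRank) {
        int rank = 0;
        for (String member : view) {
            if (suspected.contains(member)) continue; // suspects hold no rank
            rank++;
            if (rank > maxRank) return false;         // self is ranked too low
            if (member.equals(self)) return true;
        }
        return false; // self is not in the view, or is itself suspected
    }

    public static void main(String[] args) {
        List<String> view = List.of("A", "B", "C", "D", "E");
        Set<String> suspected = Set.of("C");
        for (String member : view) {
            System.out.println(member + " verifies: "
                    + shouldVerify(view, suspected, member, 2));
        }
    }
}
```

With the view {A,B,C,D,E} and C suspected, only A and B would start verification; D and E would still receive the multicast but skip the ARE_YOU_DEAD round.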

      Issue: if we have {A,B,C,D,E}, what happens if A, B and C crash at the same time?

      • E's connection to A closes: E sends a SUSPECT(A) to B and C (the first two members, excluding the suspected A)
        --> B and C are dead and won't process the message!
      • Then E suspects B and sends a SUSPECT(A,B) to C and D (excluding suspected A and B)
      • C adds A and B to its suspect list and finds out it is the next-in-line
      • C then runs the VERIFY_SUSPECT protocol
      • C passes the SUSPECT(A,B) event up the stack
      • C becomes the new coord
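The recipient selection used in the steps above (the first two members of the view that are not suspected) can be sketched as follows. `SuspectTargets` and `targets` are hypothetical names, members are plain strings, and `maxRank` corresponds to the max_rank=2 idea from the proposed solution; this is an illustration under those assumptions, not the JGroups code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sender-side selection, not the JGroups implementation.
final class SuspectTargets {

    // Picks the first `maxRank` members of the view that are not suspected;
    // these are the members a SUSPECT message would be unicast to.
    static List<String> targets(List<String> view, Set<String> suspected, int maxRank) {
        List<String> result = new ArrayList<>();
        for (String member : view) {
            if (result.size() == maxRank) break;
            if (!suspected.contains(member)) result.add(member);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> view = List.of("A", "B", "C", "D", "E");
        System.out.println(targets(view, Set.of("A"), 2));      // [B, C]
        System.out.println(targets(view, Set.of("A", "B"), 2)); // [C, D]
    }
}
```

The two calls reproduce the recipient sets of SUSPECT(A) and SUSPECT(A,B) in the scenario. Because the selection only skips members that are already suspected, a recipient that has crashed but is not yet suspected can still be chosen, which is why E has to resend with a growing suspect list.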

              rhn-engineering-bban Bela Ban