JGRP-235 (project: JGroups)

Messages sent directly after joining are occasionally lost under high load


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version: 2.3
    • Affects Versions: 2.2.8, 2.2.9, 2.2.9.1, 2.2.9.2
    • None
    • High
    • Workaround Exists
      • Wait until the cluster has fully started before sending messages, e.g. register for view changes, increment a counter, and start sending messages once a minimum number has been exceeded
      • Crude: a short timeout before sending messages; works fine for smaller clusters (up to about 8 nodes)
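The first workaround can be sketched as a small gate that senders wait on. This is a minimal illustration, not JGroups code: the class `ClusterReadyGate` and its methods are hypothetical names, and it checks the size of the latest view rather than counting view changes, which is one possible reading of the workaround.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: holds senders back until a large enough view has been seen. */
public class ClusterReadyGate {
    private final int minMembers;
    private final CountDownLatch ready = new CountDownLatch(1);

    public ClusterReadyGate(int minMembers) {
        this.minMembers = minMembers;
    }

    /** Call this from the view-change callback with the size of the new view. */
    public void onViewChange(int memberCount) {
        if (memberCount >= minMembers) {
            ready.countDown(); // further calls are harmless once the count is zero
        }
    }

    /** Non-blocking check, useful for polling. */
    public boolean isReady() {
        return ready.getCount() == 0;
    }

    /** Blocks until the gate opens; returns false if the timeout expires first. */
    public boolean awaitReady(long timeout, TimeUnit unit) throws InterruptedException {
        return ready.await(timeout, unit);
    }
}
```

In an application, `onViewChange` would typically be driven from the `viewAccepted(View)` callback of a JGroups `MembershipListener`, passing `view.size()`, and the sending thread would call `awaitReady` before its first send. The timeout gives a fallback if the expected number of members never shows up.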

      When we have a group {A,B,C} and D joins, the coordinator (A) runs the following algorithm to handle JOIN(D):
      #1 Compute new view V2={A,B,C,D}
      #2 Send unicast response with JOIN_RSP(V2) to D (D installs V2)
      #3 Multicast V2 to {A,B,C}, all install V2

      If D multicasts a message to the cluster before the existing members install V2, every member that has not yet installed V2 when D's message arrives will discard it, because D is not in its view (still V1). If the message from D modified some state, e.g. a put(key,val) on a replicated hashmap, the hashmaps will end up with inconsistent state.
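The race described above can be replayed in a toy model. The sketch below is not JGroups code: `Member`, `receivePut` and `originalOrderingIsConsistent` are made-up names, message delivery is a plain method call, and the only rule modelled is "discard a message whose sender is not in the installed view".

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class JoinRaceSketch {
    /** Toy member holding a view and a replicated map. */
    static class Member {
        Set<String> view = new HashSet<>();
        final Map<String, String> state = new HashMap<>();

        /** Applies put(key,val) only if the sender is in the current view. */
        void receivePut(String sender, String key, String val) {
            if (view.contains(sender)) {
                state.put(key, val);
            } // otherwise the message is discarded
        }
    }

    /** Replays the original ordering; returns true if all four maps end up equal. */
    public static boolean originalOrderingIsConsistent() {
        Set<String> v1 = new HashSet<>(Arrays.asList("A", "B", "C"));
        Set<String> v2 = new HashSet<>(Arrays.asList("A", "B", "C", "D"));
        Member a = new Member(), b = new Member(), c = new Member(), d = new Member();
        a.view = v1; b.view = v1; c.view = v1;

        d.view = v2; // #2: D receives JOIN_RSP(V2) and installs V2
        for (Member m : new Member[] {a, b, c, d}) {
            m.receivePut("D", "key", "val"); // D multicasts before #3 has happened
        }
        a.view = v2; b.view = v2; c.view = v2; // #3: too late, the put was already dropped

        return a.state.equals(d.state) && b.state.equals(d.state) && c.state.equals(d.state);
    }

    public static void main(String[] args) {
        System.out.println("consistent: " + originalOrderingIsConsistent());
    }
}
```

In this model only D applies its own put, so the check reports `false`: A, B and C hold empty maps while D holds the new entry.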

      SOLUTION:
      Swap #3 and #2: multicast V2 first (and wait for all view_acks), then send the JOIN_RSP to D. With this ordering, any of {A,B,C} could still multicast a message to the cluster (including D) before D has installed V2, so D would discard that message. However, if there is state involved, D will fetch the state from A anyway and overwrite whatever state change that spurious message caused.
      The chance of this happening is relatively small anyway. However, the real solution will be FLUSH (see JGroups/doc/design/FLUSH.txt, towards the end of the document).
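The swapped ordering can be replayed in the same kind of toy model. Again this is only a sketch under assumptions: the names `Member`, `receivePut` and `dMissesPutButConverges` are hypothetical, the state transfer is modelled as a plain map copy from A, and FLUSH is not modelled at all.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SwappedJoinSketch {
    /** Toy member holding a view and a replicated map. */
    static class Member {
        Set<String> view = new HashSet<>();
        final Map<String, String> state = new HashMap<>();

        /** Applies put(key,val) only if the sender is in the current view. */
        void receivePut(String sender, String key, String val) {
            if (view.contains(sender)) {
                state.put(key, val);
            } // otherwise the message is discarded
        }
    }

    /** Returns true if D first misses B's put and still ends up with the same state as A, B, C. */
    public static boolean dMissesPutButConverges() {
        Set<String> v1 = new HashSet<>(Arrays.asList("A", "B", "C"));
        Set<String> v2 = new HashSet<>(Arrays.asList("A", "B", "C", "D"));
        Member a = new Member(), b = new Member(), c = new Member(), d = new Member();
        a.view = v1; b.view = v1; c.view = v1; // D has no view installed yet

        a.view = v2; b.view = v2; c.view = v2; // V2 multicast first, view_acks collected
        for (Member m : new Member[] {a, b, c, d}) {
            m.receivePut("B", "key", "val"); // B multicasts before D has installed V2
        }
        boolean dMissedPut = d.state.isEmpty(); // D discarded the message

        d.view = v2;             // JOIN_RSP(V2) reaches D only now
        d.state.clear();
        d.state.putAll(a.state); // D fetches the state from A

        boolean converged = a.state.equals(d.state)
                && b.state.equals(d.state) && c.state.equals(d.state);
        return dMissedPut && converged;
    }

    public static void main(String[] args) {
        System.out.println("D missed the put but converged: " + dMissesPutButConverges());
    }
}
```

The sketch only covers the case where D performs a state transfer after joining. Without one, D would stay behind by the discarded message, which is the gap the issue expects FLUSH to close.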

            rhn-engineering-bban Bela Ban