Red Hat Data Grid / JDG-6918

View change during a cache join can lead to not replicating data


Details

    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • Fix Version: RHDG 8.4.8 GA
    • Affects Versions: RHDG 8.1.1 GA, RHDG 8.4.6 GA
    • Component: Core
    • Labels: None

    Description

      The issue can happen when nodes start concurrently. If the JGroups view changes while a member is joining a cache, state transfer wrongly treats the new joiner as an existing member, so data is not transferred to it correctly. This happens because the joiner retries the join request in the new view.
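
      In embedded mode, the timing can be exercised by starting members concurrently. Below is a minimal reproduction sketch, assuming Infinispan's embedded API and a REPL_SYNC cache named "data"; since the race needs a view change to land mid-join, it will not trigger on every run:

      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;

      import org.infinispan.Cache;
      import org.infinispan.configuration.cache.CacheMode;
      import org.infinispan.configuration.cache.Configuration;
      import org.infinispan.configuration.cache.ConfigurationBuilder;
      import org.infinispan.configuration.global.GlobalConfigurationBuilder;
      import org.infinispan.manager.DefaultCacheManager;

      public class ConcurrentJoinRepro {

         static Configuration replConfig() {
            return new ConfigurationBuilder()
                  .clustering().cacheMode(CacheMode.REPL_SYNC)
                  .build();
         }

         // Creating the cache manager joins the JGroups view; getCache()
         // joins the cache and triggers the state transfer.
         static Cache<Integer, Integer> startNode() {
            DefaultCacheManager cm = new DefaultCacheManager(
                  GlobalConfigurationBuilder.defaultClusteredBuilder().build());
            cm.defineConfiguration("data", replConfig());
            return cm.getCache("data");
         }

         public static void main(String[] args) {
            // Node 0 and Node 1 start first and hold the initial 100 entries.
            Cache<Integer, Integer> c0 = startNode();
            startNode();
            for (int i = 0; i < 100; i++) c0.put(i, i);

            // Node 2 and Node 3 start concurrently. If the view changes
            // (Node 3 joining) while Node 2's cache join is in flight,
            // Node 2 retries in the new view and can wrongly become a
            // state donor, ending up with missing entries.
            ExecutorService pool = Executors.newFixedThreadPool(2);
            pool.submit(ConcurrentJoinRepro::startNode);
            pool.submit(ConcurrentJoinRepro::startNode);
            pool.shutdown();
         }
      }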

       

      On start, the node sends a join request to the coordinator. The coordinator handles the request, starts a rebalance, and updates the topology to contain the pending consistent hash. However, if the JGroups view changes, the joiner resends the join request and receives back the topology with the pending consistent hash. This causes state transfer to consider the new node already part of the topology and to treat it as a state donor.
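
      To make the sequence concrete, here is a simplified sketch of the exchange. The types below (Topology, Coordinator, Joiner) are illustrative stand-ins, not Infinispan's actual topology classes:

      import java.util.Set;

      // Illustrative stand-ins, not Infinispan's real classes.
      record Topology(Set<String> members, Set<String> pendingMembers) {}

      interface Coordinator {
         // The first call adds the joiner to the pending consistent hash
         // and starts a rebalance; a retried call returns the updated
         // topology in which the joiner already appears as pending.
         Topology handleJoin(String joiner);
      }

      class Joiner {
         Topology join(Coordinator coordinator, String self, boolean viewChanged) {
            Topology topology = coordinator.handleJoin(self);
            if (viewChanged) {
               // The JGroups view changed before the first response was
               // handled, so the joiner retries and now receives a topology
               // whose pending consistent hash already contains `self`.
               topology = coordinator.handleJoin(self);
            }
            // Bug: because `self` is in the pending hash, state transfer
            // considers it an existing member (a donor) and sends it no data.
            return topology;
         }
      }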

       

      If the view changes before the node receives the join response, the issue happens consistently, regardless of whether the cache is configured as DIST or REPL. As an example of each mode, with Node 0 and Node 1 starting first, Node 2 sending the join request, and Node 3 starting concurrently and changing the view, we observe:

       In REPL:

      Node 0: size 100/100
      Node 1: size 100/100
      Node 2: size 0/100
      Node 3: size 66/100
      

       

      In DIST:

      Node 0: size 80/100
      Node 1: size 82/100
      Node 2: size 48/100
      Node 3: size 69/100
      
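
      For reference, per-node counts like the ones above can be collected with a local-only size check on each member. A small sketch, assuming an embedded Cache handle (the localSize helper name is ours):

      import org.infinispan.Cache;
      import org.infinispan.context.Flag;

      final class SizeCheck {
         // Counts only the entries stored on this node, skipping remote lookups.
         static int localSize(Cache<?, ?> cache) {
            return cache.getAdvancedCache()
                        .withFlags(Flag.CACHE_MODE_LOCAL)
                        .size();
         }
      }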

       

      Any subsequent node that joins will also be missing data from the transfer. To summarize, the issue has a tight window in which to occur:

       

      • The coordinator has already processed the join request;
      • The view updates on the joiner before the join response is handled.

      Also, keep in mind that the problem affects only the single cache the node is joining. Other caches will continue to work correctly.

       

      The only workaround I can think of to mitigate this would be starting nodes in an ordered way. This is especially difficult for embedded users: the client application would need a mechanism to scale up one node at a time, starting the next node only after the previous one is guaranteed to have joined the caches.
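
      For illustration, a sketch of that ordered scale-up, assuming Infinispan's embedded API on a clustered cache; the startAndAwait helper, the polling loop, and the 30-second timeout are ours:

      import java.util.concurrent.TimeUnit;

      import org.infinispan.Cache;
      import org.infinispan.manager.EmbeddedCacheManager;

      final class OrderedStartup {
         // Starts one node's cache and blocks until the cache topology
         // reaches the expected member count, so the caller starts the
         // next node only after this join has completed.
         static Cache<?, ?> startAndAwait(EmbeddedCacheManager cm, String cacheName,
                                          int expectedMembers) throws InterruptedException {
            Cache<?, ?> cache = cm.getCache(cacheName); // triggers the cache join
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(30);
            while (cache.getAdvancedCache().getDistributionManager()
                        .getCacheTopology().getMembers().size() < expectedMembers) {
               if (System.nanoTime() > deadline)
                  throw new IllegalStateException("Node did not join " + cacheName + " in time");
               Thread.sleep(100);
            }
            return cache;
         }
      }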

       

       

People

    Assignee: Jose Bolina (rh-ee-jbolina)
    Reporter: Wolf Fink (rhn-support-wfink)