Red Hat Data Grid / JDG-1426

Data loss caused by a single node which had a long GC pause


    • Type: Bug
    • Resolution: Done
    • Priority: Blocker
    • Fix Version: JDG 7.2 CR2
    • Affects Version: JDG 7.1.1 GA
    • Component: Clustering
    • None
    • CR2

      1. Start a 3-node cluster.
      2. Stop one node with kill -STOP $PID.
      3. Wait until the cluster view has changed.
      4. Put some data into a cache.
      5. Resume the node with kill -CONT $PID.
      6. Wait until a merged cluster view has been received.
      7. Check all the cache data you put. Some entries come back null.
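
      Steps 2 and 5 rely on SIGSTOP/SIGCONT to freeze and resume the JVM without killing it, so the process keeps its heap (and its remembered topology) while the rest of the cluster drops it. The mechanism can be sketched on a throwaway process; `sleep` stands in for a JDG node here:

      ```shell
      # Freeze and resume a throwaway process the same way steps 2 and 5
      # freeze and resume a node ('sleep' stands in for the JDG JVM).
      sleep 300 &
      PID=$!

      kill -STOP "$PID"                             # simulate a long GC pause
      ps -o state= -p "$PID" | tr -d ' ' | cut -c1  # T: stopped, not dead

      kill -CONT "$PID"                             # the node "comes back"
      ps -o state= -p "$PID" | tr -d ' ' | cut -c1  # sleeping again

      kill "$PID"                                   # clean up
      ```

      Because the process is only stopped, no shutdown hooks or rebalancing run on it; from its own point of view nothing happened, which is exactly why it rejoins with a stale view.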

    • JDG Sprint #9, JDG Sprint #10, JDG Sprint #11, JDG Sprint #12

      There is a data loss scenario that can occur frequently. If a node is dropped from the cluster while it is actually still alive (the Java process persists with its heap contents, e.g. due to a long GC pause), it soon rejoins the cluster and overrides the cluster topology, because the dropped node holds the topology with the largest size.

      For example, suppose a cluster of three nodes: A, B, and C. C has a long GC pause and is dropped from the cluster. A and B form a new cluster of size 2. When C comes back, it overrides the size-2 topology because C still remembers the topology from when the size was 3. Some of the updates made while the size was 2 are no longer accessible.
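
      The flawed merge rule can be sketched in plain Java. Everything here is an illustrative stand-in, not Infinispan's actual classes: a toy topology with a trivial key-to-owner mapping, and a merge that always prefers the topology with the most members, so C's stale 3-member view wins over the current 2-member view and writes made under the smaller view become unreachable.

      ```java
      import java.util.HashMap;
      import java.util.List;
      import java.util.Map;

      public class TopologyMergeSketch {
          // Illustrative stand-in for a cache topology: a member list plus a
          // trivial key-to-owner mapping.
          record Topology(List<String> members) {
              String ownerOf(String key) {
                  return members.get(Math.floorMod(key.hashCode(), members.size()));
              }
          }

          // Flawed merge rule described in the report: the topology with the
          // most members wins, even when it comes from a node that was paused.
          static Topology resolveMerge(Topology a, Topology b) {
              return a.members().size() >= b.members().size() ? a : b;
          }

          public static void main(String[] args) {
              Topology stale = new Topology(List.of("A", "B", "C")); // C's remembered view
              Topology current = new Topology(List.of("A", "B"));    // view while C was paused

              // One store per node; writes route to owners under the *current* view.
              Map<String, Map<String, String>> stores = new HashMap<>();
              for (String node : List.of("A", "B", "C")) stores.put(node, new HashMap<>());
              for (int i = 0; i < 10; i++) {
                  String key = "k" + i;
                  stores.get(current.ownerOf(key)).put(key, "v" + i);
              }

              // C rejoins; the stale 3-member topology overrides the 2-member one.
              Topology merged = resolveMerge(stale, current);

              // Reads now route by the merged topology, so some keys resolve to
              // owners that never saw the write and come back null.
              int lost = 0;
              for (int i = 0; i < 10; i++) {
                  String key = "k" + i;
                  if (stores.get(merged.ownerOf(key)).get(key) == null) lost++;
              }
              System.out.println("entries lost after merge: " + lost);
          }
      }
      ```

      A sound merge would prefer the topology of the surviving partition (the one that kept serving writes) rather than the largest one, which is the direction the fix takes.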

      All versions of JDG 6, JDG 7, and Infinispan 9 are affected.

              Assignee: Ryan Emerson (remerson@redhat.com)
              Reporter: Osamu Nagano (rhn-support-onagano)
              Diego Lovison
              Votes: 2
              Watchers: 9

                Created:
                Updated:
                Resolved: