Red Hat Data Grid / JDG-1426

Data loss caused by a single node which had a long GC pause

    Details

    • Type: Bug
    • Status: Verified
    • Priority: Blocker
    • Resolution: Done
    • Affects Version/s: JDG 7.1.1 GA
    • Fix Version/s: JDG 7.2 CR2
    • Component/s: Clustering
    • Labels:
      None
    • Target Release:
    • Fix Build:
      CR2
    • Steps to Reproduce:
      1. Start a 3-node cluster.
      2. Stop one node with kill -STOP $PID.
      3. Wait until the cluster view changes.
      4. Put some data into a cache.
      5. Resume the node with kill -CONT $PID.
      6. Wait until a merged cluster view is received.
      7. Check all the cache data you put. Some entries are now null.
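
      Steps 2 and 5 can be scripted. The following is a minimal sketch, assuming a Unix-like host; the helper names (pause_process, resume_process, simulate_long_pause) are hypothetical and not part of JDG. Steps 3, 4, 6 and 7 depend on your own cluster and cache client and are not shown.

```python
import os
import signal
import time


def pause_process(pid: int) -> None:
    """Freeze a process, equivalent to `kill -STOP $PID` (step 2)."""
    os.kill(pid, signal.SIGSTOP)


def resume_process(pid: int) -> None:
    """Let the frozen process continue, equivalent to `kill -CONT $PID` (step 5)."""
    os.kill(pid, signal.SIGCONT)


def simulate_long_pause(pid: int, pause_seconds: float) -> None:
    """Freeze `pid` for `pause_seconds`, then resume it with its heap intact.

    The pause must be long enough for the remaining nodes to install a new
    cluster view (step 3) and for the test to write data (step 4).
    """
    pause_process(pid)
    try:
        time.sleep(pause_seconds)
    finally:
        # Always resume, even if the sleep is interrupted.
        resume_process(pid)
```

      How long the pause needs to be depends on the failure detection settings of the cluster, so pause_seconds is left to the caller.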
    • Sprint:
      JDG Sprint #9, JDG Sprint #10, JDG Sprint #11, JDG Sprint #12
    • QE Test Coverage:
      +

      Description

      There is a data loss scenario that can happen frequently. If a node is dropped from a cluster while it is actually still alive (the Java process persists with its heap contents, as during a long GC pause), it soon rejoins the cluster and overrides the cluster topology, because the dropped node holds the topology with the largest size.

      For example, consider a cluster of 3 nodes: A, B and C. C has a long GC pause and is dropped from the cluster. A and B form a new cluster of size 2. When C comes back to the cluster, it overrides the topology of size 2 because C remembers the topology from when the size was 3. Some of the updates made while the size was 2 are no longer accessible.
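
      The A/B/C example can be illustrated with a toy model. This is only a sketch of the behaviour described above, not JDG or Infinispan code: the Node class, the single-owner hash rule in owner(), and merge_largest_wins() are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    members: tuple  # the topology this node currently remembers
    store: dict = field(default_factory=dict)


def owner(key: str, members: tuple) -> str:
    """Toy ownership rule (assumption): one owner per key, chosen by a stable hash."""
    return members[sum(key.encode()) % len(members)]


def merge_largest_wins(nodes: dict) -> tuple:
    """The flawed rule from the description: the largest remembered topology wins."""
    winner = max((n.members for n in nodes.values()), key=len)
    for n in nodes.values():
        n.members = winner
    return winner


def simulate() -> list:
    full = ("A", "B", "C")
    nodes = {name: Node(name, full) for name in full}

    # C pauses; A and B install a 2-member topology, C still remembers 3.
    for name in ("A", "B"):
        nodes[name].members = ("A", "B")

    # Writes made while the cluster size is 2.
    keys = [f"k{i}" for i in range(6)]
    for k in keys:
        nodes[owner(k, ("A", "B"))].store[k] = f"v-{k}"

    # C resumes and its stale 3-member topology overrides the newer one.
    merged = merge_largest_wins(nodes)

    # Reads are routed by the merged topology; misses come back as None.
    return [k for k in keys if nodes[owner(k, merged)].store.get(k) is None]


lost_keys = simulate()
print(lost_keys)
```

      In this model a key is lost whenever the stale 3-member topology routes it to a node that did not receive the write, which mirrors the "some become null" symptom in the steps to reproduce.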

      All versions of JDG 6, JDG 7 and Infinispan 9 are affected.

              People

              Assignee:
              Ryan Emerson (ryanemerson)
              Reporter:
              Osamu Nagano (osamu.nagano)
              Tester:
              Diego Lovison
              Votes:
              2
              Watchers:
              9