Infinispan / ISPN-3366

Data loss when entry forwarding to primary owner and primary owner shutdown


    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • Fix Version/s: 5.2.8.Final, 6.0.0.Final
    • Affects Version/s: 5.2.4.Final, 6.0.0.Final
    • Component/s: Core

      This looks like a problem in entry forwarding.

      Here is the test scenario:

      • DIST mode with numOwners=2; start with a 4-node cluster, then shut down 1 node normally during load
      • HotRod putIfAbsent calls from 40 threads (1 process, 1 remote cache instance), 40000 entries in total
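
      The load pattern above can be sketched roughly as follows. This is a minimal model only: `FakeRemoteCache` is a hypothetical in-memory stand-in, not the HotRod client API, and the even split of 1000 keys per thread is an assumption (40000 / 40). The `threadNkeyM` key format is taken from the keys quoted in this report.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

NUM_THREADS = 40
KEYS_PER_THREAD = 1000  # assumption: 40000 entries split evenly across threads


class FakeRemoteCache:
    """Hypothetical in-memory stand-in for the single shared remote cache."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key, value):
        # Store the value only if the key is not present; return the previous value.
        with self._lock:
            previous = self._data.get(key)
            if previous is None:
                self._data[key] = value
            return previous

    def size(self):
        with self._lock:
            return len(self._data)


def worker(cache, thread_id):
    # Each thread writes its own disjoint key range.
    for i in range(KEYS_PER_THREAD):
        cache.put_if_absent(f"thread{thread_id}key{i}", f"value{i}")


def run_load():
    cache = FakeRemoteCache()
    with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
        futures = [pool.submit(worker, cache, t) for t in range(NUM_THREADS)]
        for future in futures:
            future.result()  # re-raise any worker exception
    return cache
```

      In the real test the cache is a distributed Infinispan cache and one node is shut down while these threads are still writing.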

      After the test run, numberOfEntries on each node is:

      • node1: 26608
      • node2: 26622
      • node3: 26746
      • node4: 0

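      The total is 79976, and the HotRod client received 11 errors. With numOwners=2, 40000 entries should produce 80000 copies, but 79976 + (11 * 2) = 79998. That leaves 2 copies unaccounted for, which means 1 entry is completely missing.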
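
      The accounting can be checked with a few lines of arithmetic. The sketch below follows the report's reasoning and assumes, as the report does, that each failed putIfAbsent stored no copy on either owner.

```python
NUM_OWNERS = 2
TOTAL_ENTRIES = 40000
CLIENT_ERRORS = 11

# numberOfEntries reported by each node after the run
entries_per_node = {"node1": 26608, "node2": 26622, "node3": 26746, "node4": 0}

stored_copies = sum(entries_per_node.values())
expected_copies = TOTAL_ENTRIES * NUM_OWNERS

# Assumption: every client error accounts for NUM_OWNERS copies that were never stored.
failed_copies = CLIENT_ERRORS * NUM_OWNERS

missing_copies = expected_copies - (stored_copies + failed_copies)
missing_entries = missing_copies // NUM_OWNERS

print(stored_copies, expected_copies, missing_copies, missing_entries)
```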

      Let's take a look at the missing entry, hash(thread16key59) = 574ff563.

      Current CH: owners(574ff563) are [node4, node1]

      The sequence of events is:

      • hotrod -> node1
      • node1 forwards it to the primary owner, node4
      • node4 shuts down without processing the forwarded entry

      Result: owners(7c29bccb) is [] (empty). The entry is completely lost, and no error is reported.
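
      The failure mode can be illustrated with a toy model. This is not Infinispan code: the `Node` class, its inbox, and `run_scenario` are hypothetical names, and the model assumes the forwarding node does not store the entry itself until the primary owner has processed the command.

```python
class Node:
    """Toy cluster member with a data store and a queue of forwarded commands."""

    def __init__(self, name):
        self.name = name
        self.running = True
        self.store = {}
        self.inbox = []  # forwarded commands received but not yet processed

    def receive_forward(self, key, value):
        # The sender gets no acknowledgement in this model.
        self.inbox.append((key, value))

    def process_inbox(self, backups):
        # The primary commits locally, then replicates to its backup owners.
        while self.inbox:
            key, value = self.inbox.pop(0)
            self.store[key] = value
            for backup in backups:
                backup.store[key] = value

    def shutdown(self):
        # Pending forwarded commands are dropped silently.
        self.running = False
        self.inbox.clear()
        self.store.clear()


def owners_holding(key, nodes):
    return [n.name for n in nodes if n.running and key in n.store]


def run_scenario(primary_processes_first):
    node1, node4 = Node("node1"), Node("node4")
    key = "thread16key59"
    node4.receive_forward(key, "value")  # node1 forwards the put to the primary
    if primary_processes_first:
        node4.process_inbox(backups=[node1])
    else:
        node4.shutdown()  # primary leaves before handling the command
    return owners_holding(key, [node1, node4])
```

      With `primary_processes_first=False` the model ends with no owner holding the key and no error raised, which matches the observed result.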


            RH Bugzilla Integration added a comment:

            Radim Vansa <rvansa@redhat.com> changed the Status of bug 989807 from ON_QA to VERIFIED

            RH Bugzilla Integration added a comment:

            Tristan Tarrant <ttarrant@redhat.com> changed the Status of bug 989807 from MODIFIED to ON_QA

            RH Bugzilla Integration added a comment:

            Tristan Tarrant <ttarrant@redhat.com> changed the Status of bug 989807 from NEW to MODIFIED

            Radim Vansa (Inactive) added a comment:

            It works in 6.0.x as well.

            Radim Vansa (Inactive) added a comment:

            I have verified the fix with 5.2.x as well (for 6.0.x, JGRP-1675 is blocking me from doing so).

            Takayoshi Kimura added a comment:

            Verified the fix with the 5.2.x branch.

            As for putIfAbsent commands, is that something we can fix within ISPN-3357?

            Dan Berindei (Inactive) added a comment:

            I integrated the fix in both master and 5.2.x.

            Regular put commands should be ok now, but putIfAbsent commands may still end up with insufficient owners in certain situations.

            Dan Berindei (Inactive) added a comment:

            The missing backup entry is probably related to my fix: if state transfer starts just before we issue a put command, and we send the entries to the new owner just before we commit the entry in the put command, that new owner won't have a backup copy of the key.

            State transfer used to automatically forward commands to all the new owners if the topology changed during the execution of the command, but I tried to remove it because the forwarded commands were executed without holding a lock on the primary, so they could lead to inconsistencies.

            I'm still trying to come up with a better solution - perhaps sending the command to the backup owners only after the entries are committed on the primary owner will work. But in the meantime I'll just re-add the forwarding in the state transfer interceptor, as locking is not reliable in non-tx caches during state transfer anyway (two nodes could both think they are the primary owner for a key at the same time).
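
            The race described in this comment can be sketched as a toy, single-threaded model with a fixed interleaving. It is not the state transfer interceptor: the function name, the topology ids, and the dict-based data containers are all hypothetical, and the locking concerns mentioned above are ignored.

```python
def put_during_state_transfer(forward_on_topology_change):
    """Replay one fixed interleaving of a put and a state transfer push."""
    primary = {}    # data container of the node executing the put
    new_owner = {}  # data container of the node that becomes an owner

    # The put command is created under the old topology.
    command_topology_id = 1

    # State transfer starts: the topology changes and the current entries
    # are pushed to the new owner. The key is not committed yet, so it is
    # not part of what gets pushed.
    current_topology_id = 2
    new_owner.update(primary)

    # The put command now commits on the executing node.
    primary["k"] = "v"

    # Optional behaviour: forward the command to the new owner when the
    # topology differs from the one the command was created under.
    if forward_on_topology_change and current_topology_id != command_topology_id:
        new_owner["k"] = "v"

    return "k" in primary, "k" in new_owner
```

            Without forwarding, the model ends with the key on the executing node only, that is, a missing backup copy. With forwarding re-added, both nodes hold the key.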

            Takayoshi Kimura added a comment:

            Tested 40 times: no complete data loss, but a missing backup entry in 4 runs.

            If the missing backup issue is not related to the ISPN-3366 fix, we can close this issue and create another one for the missing backup issue.

            Takayoshi Kimura added a comment:

            Tested with https://github.com/danberindei/infinispan/tree/t_3366_52 (5.2.8-SNAPSHOT).

            Ran the test 20 times and found 1 missing backup entry, but it is not a complete data loss.

            hash(thread17key76) = 5904ce3d

            • hotrod -> node3
            • node3 -> node1
            • node1 removed this entry due to rebalance

            See ISPN-3366-full-logs-4th.zip.
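
            The distinction between a missing backup entry and a complete data loss can be expressed as a small check over per-node key sets. This is a sketch only: `classify` is a hypothetical helper, and the sample contents below (in particular, which node still holds the surviving copy) are assumptions made for illustration, not data from the logs.

```python
def classify(key, node_contents, num_owners=2):
    """Classify a key by how many nodes still hold a copy of it."""
    copies = sum(1 for contents in node_contents.values() if key in contents)
    if copies == 0:
        return "complete data loss"
    if copies < num_owners:
        return "missing backup"
    return "ok"


# Hypothetical post-run state: node1 dropped the entry during rebalance,
# and one other node (assumed here to be node3) still holds it.
sample_contents = {
    "node1": set(),
    "node2": set(),
    "node3": {"thread17key76"},
}

print(classify("thread17key76", sample_contents))
print(classify("thread16key59", sample_contents))
```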

              Assignee: Dan Berindei (Inactive) (dberinde@redhat.com)
              Reporter: Takayoshi Kimura (rhn-support-tkimura)
              Archiver: Amol Dongare (rhn-support-adongare)
