Infinispan / ISPN-4546

Possible stale lock when the primary owner leaves during rebalance

    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • Fix Version: 7.2.0.Final
    • Affects Versions: 7.0.0.Alpha5, 7.1.1.Final
    • Components: Core, State Transfer

      Topology T: coordinator = A, owners(k) = [C, D], pending_owners(k) = null
      B sends prepareCommand(tx1, put(k, v)) to C, D
      D adds backup locks and replies
      C acquires lock, ready to send reply to B
      A starts installing topology T+1: owners(k) = [C, D], pending_owners(k) = [C, E]
      A, C and E install topology T+1, B and D do not
      E requests and receives tx data from C, including tx1
      C leaves
      B sees a SuspectException, sends rollbackCommand(tx1) to C, D
      D removes tx1
      C has already left; B ignores it as a leaver
      B reports to the user that the tx has been rolled back
      B and D install topology T+1 (optional)
      A starts installing topology T+2: owners(k) = [D], pending_owners(k) = [E]
      A, B, D, E all install topology T+2
      E requests and receives state from D, which no longer includes tx1, but E does not remove its own copy of tx1
      A starts installing topology T+3: owners(k) = [E], pending_owners(k) = null
      E now has a stale backup lock on k
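
The sequence above can be replayed with a small toy model. This is only an illustrative sketch, not Infinispan code: `Node`, `rollback`, `state_transfer` and `run_scenario` are hypothetical names, topology ids 0 to 3 stand for T to T+3, and a transaction id in `txs` stands for a (backup) lock on k.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy cluster member (hypothetical, not an Infinispan class)."""
    name: str
    topology: int = 0                      # 0 stands for topology T
    txs: set = field(default_factory=set)  # transactions holding a lock on k
    alive: bool = True


def rollback(tx, targets):
    """Send a rollback to each target; members that have left are skipped."""
    for node in targets:
        if node.alive:
            node.txs.discard(tx)


def state_transfer(source, target):
    """Target merges the source's transactions and never drops its own."""
    target.txs |= source.txs


def run_scenario():
    a, b, c, d, e = (Node(n) for n in "ABCDE")

    # Topology T: owners(k) = [C, D]. B prepares tx1 on C and D.
    c.txs.add("tx1")  # C acquires the lock
    d.txs.add("tx1")  # D adds a backup lock

    # T+1: pending_owners(k) = [C, E]; only A, C and E install it.
    for node in (a, c, e):
        node.topology = 1
    state_transfer(c, e)  # E learns about tx1 from C

    # C leaves; B still targets the owners from topology T.
    c.alive = False
    rollback("tx1", [c, d])  # D removes tx1, C is skipped, E is never contacted

    # T+2: owners(k) = [D], pending_owners(k) = [E].
    for node in (a, b, d, e):
        node.topology = 2
    state_transfer(d, e)  # D has nothing to send; E keeps tx1

    # T+3: owners(k) = [E].
    for node in (a, b, d, e):
        node.topology = 3
    return d, e


d, e = run_scenario()
print(d.txs)  # set()
print(e.txs)  # {'tx1'}
```

In this model the rollback never reaches E because B computes its targets from topology T, and state transfer only ever adds transactions, so nothing later clears tx1 on E.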

      This seems very hard to reproduce in production: C would have to leave early enough that B and D have not yet installed topology T+1, but late enough that it has already sent its transaction data to E.

      A possible solution would be to catch any SuspectException during prepare/commit/rollback (without ignoring leavers), wait for a new topology, and then replicate the command again to the new owners. This would not work with asynchronous prepare/commit/rollback, where the originator does not wait for responses.
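
A minimal sketch of that retry idea, under stated assumptions: `Cluster`, `invoke_with_retry` and the local `SuspectException` class are hypothetical stand-ins written for this example, not Infinispan APIs, and the toy "waits" for a topology by simply advancing to the next known id. It also assumes commands are idempotent, since a node that received the command before the failure may receive it again on retry.

```python
class SuspectException(Exception):
    """Hypothetical stand-in: a target left before it could reply."""


class Cluster:
    """Toy cluster view: a topology id and the owners of k per topology."""

    def __init__(self, owners_by_topology):
        self.owners_by_topology = owners_by_topology
        self.topology_id = min(owners_by_topology)
        self.left = set()    # names of nodes that have left
        self.delivered = []  # log of (topology_id, node, command)

    def owners(self):
        return self.owners_by_topology[self.topology_id]

    def wait_for_newer_topology(self, seen):
        # A real implementation would block until a new topology is
        # installed; the toy just jumps to the next known id.
        newer = [t for t in sorted(self.owners_by_topology) if t > seen]
        if not newer:
            raise RuntimeError("no newer topology available")
        self.topology_id = newer[0]

    def send(self, node, command):
        # Leavers are NOT ignored: a missing target is an error.
        if node in self.left:
            raise SuspectException(node)
        self.delivered.append((self.topology_id, node, command))


def invoke_with_retry(cluster, command, max_attempts=5):
    """Send command to all owners; on a suspect, wait and resend to new owners."""
    for _ in range(max_attempts):
        seen = cluster.topology_id
        try:
            for node in cluster.owners():
                cluster.send(node, command)
            return seen  # topology in which every owner received the command
        except SuspectException:
            cluster.wait_for_newer_topology(seen)
    raise RuntimeError("gave up after %d attempts" % max_attempts)


cluster = Cluster({0: ["C", "D"], 1: ["D", "E"]})
cluster.left.add("C")
invoke_with_retry(cluster, "rollback(tx1)")
print(cluster.delivered)
# [(1, 'D', 'rollback(tx1)'), (1, 'E', 'rollback(tx1)')]
```

With this approach the rollback reaches E once the newer topology is visible to the originator, which is the step missing from the scenario above.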

              dberinde@redhat.com Dan Berindei (Inactive)