JBoss Transaction Manager / JBTM-3945

Handle state model update failures

      If the storage used by the protocol becomes unavailable, pause the protocol until it becomes available again. Client and participant requests will receive retry error codes, and callers should either retry periodically or resolve the storage instability.

      The LRA protocol enables a system to behave correctly while tolerating various faults. To achieve this it must durably save data as the system transitions to new states. If the system does not have access to stable storage, it needs to be able to report that to the initiating system and to participants.

      When the state cannot be saved we need a predictable algorithm for how to proceed. Continuing in the presence of storage faults is feasible but high risk. The safest strategy is to pause the protocol until the storage becomes accessible again, and to return retry error codes to clients and participants until then.
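
A minimal sketch of the pause-and-retry behaviour described above, not the actual coordinator code. All names here (`StateStore`, `Outcome`, `PausingCoordinator`) are hypothetical and are not Narayana APIs. The sketch assumes that a failed write surfaces as an `IOException`, and that each incoming request is allowed to probe the store again so the protocol resumes as soon as a write succeeds.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical abstraction over the durable store (an assumption, not a Narayana API). */
interface StateStore {
    void write(String lraId, String state) throws IOException;
}

/** Outcome reported back to clients and participants. */
enum Outcome { OK, RETRY_LATER }

/** Pauses state transitions while the store is unavailable. */
final class PausingCoordinator {
    private final StateStore store;
    private final AtomicBoolean paused = new AtomicBoolean(false);

    PausingCoordinator(StateStore store) {
        this.store = store;
    }

    /** Attempts a transition; the new state only counts once the write has succeeded. */
    Outcome transition(String lraId, String newState) {
        try {
            store.write(lraId, newState);
            paused.set(false);          // a successful write shows the store is back
            return Outcome.OK;
        } catch (IOException e) {
            paused.set(true);           // do not advance the protocol without durable state
            return Outcome.RETRY_LATER; // caller should retry or fix the storage
        }
    }

    boolean isPaused() {
        return paused.get();
    }
}
```

How `RETRY_LATER` is surfaced is left open here. In a REST layer it could, for example, be mapped to a 503 response with a `Retry-After` header, but that mapping is an assumption and is not specified by this issue.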

      The kinds of failures we'd like tests for include:

      general:
        [ ] test different orders of execution and concurrency and race conditions
        [ ] throw exceptions from unexpected places and verify
        [ ] check that we handle cases where participants respond out of spec
        [ ] lock acquisition failures

      participant filter calls to the coordinator:
        failed calls:
        [ ] network request timeout
        [X] enlistCompensator
        [X] start LRA
        [X] end LRA
        [ ] leaveLRA
        [ ] setCurrentLRA
        [ ] getStatus

      coordinator failures:
        [ ] network request timeout
        [ ] store write failures (probably covered by the others)
        [ ] not being able to contact participants
        [ ] duplicate messages

        [ ] don't need to handle:
          corrupted messages
          etc
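
One possible way to drive the "store write failures" tests listed above is a fault-injecting store plus a simple retry helper. This is only a sketch: `FaultInjectingStore` and `StoreRetry` are hypothetical names invented for illustration, the store is a plain in-memory map, and the fixed pause between attempts stands in for whatever retry policy a real client or participant would use.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** In-memory store that fails the next N writes (hypothetical test helper). */
final class FaultInjectingStore {
    private final Map<String, String> data = new HashMap<>();
    private int failuresRemaining;

    /** Arrange for the next n calls to write() to throw. */
    void failNextWrites(int n) {
        failuresRemaining = n;
    }

    void write(String key, String value) throws IOException {
        if (failuresRemaining > 0) {
            failuresRemaining--;
            throw new IOException("injected store write failure");
        }
        data.put(key, value);
    }

    /** Returns the last successfully written value, or null if none. */
    String read(String key) {
        return data.get(key);
    }
}

/** Retries a write a bounded number of times, pausing between attempts. */
final class StoreRetry {
    private StoreRetry() {
    }

    static boolean writeWithRetry(FaultInjectingStore store, String key, String value,
                                  int maxAttempts, long pauseMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                store.write(key, value);
                return true;                         // state is now durable
            } catch (IOException e) {
                if (attempt == maxAttempts) {
                    break;                           // give up, caller sees a retry error
                }
                try {
                    Thread.sleep(pauseMillis);       // fixed pause; a real client may back off
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

A test could call `failNextWrites(2)`, check that a two-attempt write reports failure and leaves the stored state untouched, and then check that a later attempt succeeds once the injected failures are exhausted.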

              rhn-engineering-mmusgrov Michael Musgrove