Bug
Resolution: Unresolved
Major
7.1.0.Final, Narayana-LRA-0.0.9.Final
None
The LRA protocol enables a system to perform correctly whilst tolerating various faults. To achieve this, it must durably save data as the system transitions to new states. If the system does not have access to stable storage, it needs to be able to report that to the initiating system and to participants.
Since the state cannot be saved, we need a predictable algorithm for how to proceed. Continuing in the presence of such faults is feasible but high risk. The safest strategy is to pause the protocol until the storage becomes accessible again; until that happens, retry error codes should be returned to clients and participants.
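The pause-and-retry strategy could be sketched roughly as follows. This is an illustration only, not the Narayana implementation: `PausingCoordinator`, `Reply`, the in-memory map standing in for stable storage, and the choice of HTTP 503 with a retry hint as the "retry error code" are all assumptions made for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;

/** Hypothetical reply: an HTTP-like status plus a retry hint in seconds (0 = none). */
final class Reply {
    final int status;
    final int retryAfterSeconds;

    Reply(int status, int retryAfterSeconds) {
        this.status = status;
        this.retryAfterSeconds = retryAfterSeconds;
    }
}

/** Hypothetical coordinator guard: refuses state transitions while the store is unavailable. */
final class PausingCoordinator {
    static final int OK = 200;
    static final int SERVICE_UNAVAILABLE = 503; // assumed "retry later" code

    private final BooleanSupplier storeAvailable;   // health check for stable storage
    private final int retryAfterSeconds;            // hint returned to callers while paused
    // Stand-in for stable storage; a real store would be durable.
    private final Map<String, String> durableStates = new ConcurrentHashMap<>();

    PausingCoordinator(BooleanSupplier storeAvailable, int retryAfterSeconds) {
        this.storeAvailable = storeAvailable;
        this.retryAfterSeconds = retryAfterSeconds;
    }

    /** Records the new state only if the store is reachable; otherwise asks the caller to retry. */
    Reply transition(String lraId, String newState) {
        if (!storeAvailable.getAsBoolean()) {
            // Pause: do not advance the protocol, leave the last saved state untouched.
            return new Reply(SERVICE_UNAVAILABLE, retryAfterSeconds);
        }
        durableStates.put(lraId, newState);
        return new Reply(OK, 0);
    }

    /** Returns the last saved state, or null if none was saved. */
    String stateOf(String lraId) {
        return durableStates.get(lraId);
    }
}
```

A test can drive the `storeAvailable` supplier from a flag, flip it between calls, and check that no state change is recorded while the store is reported as down.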
The kinds of failures we'd like tests for include:
general:
  [ ] test different orders of execution and concurrency and race conditions
  [ ] throw exceptions from unexpected places and verify
  [ ] check that we handle cases where participants respond out of spec
  [ ] lock acquisition failures
participant filter calls to the coordinator:
  failed calls:
    [ ] network request timeout
    [X] enlistCompensator
    [X] start LRA
    [X] end LRA
    [ ] leaveLRA
    [ ] setCurrentLRA
    [ ] getStatus
coordinator failures:
  [ ] network request timeout
  [ ] store write failures (probably covered by the others)
  [ ] not being able to contact participants
  [ ] duplicate messages
  [ ] don't need to handle:
    corrupted messages
    etc