[JBTM-3945] Handle state model update failures - Red Hat Issue Tracker

Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: Narayana-LRA-0.0.9.Final, Narayana-LRA-0.0.10.Final
Affects Version/s: 7.1.0.Final, Narayana-LRA-0.0.9.Final
Component/s: LRA
Labels: None

Release Note Text:

If the storage used for the protocol becomes unavailable, the protocol is paused until the storage becomes available again. Client and participant requests will receive retry error codes; callers should either retry periodically or resolve the storage instability.

The LRA protocol enables a system to perform correctly whilst tolerating various faults. To achieve this, it must durably save data as the system transitions to new states. If the system does not have access to stable storage, it needs to be able to report that to the initiating system and to participants.

Since the state cannot be saved, we need a predictable algorithm for how to proceed. Continuing in the presence of faults is feasible but high risk. The safest strategy is to pause the protocol until the storage becomes accessible again; until that happens, retry error codes should be returned to clients and participants.
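
A minimal sketch of the pause strategy described above, not the Narayana implementation. All type names here (`StateStore`, `StoreUnavailableException`, `PausingCoordinator`, `TransitionResult`) are hypothetical, and the use of HTTP 503 as the retry code is an assumption, since this issue does not name a specific code. The idea is that a state transition only takes effect after the durable write succeeds; otherwise the previous state is kept and the caller is told to retry.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the coordinator's durable store; not a Narayana API. */
interface StateStore {
    void write(String lraId, String state) throws StoreUnavailableException;
}

/** Hypothetical signal that stable storage cannot currently be reached. */
class StoreUnavailableException extends Exception {
    StoreUnavailableException(String message) {
        super(message);
    }
}

/** Outcome of a transition request: an HTTP-style status plus the state in effect. */
final class TransitionResult {
    final int status;
    final String state;

    TransitionResult(int status, String state) {
        this.status = status;
        this.state = state;
    }
}

class PausingCoordinator {
    static final int OK = 200;
    static final int RETRY_LATER = 503; // assumed retry code; the issue does not specify one

    private final StateStore store;
    private final Map<String, String> states = new HashMap<>();

    PausingCoordinator(StateStore store) {
        this.store = store;
    }

    /**
     * Persists the new state first and only then updates the in-memory view.
     * If the write fails, the LRA stays in its previous state and the caller
     * receives the retry code, which effectively pauses the protocol.
     */
    synchronized TransitionResult transition(String lraId, String newState) {
        try {
            store.write(lraId, newState);
        } catch (StoreUnavailableException e) {
            // "Unknown" is a placeholder for an LRA with no recorded state.
            return new TransitionResult(RETRY_LATER, states.getOrDefault(lraId, "Unknown"));
        }
        states.put(lraId, newState);
        return new TransitionResult(OK, newState);
    }
}
```

Writing to the store before touching in-memory state means a failed write never leaves the coordinator reporting a state it could not recover after a crash.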

The kinds of failures we'd like tests for include:

general:
[ ] test different orders of execution and concurrency and race conditions
[ ] throw exceptions from unexpected places and verify
[ ] check that we handle cases where participants respond out of spec
[ ] lock acquisition failures

participant filter calls to the coordinator:
failed calls:
[ ] network request timeout
[X] enlistCompensator
[X] start LRA
[X] end LRA
[ ] leaveLRA
[ ] setCurrentLRA
[ ] getStatus

coordinator failures:
[ ] network request timeout
[ ] store write failures (probably covered by the others)
[ ] not being able to contact participants
[ ] duplicate messages

[ ] don't need to handle:
corrupted messages
etc
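
One way the "store write failures" item above could be exercised is with a fault-injecting test double. The sketch below is only illustrative: `FlakyStore` and `RetryingClient` are hypothetical names rather than classes from the LRA test suite, and 503 is again an assumed retry code. The store rejects a configurable number of writes before "recovering", and the client retries a bounded number of times.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

/** Hypothetical test double: rejects the first N writes, then accepts every later one. */
class FlakyStore {
    private final AtomicInteger failuresLeft;
    private final AtomicInteger accepted = new AtomicInteger();

    FlakyStore(int failures) {
        this.failuresLeft = new AtomicInteger(failures);
    }

    /** Returns 503 while injected failures remain, 200 once the store has recovered. */
    int write(String lraId, String state) {
        // Atomically consume one injected failure, never going below zero.
        if (failuresLeft.getAndUpdate(n -> n > 0 ? n - 1 : 0) > 0) {
            return 503;
        }
        accepted.incrementAndGet();
        return 200;
    }

    int acceptedWrites() {
        return accepted.get();
    }
}

class RetryingClient {
    /**
     * Invokes the request until it returns something other than 503 or until
     * maxAttempts calls have been made, sleeping delayMillis between attempts.
     * Returns the last status seen.
     */
    static int callWithRetry(IntSupplier request, int maxAttempts, long delayMillis) {
        int status = request.getAsInt();
        for (int attempt = 1; attempt < maxAttempts && status == 503; attempt++) {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop retrying
                return status;
            }
            status = request.getAsInt();
        }
        return status;
    }
}
```

A test could then assert that a client retrying against `new FlakyStore(2)` eventually sees 200 with exactly one accepted write, and that a client whose attempt budget is smaller than the number of injected failures still sees 503.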

Assignee: Michael Musgrove
Reporter: Michael Musgrove
Votes: 0
Watchers: 1
Created: 2024/11/05 10:44 AM
Updated: 2024/12/03 4:32 PM
