Overview
ATS relies on queue processing to perform many of its tasks. Currently, the queue processing implementation is quite naive, with challenges such as:
- Events are rescheduled in large batches, creating a potential thundering herd problem and queue live-lock if certain events cannot be processed
- There is no back-off when events fail to be processed, e.g. when an external dependency is down
- There is no jitter applied to events that are rescheduled to create a more even distribution of work
- Controllers can signal that an event should be rescheduled only by returning an error
- Controllers cannot indicate when an event should next be retried
The purpose of this epic is two-fold:
- Incrementally improve the queue handling in ATS to address the above
- Contribute the changes upstream into the rh-trex project so that all future micro-services benefit from our work
Acceptance Criteria
Done Criteria
- All Acceptance Criteria are met.
- All existing/affected SOPs have been updated.
- New SOPs have been written.
- Internal training has been developed and delivered.
- The feature has full, automated test suites passing in all pipelines.
- If the feature requires QE involvement, QE has signed off.
- The feature exposes metrics necessary to monitor.
- The feature has had a security review / Contract impact assessment.
- Service documentation is fully updated and complete.
- Product Manager signed off.
References
Links to Gdocs, GitHub, and any other relevant information about this epic.