We have heard frequent concerns about data reliability. In addition, the current architecture puts a lot of manual toil on our engineering team. This feature addresses both problems. First, it improves observability: the system should tell us which customer data it has and has not successfully processed. Second, it improves reliability: the system should use the breadcrumbs left by the observability work to attempt to heal without engineering involvement. If the system cannot recover, it should alert the engineering team that several retries have failed. Last, this feature aims to simplify code that has become more complex as we have added features, scaled, and attempted to quickly fix problems.
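To make the intended behavior concrete, the following is a minimal sketch of the track, retry, and alert flow described above. It is not the proposed implementation. The names `Breadcrumb`, `Status`, `process_with_healing`, and the retry budget of three attempts are all assumptions made for illustration, and the in-memory dictionary stands in for whatever store the breadcrumbs would actually live in.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Iterable


class Status(Enum):
    PENDING = "pending"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Breadcrumb:
    """Processing record for one unit of customer data (hypothetical shape)."""
    record_id: str
    status: Status = Status.PENDING
    attempts: int = 0
    last_error: str = ""


MAX_ATTEMPTS = 3  # assumed retry budget; the source only says "several retries"


def process_with_healing(
    record_ids: Iterable[str],
    process: Callable[[str], None],
    alert: Callable[[Breadcrumb], None],
    breadcrumbs: Dict[str, Breadcrumb],
    max_attempts: int = MAX_ATTEMPTS,
) -> Dict[str, Breadcrumb]:
    """Process each record, retrying failures and alerting when retries run out."""
    for record_id in record_ids:
        # Reuse an existing breadcrumb so earlier attempts count toward the budget.
        crumb = breadcrumbs.setdefault(record_id, Breadcrumb(record_id))
        if crumb.status is Status.SUCCEEDED:
            continue  # already processed; nothing to heal

        while crumb.attempts < max_attempts:
            crumb.attempts += 1
            try:
                process(record_id)
            except Exception as exc:
                # Leave a breadcrumb describing the failure, then retry.
                crumb.status = Status.FAILED
                crumb.last_error = str(exc)
            else:
                crumb.status = Status.SUCCEEDED
                break

        if crumb.status is not Status.SUCCEEDED:
            # Self-healing failed; hand off to the engineering team.
            alert(crumb)
    return breadcrumbs
```

In this sketch the breadcrumbs serve both goals: they answer the observability question (which records succeeded and which did not) and they drive the retry decision, so engineers are contacted only after the retry budget is exhausted.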
Design Document