Type: Task
Resolution: Done
Priority: Critical
Context:
Our system sometimes fails to transmit billable usage data to marketplaces. These failures arise either from mismatches between the values sent during remittance and the corresponding product information in the marketplace, or from restrictions imposed by individual marketplaces.
The goal is to design and implement enhancements to the remittance process so that it handles these transmission failures reliably. The proposed solution is a mechanism that automatically resends billable usage data until the transmission succeeds, so that usage is exchanged accurately between our system and each marketplace.
Design doc: https://docs.google.com/document/d/1liKSpUL1WIRO_MhUKA7OEmNCx4n8foBRmLQEvDpDz6I/edit#heading=h.11kdiw50yuu7
KStreams POC: https://github.com/RedHatInsights/rhsm-subscriptions/pull/3012
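For illustration only, the automatic resend described in the context could look roughly like the sketch below. It is not taken from the design doc or the KStreams POC. `TransientMarketplaceError`, `send_with_retry`, the attempt cap, and the exponential backoff are all hypothetical choices; the context asks for resending until success, while this sketch caps attempts so that a caller could hand exhausted messages to a DLQ.

```python
import time


class TransientMarketplaceError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def send_with_retry(send, usage, max_attempts=5, base_delay_s=1.0, sleep=time.sleep):
    """Call send(usage) until it succeeds or max_attempts is reached.

    Returns the number of attempts used. Re-raises the last
    TransientMarketplaceError once attempts are exhausted, so the caller
    can decide what to do next (for example, route the message to a DLQ).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send(usage)
            return attempt
        except TransientMarketplaceError:
            if attempt == max_attempts:
                raise
            # Exponential backoff: base, 2x base, 4x base, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable only so the backoff can be exercised without waiting. Non-retryable failures (for example, a product mismatch in the marketplace) would presumably need a different path than blind resending.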
Done:
- Summarized Remittance
- Ensure a single remittance per hour for each marketplace, customer, and product metric.
- Design should clearly outline changes in each affected service.
- DLQ per marketplace or a single DLQ for all:
- We can use Azure DLQ as a baseline example.
- Failure Identification and Recovery
- What types of alerting and dashboards should be created, and where (Splunk, Grafana, etc.)?
- In case of failure, how do we quickly identify what we remitted and what we didn't?
- API to reset the remittance pending value (the existing API uses a time range). This design should consider whether and how this API needs to change.
- Types of failures we need to recover from:
- Contract ingestion
- Reading from Prometheus
- Processing Tallies
- Sending to the marketplace
- ...
- Recalculate Remittance:
- Since re-tallying won't be possible in the future, this design doc should consider an API for "recalculate remittance".
- Explore alternative approaches:
- Evaluate the feasibility of resending events.
- Consider making necessary adjustments in related tables.
- Explore the use of a flag to determine remittance failure and restart the process from that point.
- Aggregate Usages Per Hour
- The AWS and Azure marketplaces allow only one usage record to be sent per resource per hour. Currently, retrying a billable usage fails because the current hour has already been billed.
- Design a way to extract the aggregation logic from swatch-producer-azure and reuse it for all marketplace producers.
- https://miro.com/app/board/uXjVNzq0xR4=/
- Diagrams (Miro/Mermaid/PlantUML):
- A Miro board detailing the process from tally to marketplace.
- This needs to include failure cases as well as the happy path, so that we can analyze what happens when failures occur in each service.
- Please include a diagram that can live in the documentation/source code under https://github.com/RedHatInsights/rhsm-subscriptions/tree/main/docs, in a format that doesn't need SVG generation every time it changes. The actual check-in to the source repository will happen after the design is approved.
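As a rough illustration of the "single remittance per hour for each marketplace, customer, and product metric" and "Aggregate Usages Per Hour" items above, the sketch below sums usages into hourly buckets. It is not the aggregation logic from swatch-producer-azure. The `BillableUsage` fields, `hour_bucket`, and `aggregate_per_hour` are hypothetical names, and timestamps are assumed to be timezone-aware.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal


@dataclass(frozen=True)
class BillableUsage:
    # Hypothetical shape; the real billable usage message may differ.
    marketplace: str
    customer_id: str
    product_tag: str
    metric: str
    timestamp: datetime  # assumed timezone-aware
    value: Decimal


def hour_bucket(ts):
    """Truncate a timezone-aware timestamp to the start of its UTC hour."""
    return ts.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)


def aggregate_per_hour(usages):
    """Sum usage values per (marketplace, customer, product, metric, hour)."""
    totals = defaultdict(Decimal)  # Decimal() is zero
    for u in usages:
        key = (u.marketplace, u.customer_id, u.product_tag, u.metric,
               hour_bucket(u.timestamp))
        totals[key] += u.value
    return dict(totals)
```

Each resulting key maps to one value to remit for that hour, which is the shape the one-usage-per-resource-per-hour restriction calls for. How late-arriving usage for an already-billed hour is handled is left open here, as it is in the items above.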
- is related to
- SWATCH-2307 Fine Grained PAYG Errors (Backlog)
- SWATCH-2293 PAYG Monitoring & Alerting Improvements (In Progress)
- SWATCH-2284 Billable Usage Retries & Status Tracking Improvements (In Progress)
- SWATCH-1964 Remove use of product name as an identifier. We should use product tag as an identifier instead. (Closed)
- relates to
- SWATCH-2183 Design Billable Usage Aggregation Process (Closed)