Subscription Watch / SWATCH-2173

Design resilience document for handling failures in different Marketplace Billable Usage


    • Type: Task
    • Resolution: Done
    • Priority: Critical

      Context:
      In some scenarios, our system fails to transmit billable usage data to marketplaces. These failures can stem from mismatches between the values sent during remittance and the corresponding product information in the marketplace, or from restrictions imposed by individual marketplaces.

      The goal is therefore to design and implement enhancements that make the remittance process resilient to these transmission failures. The proposed solution is a mechanism that automatically resends billable usage data until the transmission succeeds, so that the information exchanged between our system and each marketplace stays accurate.
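
The resend mechanism described above could look roughly like the following. This is a minimal sketch, not the design from the linked document: `send_with_retry`, `TransientSendError`, and the parameter names are all hypothetical, and the attempt limit and exponential backoff are assumptions added so that a permanently failing record does not retry forever.

```python
import time
from typing import Callable


class TransientSendError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def send_with_retry(
    send: Callable[[dict], None],
    usage: dict,
    max_attempts: int = 5,
    base_delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call send(usage) until it succeeds or max_attempts is reached.

    Returns True on success and False when every attempt failed, so the
    caller can park the record (for example on a dead-letter queue).
    """
    for attempt in range(max_attempts):
        try:
            send(usage)
            return True
        except TransientSendError:
            if attempt < max_attempts - 1:
                # Exponential backoff: 1x, 2x, 4x, ... the base delay.
                sleep(base_delay_seconds * (2 ** attempt))
    return False
```

Returning `False` instead of raising keeps the decision about what to do with an exhausted record (alert, dead-letter, mark as failed) with the caller.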

      Design doc: https://docs.google.com/document/d/1liKSpUL1WIRO_MhUKA7OEmNCx4n8foBRmLQEvDpDz6I/edit#heading=h.11kdiw50yuu7
      KStreams POC: https://github.com/RedHatInsights/rhsm-subscriptions/pull/3012

      Done:

      • Summarized Remittance
        • Ensure a single remittance per hour for each marketplace, customer, and product metric.
        • Design should clearly outline changes in each affected service.
      • DLQ per marketplace or a single DLQ for all:
        • We can use Azure DLQ as a baseline example.
      • Failure Identification and Recovery
        • What type of alerting and dashboards should be created, and where (Splunk, Grafana, etc.)?
        • In case of failure, how do we quickly identify what we remitted and what we didn't?
        • API to reset the remittance pending value (the existing API uses a time range). This design should consider whether and how this API needs to change.
        • Types of failures we need to recover from:
          • Contract ingestion
          • Reading from Prometheus
          • Processing tallies
          • Sending to the marketplace
          • ...
      • Recalculate Remittance:
        • Since re-tallying won't be possible in the future, this design doc should consider an API for "recalculate remittance".
        • Explore alternative approaches:
          • Evaluate the feasibility of resending events.
          • Consider making necessary adjustments in related tables.
          • Explore the use of a flag to determine remittance failure and restart the process from that point.
      • Aggregate Usages Per Hour
        • AWS and Azure marketplaces only allow one usage record to be sent per resource per hour. Currently, a retried billable usage fails because the current hour has already been billed.
        • Design a way to extract the aggregation logic from swatch-producer-azure and reuse it for all marketplace producers.
        • https://miro.com/app/board/uXjVNzq0xR4=/
      • Diagrams (Miro/Mermaid/PlantUML):
        • A Miro board detailing the process from tally to marketplace.
        • This needs to include failure cases as well as the happy path so that we can analyze what happens when failures occur in each service.
        • Please include a diagram that can live in the documentation/source code in a form that doesn't need SVG generation every time, under https://github.com/RedHatInsights/rhsm-subscriptions/tree/main/docs. The actual check-in to the source repository will happen after approval of the design.
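
The "Summarized Remittance" and "Aggregate Usages Per Hour" items above both come down to collapsing usage into one value per marketplace, customer, product metric, and hour. A minimal sketch of that grouping follows. It does not reproduce the logic in swatch-producer-azure: the record fields (`marketplace`, `customer`, `product`, `metric`, `timestamp`, `value`) and the function names are hypothetical, and timestamps are assumed to be timezone-aware.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timezone-aware timestamp to the start of its UTC hour."""
    return ts.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)


def aggregate_hourly(usages):
    """Sum usage values per (marketplace, customer, product, metric, hour).

    Each element of `usages` is a dict with the hypothetical keys named in
    the tuple below plus "timestamp" and "value".
    """
    totals = defaultdict(float)
    for usage in usages:
        key = (
            usage["marketplace"],
            usage["customer"],
            usage["product"],
            usage["metric"],
            hour_bucket(usage["timestamp"]),
        )
        totals[key] += usage["value"]
    return dict(totals)
```

With one total per key, a producer would send at most one record per hour, and a retry would resend that same summarized record rather than a second partial one for an hour that may already be billed.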

              khowell@redhat.com Kevin Howell
              karshah@redhat.com Kartik Shah