  1. Subscription Watch
  2. SWATCH-2305

Create alerts for PAYG metric discrepancies


      We will set up alerts based on differences between the various states (with a low threshold - starting with 0%): 

       

      We can use "on(product, metric_id, billing_provider)" to join the meter counters.

      • Metered usage exceeds tallied usage
        • swatch_metrics_ingested_usage_total / swatch_tally_tallied_usage_total > 1.0 
        • Message: "Metered usage of {swatch_metrics_ingested_usage_total} exceeds tallied usage of {swatch_tally_tallied_usage_total} for product: {product_tag}, metric_id: {metric_id}, and billing_provider: {billing_provider}"
      • Tally usage exceeds metered usage
        • swatch_tally_tallied_usage_total / swatch_metrics_ingested_usage_total > 1.0 
        • Message: "Tallied usage of {swatch_tally_tallied_usage_total} exceeds metered usage of {swatch_metrics_ingested_usage_total} for product: {product_tag}, metric_id: {metric_id}, and billing_provider: {billing_provider}"
      • Tallied usage exceeds billable/covered usage by greater than 1%; 1% allowed due to integer rounding.
        • swatch_tally_tallied_usage_total / (swatch_contract_usage_total + swatch_billable_usage_total{status="pending"}) > 1.01 
        • Message: "Tallied usage of {swatch_tally_tallied_usage_total} exceeds billable and contract covered usage of {swatch_contract_usage_total + swatch_billable_usage_total{status="pending"}} by greater than 1% for product: {product_tag}, metric_id: {metric_id}, and billing_provider: {billing_provider}"
      • Billable and contract covered usage exceeds tallied usage by greater than 1%; 1% allowed due to integer rounding 
        • (swatch_contract_usage_total + swatch_billable_usage_total{status="pending"}) / swatch_tally_tallied_usage_total  > 1.01
        • Message: "Billable and contract covered usage of {swatch_contract_usage_total + swatch_billable_usage_total{status="pending"}} exceeds tallied usage of {swatch_tally_tallied_usage_total} by greater than 1% for product: {product_tag}, metric_id: {metric_id}, and billing_provider: {billing_provider}"
      • Billable usage exceeds remitted usage
        • swatch_billable_usage_total{status="pending"} offset 1h / swatch_producer_metered_total > 1.0 
        • Message: "Billable usage of {swatch_billable_usage_total{status="pending"} offset 1h} exceeds remitted usage of {swatch_producer_metered_total} for product: {product_tag}, metric_id: {metric_id}, and billing_provider: {billing_provider}"

      • Remitted usage exceeds billable usage
        • swatch_producer_metered_total / swatch_billable_usage_total{status="pending"} offset 1h > 1.0
        • Message: "Remitted usage of {swatch_producer_metered_total} exceeds pending billable usage of {swatch_billable_usage_total{status="pending"} offset 1h} for product: {product_tag}, metric_id: {metric_id}, and billing_provider: {billing_provider}"

      This alerting will be available in production as well as the canary test environment.

      When any of these conditions holds for more than 10 minutes, we'll trigger an alert. Note that we may need to adjust the thresholds over time to reduce false positives.
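A minimal sketch of how the first scenario (metered usage exceeds tallied usage) might look as a Prometheus alerting rule, combining the "on(product, metric_id, billing_provider)" join with the 10-minute hold. The group name, alert name, and severity value are hypothetical, and the plain Prometheus rule-file layout is an assumption; the wrapper required by App Interface may differ. Note that {{ $value }} renders the ratio rather than the two raw counter values, and that the result of an on(...) join carries only the joined labels, so the message uses "product" rather than "product_tag".

```yaml
groups:
  - name: swatch-payg-discrepancies            # hypothetical group name
    rules:
      - alert: SwatchMeteredUsageExceedsTalliedUsage   # hypothetical alert name
        # Ratio of metered to tallied usage, joined on the three shared labels.
        expr: |
          swatch_metrics_ingested_usage_total
            / on(product, metric_id, billing_provider)
          swatch_tally_tallied_usage_total
            > 1.0
        # Fire only if the condition holds for 10 minutes.
        for: 10m
        labels:
          severity: high                       # assumed value
        annotations:
          message: >-
            Metered usage exceeds tallied usage (ratio {{ $value }})
            for product: {{ $labels.product }},
            metric_id: {{ $labels.metric_id }},
            and billing_provider: {{ $labels.billing_provider }}
```

If the counters carry extra labels (for example a per-pod label), one-to-one matching may fail with duplicate series, in which case each side would need to be aggregated with "sum by (product, metric_id, billing_provider)" before dividing.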

      How to: https://inscope.corp.redhat.com/docs/default/Component/swatch-internal-docs/App%20Interface%20Prometheus%20Rules%20Basics/

       

      Note: How to deal with products that are not billable?

      Done

      • Separate alerts created for the scenarios listed
      • If any of the above conditions persists for more than 10 minutes, the alert fires
      • The alert will:
        • Send a message to the swatch-alerts Slack channel
        • Page via PagerDuty (or whatever mechanism is used to text Barnaby)
      • PromQL tests created for each alert
      • SOP created for each alert
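As an illustration of the PromQL test item, here is a sketch of a promtool unit test for one scenario. It assumes a hypothetical rule file "swatch-alerts.yaml" containing an alert named SwatchMeteredUsageExceedsTalliedUsage that divides the two counters using on(product, metric_id, billing_provider), has "for: 10m", sets a "severity: high" label, and renders the message annotation shown in the expectations. The label values in the input series are made up for the test.

```yaml
rule_files:
  - swatch-alerts.yaml                # hypothetical rule file
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Metered usage is constant at 200, tallied usage at 100, so the ratio is 2.
      - series: 'swatch_metrics_ingested_usage_total{product="example-product", metric_id="Cores", billing_provider="aws"}'
        values: '200+0x20'
      - series: 'swatch_tally_tallied_usage_total{product="example-product", metric_id="Cores", billing_provider="aws"}'
        values: '100+0x20'
    alert_rule_test:
      # Before the 10-minute hold has elapsed, nothing should be firing.
      - eval_time: 5m
        alertname: SwatchMeteredUsageExceedsTalliedUsage
        exp_alerts: []
      # After the hold, the alert should fire with the joined labels.
      - eval_time: 15m
        alertname: SwatchMeteredUsageExceedsTalliedUsage
        exp_alerts:
          - exp_labels:
              severity: high
              product: example-product
              metric_id: Cores
              billing_provider: aws
            exp_annotations:
              message: 'Metered usage exceeds tallied usage (ratio 2) for product: example-product, metric_id: Cores, and billing_provider: aws'
```

A test like this would typically be run with "promtool test rules <test-file>"; the expected annotation has to match the rendered template exactly, so it needs updating whenever the message text changes.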

              Unassigned Unassigned
              khowell@redhat.com Kevin Howell