Story
Resolution: Unresolved
Critical
We will set up alerts on differences between the usage totals in each state, using low thresholds (starting at 0%):
- swatch_metrics_ingested_usage_total / swatch_tally_tallied_usage_total > 1.0 (metered usage exceeds tallied usage)
- swatch_tally_tallied_usage_total / (swatch_contract_usage_total + swatch_billable_usage_total{status="pending"}) > 1.01 (tallied usage exceeds billable/covered usage by more than 1%; 1% is allowed for integer rounding)
- swatch_billable_usage_total{status="pending"} offset 1h / swatch_producer_metered_total > 1.0 (billable usage from one hour earlier exceeds remitted usage)
These alerts will be available in production as well as in the canary test environment.
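As a rough illustration of the three comparisons listed above, the following Python sketch evaluates each ratio against its threshold for a single snapshot of totals. The `UsageTotals` class, its field names, the condition names, and the zero-denominator handling are all assumptions made for the example; the real conditions would be written as PromQL alert expressions, not application code.

```python
from dataclasses import dataclass


@dataclass
class UsageTotals:
    """Hypothetical snapshot of the totals behind the swatch_* metrics."""
    ingested: float                 # swatch_metrics_ingested_usage_total
    tallied: float                  # swatch_tally_tallied_usage_total
    contract: float                 # swatch_contract_usage_total
    billable_pending: float         # swatch_billable_usage_total{status="pending"}
    billable_pending_1h_ago: float  # same metric, offset 1h
    remitted: float                 # swatch_producer_metered_total


def ratio(numerator: float, denominator: float) -> float:
    """Divide, treating 0/0 as balanced (1.0) and x/0 as infinitely over."""
    if denominator == 0:
        return 1.0 if numerator == 0 else float("inf")
    return numerator / denominator


def check_ratios(t: UsageTotals) -> list[str]:
    """Return the names of the conditions that exceed their threshold."""
    checks = [
        ("ingested_exceeds_tallied",
         ratio(t.ingested, t.tallied), 1.0),
        ("tallied_exceeds_covered",
         ratio(t.tallied, t.contract + t.billable_pending), 1.01),
        ("billable_exceeds_remitted",
         ratio(t.billable_pending_1h_ago, t.remitted), 1.0),
    ]
    return [name for name, value, threshold in checks if value > threshold]
```

With totals that agree at every step, `check_ratios` returns an empty list; any name it returns corresponds to one of the three bullets above.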
Refinement:
- failed status alert?
- retriable?
- components?
- QE - write some promrules tests?
When any of the alert conditions above holds for more than 10 minutes, we'll trigger an alert. Note that we may need to adjust the thresholds over time to reduce false positives.
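The 10-minute hold can be pictured with a small Python sketch that fires only once a condition has been continuously true for longer than the hold period. The `SustainedCondition` class and its method names are hypothetical; in a Prometheus alerting rule this behaviour would normally come from the rule's `for` field, so the code only illustrates the intent.

```python
from datetime import datetime, timedelta
from typing import Optional


class SustainedCondition:
    """Fires only after a condition has held continuously for longer than `hold`."""

    def __init__(self, hold: timedelta = timedelta(minutes=10)) -> None:
        self.hold = hold
        self.breach_started: Optional[datetime] = None

    def observe(self, breached: bool, now: datetime) -> bool:
        """Record one evaluation; return True if the alert should fire."""
        if not breached:
            # Any recovery resets the clock, so brief spikes never fire.
            self.breach_started = None
            return False
        if self.breach_started is None:
            self.breach_started = now
        return now - self.breach_started > self.hold
```

Because a single healthy evaluation resets the timer, short-lived discrepancies (for example, a tally that has not yet caught up with ingestion) would not page anyone, which is the point of the hold.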
Is blocked by:
- SWATCH-2300 Add a metric for usage covered by a contract (In Progress)
- SWATCH-2301 Add a metric for usage considered billable (In Progress)
- SWATCH-2302 Add a metric to swatch-producer-aws for metered usage (Review)
- SWATCH-2303 Add a metric to swatch-producer-azure for metered usage (Review)
- SWATCH-2297 Add a metric for ingested usage data from Prometheus (Release Pending)
- SWATCH-2299 Add a metric for usage tallied (Release Pending)
- SWATCH-2304 Add a metric for usage aggregated per hour (Release Pending)