-
Task
-
Resolution: Unresolved
subs-swatch-lightning
-
-
To better understand the issue, consider this scenario:
1. We receive an event with a metric value of "100"
2. Process an aggregate usage including the above metric value of "100"
3. Then, the producer fails to submit this usage, so the record stays with status "failed"
4. We receive another event with a metric value of "20"
5. Because there is an existing "failed" usage, we add the new value to the previous one, so the total usage is now "120"
6. Then, the producer fails again to submit this usage, so the record stays with status "failed"
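The scenario above can be sketched as a small simulation. This is only an illustration under stated assumptions: `UsageRecord` and `process_event` are hypothetical names, not the actual swatch code, and the status values other than "failed" are placeholders.

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    """Hypothetical stand-in for an aggregated usage record."""
    value: int = 0
    status: str = "pending"  # placeholder initial status (assumption)


def process_event(record: UsageRecord, metric_value: int, submit_ok: bool) -> UsageRecord:
    # Steps 2 and 5: if the record previously failed, the new metric value
    # is added on top of the value that was never submitted.
    if record.status == "failed":
        record.value += metric_value
    else:
        record.value = metric_value
    # Steps 3 and 6: the producer either submits the usage or leaves it failed.
    record.status = "succeeded" if submit_ok else "failed"
    return record


record = UsageRecord()
process_event(record, 100, submit_ok=False)
first_total = record.value   # total after the first failed submission
process_event(record, 20, submit_ok=False)
second_total = record.value  # total after the second failed submission
print(first_total, second_total, record.status)
```

After both events the record still has status "failed" and carries the accumulated total rather than the latest metric value.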
What is the problem with the metrics?
- swatch_billable_usage_total
The metric "swatch_billable_usage_total" will count both the values "100" and "120", when it should count only "100" and "20".
- swatch_producer_metered_total
This only happens when the producer fails to submit the usage. The metric will count both the values "100" and "120", when it should count "100" and "20".
Note that when processing the metric "swatch_billable_usage_total", we know that the current value is "20", so we could easily fix this metric. However, the current value "20" is not available when processing "swatch_producer_metered_total", so I don't think we can fix this metric without additional changes.
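The double counting can be shown with plain arithmetic. The sketch below is a hedged illustration, not the real metric code: `count_totals`, `naive_counter`, and `delta_counter` are hypothetical names, and every submission is assumed to fail, as in the scenario. It compares a counter incremented by the accumulated total on each attempt with one incremented only by the current event value.

```python
def count_totals(events):
    """Compare two ways of incrementing a counter when every submission fails.

    events: metric values in arrival order (assumption: each submission fails,
    so the unsubmitted total keeps accumulating).
    """
    naive_counter = 0   # incremented by the accumulated total (current behaviour as described)
    delta_counter = 0   # incremented only by the current event value (expected behaviour)
    pending_total = 0   # usage that has not been submitted yet

    for current_value in events:
        pending_total += current_value
        naive_counter += pending_total
        delta_counter += current_value

    return naive_counter, delta_counter


naive, delta = count_totals([100, 20])
print(naive, delta)
```

With the values from the scenario, the naive counter ends at 100 + 120 = 220, while counting only the current values gives 100 + 20 = 120. The delta approach requires the current value at the point where the counter is incremented, which is the missing piece for "swatch_producer_metered_total".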
Acceptance Criteria
- Give ideas about how to fix these two metrics
- Reproduce the scenario using IQE component tests
  - IQE reproducer in swatch-billable-usage (swatch_billable_usage_total)
  - IQE reproducer in swatch-producer-aws (swatch_producer_metered_total)
  - IQE reproducer in swatch-producer-azure (swatch_producer_metered_total)
- Fix the metrics
- is related to: SWATCH-3571 Spike: Investigate and Verify Prometheus Metric Accuracy for Metering (Closed)
- relates to: SWATCH-3648 Spike: Investigate the accuracy of the PAYG metrics for alerting (Backlog)