Task
Resolution: Done
This spike aims to confirm the accuracy of the Prometheus metrics for our metering pipeline. We need to verify that the data reported by each of the following metrics matches the corresponding data in the database:
- swatch_metrics_ingested_usage_total
  - should align with the events table
- swatch_tally_tallied_usage_total
  - should align with the tally_snapshots/tally_measurements tables
- swatch_contract_usage_total
  - should align with the contracts/contract_metrics tables
- swatch_billable_usage_total
  - should align with the billable_usage_remittance table
  - note: verify each of the statuses (failed, succeeded, pending)
- swatch_producer_metered_total
  - this is not stored in a database, but we can verify it against Splunk
  - example query:
    index=rh_rhsm namespace=rhsm-prod host="swatch-producer-aws*" success | rex "Quantity=(?<value>[0-9.]+)" | rex "Dimension=(?<dimension>\w+)" | rex "productId=(?<product_id>(\w|\-)+)" | rex "metricId=(?<metric_id>\w+)" | fields + value, dimension, product_id, metric_id | mvcombine value | eval total=sum(value) | table product_id, metric_id, dimension, total
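The extraction done by the Splunk query above could also be reproduced offline against exported log lines, which gives a second way to cross-check swatch_producer_metered_total. The following is a minimal sketch only: the regular expressions are copied from the query, but the function name `total_by_key`, the layout of the sample log lines, and the substring match on "success" are assumptions, not the producer's actual log format.

```python
import re
from collections import defaultdict

# Patterns copied from the Splunk query; the log line layout is an assumption.
QUANTITY = re.compile(r"Quantity=(?P<value>[0-9.]+)")
DIMENSION = re.compile(r"Dimension=(?P<dimension>\w+)")
PRODUCT = re.compile(r"productId=(?P<product_id>(\w|-)+)")
METRIC = re.compile(r"metricId=(?P<metric_id>\w+)")


def total_by_key(lines):
    """Sum Quantity per (product_id, metric_id, dimension) over 'success' lines."""
    totals = defaultdict(float)
    for line in lines:
        # Rough stand-in for the bare `success` search term in the query.
        if "success" not in line.lower():
            continue
        matches = [p.search(line) for p in (PRODUCT, METRIC, DIMENSION, QUANTITY)]
        if not all(matches):
            continue  # skip lines that lack any of the four fields
        product, metric, dimension, quantity = matches
        try:
            value = float(quantity.group("value"))
        except ValueError:
            continue  # e.g. "1.2.3" satisfies [0-9.]+ but is not a number
        key = (
            product.group("product_id"),
            metric.group("metric_id"),
            dimension.group("dimension"),
        )
        totals[key] += value
    return dict(totals)


# Hypothetical sample lines, for illustration only.
sample = [
    "success productId=rosa metricId=Cores Dimension=four_vcpu_hour Quantity=2.5",
    "success productId=rosa metricId=Cores Dimension=four_vcpu_hour Quantity=1.5",
    "failed productId=rosa metricId=Cores Dimension=four_vcpu_hour Quantity=9",
]
print(total_by_key(sample))
```

The per-key totals can then be compared with the metric value for the same product, metric, and time window.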
Dashboard that uses these metrics:
https://grafana.stage.devshift.net/d/aec1blbwi445cf/subscription-watch-payg-metrics?orgId=1
The goal of this spike is to gain confidence in the reliability of these metrics before implementing alerting rules based on them.
Acceptance Criteria:
- Identify methods to query the database for the underlying data corresponding to each of the listed Prometheus metrics.
- Document the process for verifying the accuracy of each metric.
- Report findings on the accuracy of the metrics and any discrepancies found.
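One way to report discrepancies is to reduce both sides to per-key totals and compare them. The sketch below assumes the metric totals and the database totals have already been collected into dictionaries keyed the same way (for example by product and metric ID). The function name `find_discrepancies`, the key shape, and the tolerance are hypothetical, and how the totals are obtained (PromQL, SQL) is outside its scope. Because Prometheus counters restart from zero when a process restarts, both sides should cover the same time window.

```python
import math


def find_discrepancies(metric_totals, db_totals, rel_tol=1e-9):
    """Return {key: (metric_value, db_value)} for every key whose totals differ.

    A key missing on one side is treated as a total of 0.0 on that side.
    """
    discrepancies = {}
    for key in sorted(set(metric_totals) | set(db_totals)):
        metric_value = metric_totals.get(key, 0.0)
        db_value = db_totals.get(key, 0.0)
        # isclose avoids flagging floating-point rounding noise as a mismatch.
        if not math.isclose(metric_value, db_value, rel_tol=rel_tol, abs_tol=1e-9):
            discrepancies[key] = (metric_value, db_value)
    return discrepancies


# Hypothetical totals, for illustration only.
metric_totals = {("rosa", "Cores"): 10.0, ("rhel", "vCPUs"): 4.0}
db_totals = {("rosa", "Cores"): 10.0, ("rhel", "vCPUs"): 5.0, ("osd", "Cores"): 1.0}

for key, (metric_value, db_value) in find_discrepancies(metric_totals, db_totals).items():
    print(key, "metric:", metric_value, "db:", db_value)
```

Treating a missing key as zero means usage present in the database but absent from the metric (or the reverse) is reported rather than silently skipped.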
- blocks
  - SWATCH-2305 Create alerts for PAYG metric discrepancies (Backlog)
  - SWATCH-3573 Spike: Test and Define PromQL Alert Queries for Metering Pipeline (Backlog)
- relates to
  - SWATCH-3633 Spike: Investigate how to fix the metrics swatch_billable_usage_total and swatch_producer_metered_total to not exclude already counted usage (Backlog)
  - SWATCH-3648 Spike: Investigate the accuracy of the PAYG metrics for alerting (Backlog)