-
Bug
-
Resolution: Unresolved
-
Critical
-
None
-
None
-
None
-
False
-
-
False
-
subs-swatch-thunder
-
-
The /v1/tally/products/<PRODUCT_ID>/<METRIC_ID> endpoint is timing out for customers with large amounts of data, causing 500 errors and degraded user experience.
We've identified that the query joining tally_snapshots with tally_measurements exceeds the PostgreSQL statement timeout for organizations with significant data volumes (e.g., with ~890K tally_snapshots records). The gateway returns a 009 status code (NGINX client-closed connection) after 30 seconds, while the database eventually cancels the query due to statement timeout.
The problematic query is:
SELECT DISTINCT ts1_0.id, ts1_0.billing_account_id, ts1_0.billing_provider,
ts1_0.granularity, ts1_0.org_id, ts1_0.product_id, ts1_0.sla,
ts1_0.snapshot_date, tm1_0.snapshot_id, tm1_0.measurement_type,
tm1_0.metric_id, tm1_0.value, ts1_0.usage
FROM tally_snapshots ts1_0
LEFT JOIN tally_measurements tm1_0 ON ts1_0.id = tm1_0.snapshot_id
WHERE ts1_0.org_id = ?
AND ts1_0.product_id = ?
AND ts1_0.granularity = ?
AND ts1_0.sla = ?
AND ts1_0.usage = ?
AND ts1_0.billing_provider = ?
AND ts1_0.billing_account_id = ?
AND ts1_0.snapshot_date BETWEEN ? AND ?
ORDER BY ts1_0.snapshot_date
Data volume for affected organization:
- tally_snapshots: 890,082 records
- tally_measurements: 2,464,843 records
- hosts: 70,795 records
- host_tally_buckets: 115,800 records
Acceptance Criteria
- Reproduce the issue with similar amount of data and improve this API to fix the performance issues that cause the timeout.