-
Bug
-
Resolution: Unresolved
-
Critical
-
None
-
None
-
None
-
False
-
-
False
-
subs-swatch-thunder
-
-
The /v1/instances/products/<PRODUCT_ID> endpoint is timing out for customers with large amounts of data, causing 500 errors and a broken user experience in the Subscriptions Usage page.
We've identified that the query on tally_instance_non_payg_view joined with instance_measurements exceeds the PostgreSQL statement timeout for organizations with significant data volumes. The gateway returns a 009 status code (NGINX client-closed connection) after 30 seconds, while the database eventually cancels the query due to statement timeout.
The problematic query is:
SELECT tinpv1_0.bucket_billing_account_id, tinpv1_0.bucket_billing_provider,
tinpv1_0.instance_id, tinpv1_0.measurement_type, tinpv1_0.product_id,
tinpv1_0.sla, tinpv1_0.usage, tinpv1_0.cores, tinpv1_0.display_name,
tinpv1_0.id, tinpv1_0.host_billing_account_id, tinpv1_0.host_billing_provider,
tinpv1_0.hypervisor_uuid, tinpv1_0.inventory_id,
tinpv1_0.last_applied_event_record_date, tinpv1_0.last_seen, tinpv1_0.metrics,
tinpv1_0.num_of_guests, tinpv1_0.org_id, tinpv1_0.sockets,
tinpv1_0.subscription_manager_id
FROM tally_instance_non_payg_view tinpv1_0
LEFT JOIN instance_measurements fm1_0 ON tinpv1_0.id = fm1_0.host_id
AND fm1_0.metric_id = ?
WHERE tinpv1_0.sockets >= ?
AND tinpv1_0.org_id = ?
AND tinpv1_0.product_id = ?
AND tinpv1_0.sla = ?
AND tinpv1_0.usage = ?
AND tinpv1_0.bucket_billing_provider = ?
AND tinpv1_0.bucket_billing_account_id = ?
AND fm1_0.value IS NOT NULL
ORDER BY tinpv1_0.last_seen DESC NULLS LAST, tinpv1_0.id
OFFSET ? ROWS FETCH FIRST ? ROWS ONLY
Data volume for affected organization:
- tally_snapshots: 890,082 records
- tally_measurements: 2,464,843 records
- hosts: 70,795 records
- host_tally_buckets: 115,800 records
The tally_instance_non_payg_view is a database view that queries multiple tables, and the combination with pagination (OFFSET/FETCH) and ordering by last_seen DESC creates performance issues for large datasets.
Acceptance Criteria
- Reproduce the issue with similar amount of data and improve this API to fix the performance issues that cause the timeout.