Uploaded image for project: 'Subscription Watch'
  1. Subscription Watch
  2. SWATCH-4390

Instances Products API times out for customers with large datasets

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Unresolved
    • Icon: Critical Critical
    • None
    • None
    • swatch-api
    • None
    • False
    • Hide

      None

      Show
      None
    • False
    • subs-swatch-thunder

      The /v1/instances/products/<PRODUCT_ID> endpoint is timing out for customers with large amounts of data, causing 500 errors and a broken user experience in the Subscriptions Usage page.

      We've identified that the query on tally_instance_non_payg_view joined with instance_measurements exceeds the PostgreSQL statement timeout for organizations with significant data volumes. The gateway returns a 009 status code (NGINX client-closed connection) after 30 seconds, while the database eventually cancels the query due to statement timeout.

      The problematic query is:

      SELECT tinpv1_0.bucket_billing_account_id, tinpv1_0.bucket_billing_provider, 
             tinpv1_0.instance_id, tinpv1_0.measurement_type, tinpv1_0.product_id, 
             tinpv1_0.sla, tinpv1_0.usage, tinpv1_0.cores, tinpv1_0.display_name, 
             tinpv1_0.id, tinpv1_0.host_billing_account_id, tinpv1_0.host_billing_provider, 
             tinpv1_0.hypervisor_uuid, tinpv1_0.inventory_id, 
             tinpv1_0.last_applied_event_record_date, tinpv1_0.last_seen, tinpv1_0.metrics, 
             tinpv1_0.num_of_guests, tinpv1_0.org_id, tinpv1_0.sockets, 
             tinpv1_0.subscription_manager_id 
      FROM tally_instance_non_payg_view tinpv1_0 
      LEFT JOIN instance_measurements fm1_0 ON tinpv1_0.id = fm1_0.host_id 
        AND fm1_0.metric_id = ? 
      WHERE tinpv1_0.sockets >= ? 
        AND tinpv1_0.org_id = ? 
        AND tinpv1_0.product_id = ? 
        AND tinpv1_0.sla = ? 
        AND tinpv1_0.usage = ? 
        AND tinpv1_0.bucket_billing_provider = ? 
        AND tinpv1_0.bucket_billing_account_id = ? 
        AND fm1_0.value IS NOT NULL 
      ORDER BY tinpv1_0.last_seen DESC NULLS LAST, tinpv1_0.id 
      OFFSET ? ROWS FETCH FIRST ? ROWS ONLY
      

      Data volume for affected organization:

      • tally_snapshots: 890,082 records
      • tally_measurements: 2,464,843 records
      • hosts: 70,795 records
      • host_tally_buckets: 115,800 records

      The tally_instance_non_payg_view is a database view that queries multiple tables, and the combination with pagination (OFFSET/FETCH) and ordering by last_seen DESC creates performance issues for large datasets.

      Acceptance Criteria

      • Reproduce the issue with similar amount of data and improve this API to fix the performance issues that cause the timeout.

              Unassigned Unassigned
              jcarvaja@redhat.com Jose Carvajal Hilario
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

                Created:
                Updated: