Cost Management / COST-1013

GCP Downloader downloads empty file for previous month(s)

    • Type: Bug
    • Resolution: Done
    • Priority: Normal
    • 2021Q4
    • Data Pipeline

      Description

      When the downloader runs during the current month, it still downloads a CSV for the previous month, but the file is empty because the query scans only dates in the current month.

      In the following example, we query invoice month January (202101) over the date range Feb 7 - 10 and end up with an empty CSV.

      [2021-02-10 19:50:46,749] INFO e60e5986-d686-40ef-854b-6044f37bd1d0 Using querying for invoice_month (202101)
      [2021-02-10 19:50:47,910] INFO e60e5986-d686-40ef-854b-6044f37bd1d0 {'message': 'Local filename: 202101_389b53cf8c64902596eeb02bfcbe015e_2021-02-07:2021-02-10.csv', 'request_id': '722f61e7-76f1-4910-88ba-6a6438719b49', 'provider_uuid': '40871eab-6324-4f75-87e8-bb055aada66d', 'account': '10001'}
      [2021-02-10 19:50:47,912] INFO e60e5986-d686-40ef-854b-6044f37bd1d0 {'message': 'Downloading 202101_389b53cf8c64902596eeb02bfcbe015e_2021-02-07:2021-02-10.csv to /testing/pvc_dir/processing/acct10001/gcp/202101_389b53cf8c64902596eeb02bfcbe015e_2021-02-07:2021-02-10.csv', 'request_id': '722f61e7-76f1-4910-88ba-6a6438719b49', 'provider_uuid': '40871eab-6324-4f75-87e8-bb055aada66d', 'account': '10001'}
      [2021-02-10 19:50:48,726] INFO e60e5986-d686-40ef-854b-6044f37bd1d0 {'message': 'Returning full_file_path: /testing/pvc_dir/processing/acct10001/gcp/202101_389b53cf8c64902596eeb02bfcbe015e_2021-02-07:2021-02-10.csv', 'request_id': '722f61e7-76f1-4910-88ba-6a6438719b49', 'provider_uuid': '40871eab-6324-4f75-87e8-bb055aada66d', 'account': '10001'}
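
      The mismatch in the log above can be spotted from the local filename alone. The following is a minimal sketch, assuming the filename layout `<invoice_month>_<etag>_<start>:<end>.csv` inferred from these log lines; the helper name and the regular expression are hypothetical and not part of the downloader. A scan window that lies entirely outside the invoice month is a hint that the CSV will be empty, not proof.

```python
import re
from datetime import date

# Hypothetical pattern: the filename layout is inferred from the log lines,
# not taken from the downloader's source.
FILENAME_RE = re.compile(
    r"^(?P<invoice_month>\d{6})_(?P<etag>[0-9a-f]+)_"
    r"(?P<start>\d{4}-\d{2}-\d{2}):(?P<end>\d{4}-\d{2}-\d{2})\.csv$"
)


def scan_range_overlaps_invoice_month(filename: str) -> bool:
    """Return True if the scanned date range touches the invoice month."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognized filename: {filename}")

    year = int(match["invoice_month"][:4])
    month = int(match["invoice_month"][4:])
    month_start = date(year, month, 1)
    # First day of the following month, rolling over December to January.
    next_month_start = date(year + month // 12, month % 12 + 1, 1)

    scan_start = date.fromisoformat(match["start"])
    scan_end = date.fromisoformat(match["end"])
    return scan_start < next_month_start and scan_end >= month_start


if __name__ == "__main__":
    name = "202101_389b53cf8c64902596eeb02bfcbe015e_2021-02-07:2021-02-10.csv"
    print(scan_range_overlaps_invoice_month(name))
```

      For the filename in the log, the scan window (Feb 7 - 10) starts after the end of January, so the check reports no overlap.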
      

      Proposed Solution

      We can use the export_time column on the BigQuery table to determine whether a previous month has updated data. The etag generation method can be updated to query that column and build the etag from the result, so the etag changes only when new data is exported for that month. Because the query only takes the max of a single column, it uses the column statistics and is a very small and fast query.

      Example

      SELECT max(export_time) FROM `{table}` WHERE DATE(_PARTITIONTIME) >= '2021-01-01' AND DATE(_PARTITIONTIME) < '2021-02-01'
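
      The proposal could be sketched roughly as follows. This is an illustration under stated assumptions, not the downloader's actual code: the function names, the `invoice_month:export_time` payload, and the choice of SHA-256 are all hypothetical, and running the query against BigQuery is left to the caller.

```python
import hashlib
from datetime import date, datetime, timezone
from typing import Optional


def build_max_export_time_query(table: str, invoice_month: str) -> str:
    """Build the max(export_time) query for one invoice month (YYYYMM)."""
    year, month = int(invoice_month[:4]), int(invoice_month[4:])
    start = date(year, month, 1)
    # First day of the following month, rolling over December to January.
    end = date(year + month // 12, month % 12 + 1, 1)
    return (
        f"SELECT max(export_time) FROM `{table}` "
        f"WHERE DATE(_PARTITIONTIME) >= '{start}' "
        f"AND DATE(_PARTITIONTIME) < '{end}'"
    )


def build_etag(invoice_month: str, max_export_time: Optional[datetime]) -> Optional[str]:
    """Derive an etag from the newest export_time (hypothetical scheme).

    Returns None when the month has no exported rows, so the caller can
    skip the download instead of writing an empty CSV.
    """
    if max_export_time is None:
        return None
    payload = f"{invoice_month}:{max_export_time.isoformat()}"
    # Assumption: any stable hash works here; SHA-256 is used for illustration.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(build_max_export_time_query("{table}", "202101"))
    exported = datetime(2021, 2, 3, 12, 0, tzinfo=timezone.utc)
    print(build_etag("202101", exported))
```

      With this scheme the etag for a previous month stays the same until a later export_time appears, so an unchanged month can be skipped and a month with no rows produces no file at all.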
      

            myersco Cody Myers
            aberglun@redhat.com Andrew Berglund