Uploaded image for project: 'Project Quay'
  1. Project Quay
  2. PROJQUAY-5316

DeleteManifests requests timing out in Clair v4.5.0+

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Done
    • Icon: Undefined Undefined
    • clair-4.7.0
    • None
    • clair

      After upgrading to Clair v4.5.0 on 10th November 2022, our DeleteManifests requests suddenly started to time out. Before this upgrade, the request usually took under 3 seconds to complete. I investigated our indexer and clair database logs and found a lot of these messages in clair's database:

      2023-04-13 13:47:48 UTC ERROR:  canceling statement due to user request
      2023-04-13 13:47:48 UTC CONTEXT:  SQL statement "DELETE FROM ONLY "public"."manifest_index" WHERE $1 OPERATOR(pg_catalog.=) "manifest_id""
      2023-04-13 13:47:48 UTC STATEMENT:  DELETE FROM manifest WHERE hash = ANY($1::TEXT[]) RETURNING manifest.hash;
      2023-04-13 13:47:48 UTC LOG:  could not send data to client: Connection reset by peer
      2023-04-13 13:47:48 UTC FATAL:  connection to client lost
      

      I suspect the problem might be related to this migration: https://github.com/quay/claircore/blob/main/datastore/postgres/migrations/indexer/05-delete-manifest-index-index.sql

      In two local deployments with Clair v4.4.4 and v4.6.0, I checked the execution plan for the following query:

      DELETE FROM ONLY public.manifest_index WHERE 1 OPERATOR(pg_catalog.=) manifest_id;
      

      Query plan in Clair v4.4.4 deployment:

      indexer=# EXPLAIN DELETE FROM ONLY public.manifest_index WHERE 1 OPERATOR(pg_catalog.=) manifest_id;
                                                                 QUERY PLAN                                                            
      ---------------------------------------------------------------------------------------------------------------------------------
       Delete on manifest_index  (cost=15.01..138.39 rows=351 width=6)
         ->  Bitmap Heap Scan on manifest_index  (cost=15.01..138.39 rows=351 width=6)
               Recheck Cond: (1 = manifest_id)
               ->  Bitmap Index Scan on manifest_index_manifest_id_package_id_dist_id_repo_id_idx  (cost=0.00..14.92 rows=351 width=0)
                     Index Cond: (manifest_id = 1)
      (5 rows)
      

      Query plan in Clair v4.6.0 deployment:

      indexer=# EXPLAIN DELETE FROM ONLY public.manifest_index WHERE 1 OPERATOR(pg_catalog.=) manifest_id;
                                     QUERY PLAN                               
      ------------------------------------------------------------------------
       Delete on manifest_index  (cost=0.00..306.20 rows=351 width=6)
         ->  Seq Scan on manifest_index  (cost=0.00..306.20 rows=351 width=6)
               Filter: (1 = manifest_id)
      (3 rows)
      

      When the index 'manifest_index_manifest_id_package_id_dist_id_repo_id_idx' is not present, the delete query uses a sequential scan, which would explain why the delete requests suddenly started timing out.

              jcroslan@redhat.com Joseph Crosland
              ktkacik@redhat.com Kristian Tkacik
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

                Created:
                Updated:
                Resolved: