-
Bug
-
Resolution: Done
-
Undefined
-
None
-
False
-
None
-
False
-
-
After upgrading to Clair v4.5.0 on 10th November 2022, our DeleteManifests requests suddenly started to time out. Before this upgrade, the request usually took under 3 seconds to complete. I investigated our indexer and clair database logs and found a lot of these messages in clair's database:
2023-04-13 13:47:48 UTC ERROR: canceling statement due to user request 2023-04-13 13:47:48 UTC CONTEXT: SQL statement "DELETE FROM ONLY "public"."manifest_index" WHERE $1 OPERATOR(pg_catalog.=) "manifest_id"" 2023-04-13 13:47:48 UTC STATEMENT: DELETE FROM manifest WHERE hash = ANY($1::TEXT[]) RETURNING manifest.hash; 2023-04-13 13:47:48 UTC LOG: could not send data to client: Connection reset by peer 2023-04-13 13:47:48 UTC FATAL: connection to client lost
I suspect the problem might be related to this migration: https://github.com/quay/claircore/blob/main/datastore/postgres/migrations/indexer/05-delete-manifest-index-index.sql
In two local deployments with Clair v4.4.4 and v4.6.0, I checked the execution plan for the following query:
DELETE FROM ONLY public.manifest_index WHERE 1 OPERATOR(pg_catalog.=) manifest_id;
Query plan in Clair v4.4.4 deployment:
indexer=# EXPLAIN DELETE FROM ONLY public.manifest_index WHERE 1 OPERATOR(pg_catalog.=) manifest_id;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Delete on manifest_index (cost=15.01..138.39 rows=351 width=6)
-> Bitmap Heap Scan on manifest_index (cost=15.01..138.39 rows=351 width=6)
Recheck Cond: (1 = manifest_id)
-> Bitmap Index Scan on manifest_index_manifest_id_package_id_dist_id_repo_id_idx (cost=0.00..14.92 rows=351 width=0)
Index Cond: (manifest_id = 1)
(5 rows)
Query plan in Clair v4.6.0 deployment:
indexer=# EXPLAIN DELETE FROM ONLY public.manifest_index WHERE 1 OPERATOR(pg_catalog.=) manifest_id;
QUERY PLAN
------------------------------------------------------------------------
Delete on manifest_index (cost=0.00..306.20 rows=351 width=6)
-> Seq Scan on manifest_index (cost=0.00..306.20 rows=351 width=6)
Filter: (1 = manifest_id)
(3 rows)
When the index 'manifest_index_manifest_id_package_id_dist_id_repo_id_idx' is not present, the delete query uses a sequential scan, which would explain why the delete requests suddenly started timing out.