Loading...

XML

Word

Printable

Type: Bug
Resolution: Done
Priority: Undefined
Fix Version/s: clair-4.7.0
Affects Version/s: None
Component/s: clair
Labels:
- triaged

Blocked:
False
Blocked Reason:
None
Ready:
False
Git Pull Request:
https://github.com/quay/claircore/pull/911
Intelligence Requested:
Market:

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

After upgrading to Clair v4.5.0 on 10th November 2022, our DeleteManifests requests suddenly started to time out. Before this upgrade, the request usually took under 3 seconds to complete. I investigated our indexer and clair database logs and found a lot of these messages in clair's database:

2023-04-13 13:47:48 UTC ERROR:  canceling statement due to user request
2023-04-13 13:47:48 UTC CONTEXT:  SQL statement "DELETE FROM ONLY "public"."manifest_index" WHERE $1 OPERATOR(pg_catalog.=) "manifest_id""
2023-04-13 13:47:48 UTC STATEMENT:  DELETE FROM manifest WHERE hash = ANY($1::TEXT[]) RETURNING manifest.hash;
2023-04-13 13:47:48 UTC LOG:  could not send data to client: Connection reset by peer
2023-04-13 13:47:48 UTC FATAL:  connection to client lost

I suspect the problem might be related to this migration: https://github.com/quay/claircore/blob/main/datastore/postgres/migrations/indexer/05-delete-manifest-index-index.sql

In two local deployments with Clair v4.4.4 and v4.6.0, I checked the execution plan for the following query:

DELETE FROM ONLY public.manifest_index WHERE 1 OPERATOR(pg_catalog.=) manifest_id;

Query plan in Clair v4.4.4 deployment:

indexer=# EXPLAIN DELETE FROM ONLY public.manifest_index WHERE 1 OPERATOR(pg_catalog.=) manifest_id;
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Delete on manifest_index  (cost=15.01..138.39 rows=351 width=6)
   ->  Bitmap Heap Scan on manifest_index  (cost=15.01..138.39 rows=351 width=6)
         Recheck Cond: (1 = manifest_id)
         ->  Bitmap Index Scan on manifest_index_manifest_id_package_id_dist_id_repo_id_idx  (cost=0.00..14.92 rows=351 width=0)
               Index Cond: (manifest_id = 1)
(5 rows)

Query plan in Clair v4.6.0 deployment:

indexer=# EXPLAIN DELETE FROM ONLY public.manifest_index WHERE 1 OPERATOR(pg_catalog.=) manifest_id;
                               QUERY PLAN                               
------------------------------------------------------------------------
 Delete on manifest_index  (cost=0.00..306.20 rows=351 width=6)
   ->  Seq Scan on manifest_index  (cost=0.00..306.20 rows=351 width=6)
         Filter: (1 = manifest_id)
(3 rows)

When the index 'manifest_index_manifest_id_package_id_dist_id_repo_id_idx' is not present, the delete query uses a sequential scan, which would explain why the delete requests suddenly started timing out.

Assignee:: Joseph Crosland

Reporter:: Kristian Tkacik

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Created:: 2023/04/18 3:58 PM

Updated:: 2023/06/05 3:14 PM

Resolved:: 2023/06/05 3:14 PM

Details

Description

Attachments

Easy Agile Planning Poker

Activity

People

Dates