Bug
Resolution: Unresolved
clair-4.8.0
This came from the IBM Cloud Container Registry team:
We’ve recently enabled the CVSS updater in our development config so that we can provide CVSS data in vulnerability reports, but we’re running into some odd PostgreSQL 16 behaviour when the [getEnrichments SQL query|https://github.com/quay/claircore/blob/v1.5.32/datastore/postgres/enrichment.go#L209-L233] runs during vulnerability report generation. We’re hoping that something here rings a bell with you, or that you have insight into PostgreSQL internals that we are lacking; we’re very much up against the limit of our PostgreSQL knowledge, so any theories or pointers you could share would be incredibly valuable to us.
What we’re seeing is occasional, very high latency for this getEnrichments query. We have continuous tests that run hourly, and they usually complete with acceptable response times. However, the database sometimes enters a state, lasting hours or days, in which our end-to-end requests suddenly take much, much longer.
We’ve looked at the Grafana heatmap for the claircore_vulnstore_getenrichments_duration_seconds metric: in the good case this query runs in around 1-2 milliseconds. In the bad case, though, this single query regularly takes around 300 milliseconds, and sometimes 2-3 seconds.
The badly-performing cases aren’t evenly distributed: when the database has got itself into a particular internal state, we get sustained periods of hours (or sometimes days) during which all such queries take 300-3,000 milliseconds instead of 1-2. Needless to say, this kills our performance, and we can’t release this to production until we nail down the root cause and a mitigation.
One pattern we did see: after an initial period of days of solidly bad performance when we first enabled the feature, things “settled down”, and we then saw bad periods only once per day, always at the same time: the hour following 03:00 US Eastern, which is when NVD releases its CVSS updates (see e.g. the timestamp in this 2025 metadata file: https://nvd.nist.gov/feeds/json/cve/1.1/nvdcve-1.1-2025.meta).
So it appears to be connected to a change of state in the database when CVSS updates land. The tables relevant to this query change as follows:
- The enrichment table gets a few updates for new or changed CVSS data
- The uo_enrich join table gets ~150,000 new rows written (one per CVE’s CVSS record, for this update operation), and the ~150,000 oldest rows are deleted shortly afterwards, per the updateRetention configuration setting (one way to observe this churn directly is sketched below).
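For anyone reproducing this: the insert/delete churn and the analyze timestamps can be watched from PostgreSQL’s standard statistics views. A minimal sketch (only the table name is Clair-specific):
{code:sql}
-- Watch cumulative insert/delete counters and the last (auto)analyze times
-- for the join table around the 03:00 US Eastern update window.
SELECT relname,
       n_tup_ins,        -- rows inserted since stats reset
       n_tup_del,        -- rows deleted since stats reset
       n_live_tup,       -- planner's estimate of live rows
       n_dead_tup,       -- dead rows awaiting vacuum
       last_analyze,
       last_autoanalyze,
       last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'uo_enrich';
{code}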
By running the getEnrichments SQL query directly against the database with EXPLAIN ANALYZE during these periods, we’ve narrowed the stark change in performance down to the plan that PostgreSQL chooses to execute it.
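For reference, a sketch of how we capture the plans. The query shape is condensed from the linked claircore source, so treat it as approximate; the updater name and tag array are placeholder example values, not our real inputs:
{code:sql}
-- Capture the chosen plan with actual timings and buffer usage.
-- Condensed from claircore's getEnrichments; placeholder parameter values.
EXPLAIN (ANALYZE, BUFFERS)
WITH latest AS (
    SELECT id
    FROM update_operation
    WHERE updater = 'clair.cvss'   -- placeholder updater name
    ORDER BY id USING >
    LIMIT 1
)
SELECT e.tags, e.data
FROM enrichment AS e
JOIN uo_enrich AS uo ON uo.enrich = e.id
JOIN latest ON uo.uo = latest.id
WHERE e.tags && ARRAY['CVE-2025-12345']::text[];  -- placeholder tag
{code}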
In the good case, the attached plan is executed. The pertinent thing to note is the Index Only Scan using uo_enrich_pkey on public.uo_enrich uo, which correctly fetches the single matching row straight from the join table’s primary key index using the primary key IDs.
Attached good_explain_analyze.txt
In the bad case, the attached plan is executed instead. Note that the Index Only Scan over the primary key index is not used; instead there is a Seq Scan on public.uo_enrich uo that reads back ~150,000 rows from the join table, which are only later filtered down to the one row actually needed.
Attached bad_explain_analyze.txt
We can reproduce this consistently whenever the database is in a “bad period”, which, again, can last for multiple days before righting itself. During such a period, PostgreSQL’s query planner is enumerating the possible plans for the query, assigning cost estimates to each, rejecting the plan with the index scan (which we know would be efficient!), and instead choosing the plan with the sequential table scan, which is very inefficient. That indicates that the knowledge the planner uses to estimate plan costs (the table statistics exposed through pg_stats) is wrong or unrepresentative in a way that leads it to make wildly incorrect guesses about the cost of particular plans.
Those internal table statistics in pg_stats are recomputed by ANALYZE / autoanalyze. So it seems there may be a window in which autoanalyze runs and records a set of statistics unrepresentative of what’s actually in the table, leading the query planner down a bad path; or there is some other mechanism producing misleading statistics.
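A sketch of how the recorded statistics can be compared against the table’s real contents, using the standard catalog views (we’re assuming uo_enrich’s uo column is the one the planner mis-estimates; the column name is taken from the claircore schema):
{code:sql}
-- What the planner currently believes about uo_enrich's columns...
SELECT attname, n_distinct, null_frac,
       most_common_vals, most_common_freqs
FROM pg_stats
WHERE schemaname = 'public' AND tablename = 'uo_enrich';

-- ...versus the actual distribution of update operations in the table.
SELECT uo, count(*)
FROM uo_enrich
GROUP BY uo
ORDER BY uo;
{code}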
The state change coincides with the CVSS updates: the uo_enrich join table gets a transactional insert of ~150,000 new rows, and shortly afterwards the oldest ~150,000 rows are deleted according to Clair’s updateRetention setting. One theory is that an autoanalyze lands inside this churn window, so the statistics it records are unrepresentative of the true state of the table the rest of the time (we have not been able to prove this so far).
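If that theory holds, a manual ANALYZE issued during a bad period should immediately snap the planner back to the index plan, which would be a cheap experiment. A sketch; the reloption values below are assumptions to try, not settings we have validated:
{code:sql}
-- Experiment 1: re-sample statistics by hand during a bad period, then
-- re-run the EXPLAIN above; if the plan flips back to the index scan,
-- stale or mistimed statistics are implicated.
ANALYZE VERBOSE uo_enrich;

-- Experiment 2 (assumed values): lower the per-table thresholds so that
-- autoanalyze fires promptly after the post-update delete finishes,
-- rather than in the middle of the churn window.
ALTER TABLE uo_enrich SET (
    autovacuum_analyze_scale_factor = 0.02,
    autovacuum_vacuum_scale_factor  = 0.02
);
{code}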
Another possible theory concerns the size of our uo_enrich table. Clair’s default updateRetention setting is 10, but we have it set to 2, so we hold only the latest copy plus 2 older copies of update operations. In practice this means our uo_enrich table is smaller than in a default installation, but the daily insert/delete churn is a larger fraction of it: roughly one-third of the table’s rows rotate every day rather than the typical one-tenth. We wonder whether this ties in somehow with the bad statistics and the planner’s choice of scan type; for example, if the planner believes a search will retrieve a large enough fraction of a table (very roughly, more than ~70% of it), it will choose a sequential scan over an index scan on the grounds that it should be cheaper. It’s hard to tie this together, but it seems like a possible factor worth mentioning.
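One experiment that could separate “bad cost estimates” from “genuinely cheaper seq scan” during a bad period is to disable the sequential-scan path for a single session and compare the estimated and actual costs of the two plans. A sketch (session-local and diagnostic only, not a fix):
{code:sql}
-- Session-local: heavily penalise sequential scans so the planner shows
-- the index plan it is rejecting, along with the cost it assigned to it.
SET enable_seqscan = off;
-- ...re-run the EXPLAIN (ANALYZE, BUFFERS) query from the sketch above...
RESET enable_seqscan;
{code}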
Apologies for the large dump of information! Ultimately, some internal state is causing the query planner to make tremendously bad cost estimates for the candidate plans, leading it to choose an inefficient one (until that internal state changes again, at which point it consistently chooses the correct, efficient plan via the primary key index on the join table). We’d be very grateful for any thoughts you might have, or suggestions of tests or experiments we could try to pinpoint the behaviour.