Type: Spike
Resolution: Unresolved
Priority: Major
Goal
Investigate and document the technical feasibility, scalability constraints, and open product questions for the Organization Vulnerability Report feature (PROJQUAY-10556) — specifically around Clair API capacity, data architecture choices, and areas requiring PM clarification before engineering can begin.
Background
A codebase investigation was conducted across six areas: Clair integration, worker infrastructure, Redis caching, rate limiting, permission model, and async export patterns. Several technical ambiguities from PROJQUAY-10556 were resolved, but key questions remain that block architecture decisions.
Research Areas
1. Clair API scalability constraints
Quay queries Clair per-manifest via GET /matcher/api/v1/vulnerability_report/{hash} with a 30-second timeout. There is no batch API. For an org with 5,000+ images, querying Clair at summary generation time is not viable.
Key findings:
- Quay already stores indexing status in PostgreSQL (ManifestSecurityStatus table), but not the vulnerability counts or CVE details — those are fetched on-demand from Clair
- Vulnerability reports are cached in Redis with a 5-minute TTL (security_report_cache_ttl)
- The security scanner worker (workers/securityworker/securityworker.py) processes manifests in configurable batches (SECURITY_SCANNER_V4_BATCH_SIZE) with a default 30-second indexing interval
- There is no documented Clair-side rate limit, but throughput is bounded by the 30-second per-request timeout and Clair's matcher database performance
Implication: The org vulnerability summary worker must pre-aggregate and persist vulnerability counts rather than querying Clair at request time. This means summary data will have eventual consistency (staleness = worker refresh interval). PM needs to confirm this is acceptable.
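The pre-aggregation step can be sketched as a pure severity roll-up over per-manifest reports. The report shape assumed here (a `vulnerabilities` map whose entries carry a `normalized_severity` field) follows Clair v4's matcher output; the function name and severity buckets are illustrative, not existing Quay code.

```python
from collections import Counter

# Severity buckets as normalized by Clair v4 (assumption: exact casing may differ).
SEVERITIES = ["Critical", "High", "Medium", "Low", "Negligible", "Unknown"]

def aggregate_severity_counts(reports):
    """Sum per-manifest vulnerability counts into one org-level Counter.

    `reports` is an iterable of Clair vulnerability_report payloads, each
    containing a "vulnerabilities" map keyed by vulnerability ID.
    """
    totals = Counter({s: 0 for s in SEVERITIES})
    for report in reports:
        for vuln in report.get("vulnerabilities", {}).values():
            severity = vuln.get("normalized_severity") or "Unknown"
            totals[severity if severity in SEVERITIES else "Unknown"] += 1
    return totals
```

The worker would run this over cached (or freshly fetched) reports and persist the result, so the summary endpoint never touches Clair directly.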
2. Data architecture decision
Two approaches were identified:
(a) Periodic snapshot (recommended): A background worker iterates all manifests in an org, fetches vulnerability reports (from cache or Clair), aggregates counts by severity, and stores results in a new PostgreSQL table. Summaries are served from this table with Redis caching on top. This follows the existing pull-statistics pattern (workers/pullstatsredisflushworker.py).
(b) On-demand aggregation: Query Clair for every manifest at summary request time. Real-time data, but scales poorly: querying sequentially, a 5,000-image org at the worst-case 30 seconds per Clair call would take ~42 hours.
Open question for PM: Is periodic snapshot freshness (e.g., refreshed every few hours or nightly) acceptable, or do customers expect real-time data?
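Approach (a) can be sketched as below. All names are illustrative assumptions: `fetch_report(digest)` stands in for the cache-or-Clair lookup, and `snapshot_table` for the proposed new PostgreSQL table keyed by org.

```python
from collections import Counter
from datetime import datetime, timezone

def refresh_org_snapshot(org, manifest_digests, fetch_report, snapshot_table):
    """Aggregate severity counts for one org and upsert a snapshot row (sketch).

    `fetch_report` returns a Clair-style vulnerability_report dict, or None for
    manifests that failed indexing, are unsupported, or were never scanned.
    """
    totals = Counter()
    scanned = unscanned = 0
    for digest in manifest_digests:
        report = fetch_report(digest)
        if report is None:
            unscanned += 1
            continue
        scanned += 1
        for vuln in report.get("vulnerabilities", {}).values():
            totals[vuln.get("normalized_severity", "Unknown")] += 1
    snapshot_table[org] = {
        "counts": dict(totals),
        "scanned_manifests": scanned,
        "unscanned_manifests": unscanned,
        "refreshed_at": datetime.now(timezone.utc),
    }
    return snapshot_table[org]
```

Tracking `unscanned_manifests` explicitly keeps the image-scope question (area 3 below) visible in the data model rather than silently dropping those manifests.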
3. Image scope ambiguity
The feature spec does not define which images are included in the report:
- All tagged manifests across all repos? Or only the latest tag per repo?
- Manifests with IndexStatus.FAILED or MANIFEST_UNSUPPORTED — shown as "unknown" or excluded?
- Unscanned images (security scanning disabled, Clair unavailable) — how represented?
This directly impacts the data model design and worker logic.
4. Permission model gap
The feature proposes summary = org:view, export = org:admin. However, Quay has no org:view scope. The closest equivalent is OrganizationMemberPermission (any team member: admin, creator, or member).
Open question for PM: Is "any org member can view aggregated vulnerability data across all repos" acceptable from a security standpoint? Aggregated data may be more sensitive than individual image scans.
5. Export callback mechanism
The existing log export system (workers/exportactionlogsworker.py) supports callback_url (webhook) and callback_email on completion. The PROJQUAY-10556 spec only mentions status polling.
Open question for PM: Should vulnerability exports support webhook/email callbacks? For CI/CD automation (Ford's use case), webhooks are significantly more practical than polling.
6. JSON export schema format
The spec says "structured format with nested objects" but does not define a schema. If customers feed this into SIEMs or security tooling, a standard format (e.g., CycloneDX SBOM, CSAF) would reduce integration effort.
Open question for PM: Quay-specific schema or industry standard?
7. Application-level rate limiting gap
Quay's rate limiting is nginx-based (three tiers: 5/50/60 r/s). The /api/ path inherits the dynamicauth_heavy zone (5 r/s). However, for per-org rate limits on summary refresh and export requests (Q4/Q5 in PROJQUAY-10556), there is no reusable application-level rate limiting abstraction in the codebase. One would need to be built, likely Redis-backed with per-org keys.
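A minimal sketch of what such an abstraction might look like: a fixed-window limiter with per-org keys, assuming a redis-py-style client (`incr`/`expire`). The class, key format, and the in-memory stand-in are all hypothetical, not existing Quay code.

```python
class OrgRateLimiter:
    """Fixed-window rate limiter with per-org Redis keys (illustrative sketch)."""

    def __init__(self, redis_client, limit, window_seconds):
        self.redis = redis_client
        self.limit = limit
        self.window = window_seconds

    def allow(self, org, action):
        key = f"ratelimit:{action}:{org}"
        count = self.redis.incr(key)             # atomic per-key counter
        if count == 1:
            self.redis.expire(key, self.window)  # start the window on first hit
        return count <= self.limit

# Tiny in-memory stand-in for redis.Redis, so the sketch runs without a server.
class FakeRedis:
    def __init__(self):
        self.data = {}
    def incr(self, key):
        self.data[key] = self.data.get(key, 0) + 1
        return self.data[key]
    def expire(self, key, seconds):
        pass  # TTL handling elided in this stub
```

A fixed window is the simplest option; if burst smoothing matters, a sliding-window or token-bucket variant over the same per-org keys would be the natural refinement.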
What the codebase investigation resolved
The following ambiguities from PROJQUAY-10556 are no longer blockers — existing patterns can be reused:
| Area | Resolution | Existing pattern to follow |
|---|---|---|
| Worker infrastructure | Use QueueWorker for exports, interval-based Worker for summary regeneration | ExportActionLogsWorker, SecurityWorker |
| Async export system | Chunked upload to distributed storage, pre-signed download URLs, adaptive batching | ExportActionLogsWorker (workers/exportactionlogsworker.py) |
| Cache architecture | Pluggable cache abstraction with per-key-type TTLs; add new org_vulnerability_summary_cache_ttl | data/cache/cache_key.py, data/cache/impl.py |
| File retention / download URL expiry | Use storage-level TTL and get_direct_download_url(expires_in=...) | Log export system |
| Permission enforcement | OrganizationMemberPermission for read, AdministerOrganizationPermission for export | endpoints/api/organization.py |
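Following the pluggable cache pattern referenced in the table, the new TTL could be wired in roughly as below. The `(key, expiration)` shape mirrors our reading of data/cache/cache_key.py; the helper name, key format, and default TTL are proposals, not existing code.

```python
from collections import namedtuple

# Assumed to mirror the cache-key shape in Quay's data/cache/cache_key.py.
CacheKey = namedtuple("CacheKey", ["key", "expiration"])

def for_org_vulnerability_summary(orgname, cache_config):
    """Proposed cache key for an org's vulnerability summary.

    `org_vulnerability_summary_cache_ttl` is the new config value suggested
    above; "1h" is an illustrative default, to be set well below the worker
    refresh interval so the cache never outlives a snapshot.
    """
    ttl = cache_config.get("org_vulnerability_summary_cache_ttl", "1h")
    return CacheKey(f"org_vulnerability_summary__{orgname}", ttl)
```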
Deliverables
- Written summary of findings (this ticket)
- Comment posted on PROJQUAY-10556 with PM clarification questions
- Once PM answers are received: input to PROJQUAY-10556 acceptance criteria and architecture design
Relates to: PROJQUAY-10556 Organization Vulnerability Report (status: New)