Type: Spike
Resolution: Unresolved
Priority: Major
Goal
Investigate and document the technical feasibility, scalability constraints, and open product questions for the Organization Vulnerability Report feature (PROJQUAY-10556) — specifically around Clair API capacity, data architecture choices, and areas requiring PM clarification before engineering can begin.
Background
A codebase investigation was conducted across six areas: Clair integration, worker infrastructure, Redis caching, rate limiting, permission model, and async export patterns. Several technical ambiguities from PROJQUAY-10556 were resolved, but key questions remain that block architecture decisions.
Research Areas
1. Clair API scalability constraints
Quay queries Clair per-manifest via GET /matcher/api/v1/vulnerability_report/{hash} with a 30-second timeout. There is no batch API. For an org with 5,000+ images, querying Clair at summary generation time is not viable.
Key findings:
- Quay already stores indexing status in PostgreSQL (ManifestSecurityStatus table), but not the vulnerability counts or CVE details — those are fetched on-demand from Clair
- Vulnerability reports are cached in Redis with a 5-minute TTL (security_report_cache_ttl)
- The security scanner worker (workers/securityworker/securityworker.py) processes manifests in configurable batches (SECURITY_SCANNER_V4_BATCH_SIZE) with a default 30-second indexing interval
- There is no documented Clair-side rate limit, but throughput is bounded by the 30-second per-request timeout and Clair's matcher database performance
Implication: The org vulnerability summary worker must pre-aggregate and persist vulnerability counts rather than querying Clair at request time. This means summary data will have eventual consistency (staleness = worker refresh interval). PM needs to confirm this is acceptable.
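The pre-aggregation step can be sketched as a pure severity roll-up over per-manifest reports. The report shape assumed here (a `vulnerabilities` map whose entries carry a `normalized_severity` field) follows Clair v4's matcher output; the function name and severity buckets are illustrative, not existing Quay code.

```python
from collections import Counter

# Severity buckets as normalized by Clair v4 (assumption: exact casing may differ).
SEVERITIES = ["Critical", "High", "Medium", "Low", "Negligible", "Unknown"]

def aggregate_severity_counts(reports):
    """Sum per-manifest vulnerability counts into one org-level Counter.

    `reports` is an iterable of Clair vulnerability_report payloads, each
    containing a "vulnerabilities" map keyed by vulnerability ID.
    """
    totals = Counter({s: 0 for s in SEVERITIES})
    for report in reports:
        for vuln in report.get("vulnerabilities", {}).values():
            severity = vuln.get("normalized_severity") or "Unknown"
            totals[severity if severity in SEVERITIES else "Unknown"] += 1
    return totals
```

The worker would run this over cached (or freshly fetched) reports and persist the result, so the summary endpoint never touches Clair directly.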
2. Data architecture decision
Two approaches were identified:
(a) Periodic snapshot (recommended): A background worker iterates all manifests in an org, fetches vulnerability reports (from cache or Clair), aggregates counts by severity, and stores results in a new PostgreSQL table. Summaries are served from this table with Redis caching on top. This follows the existing pull-statistics pattern (workers/pullstatsredisflushworker.py).
(b) On-demand aggregation: Query Clair for every manifest at summary request time. Real-time data, but scales poorly: querying sequentially, a 5,000-image org at the worst-case 30 seconds per Clair call would take ~42 hours.
Open question for PM: Is periodic snapshot freshness (e.g., refreshed every few hours or nightly) acceptable, or do customers expect real-time data?
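Approach (a) can be sketched as below. All names are illustrative assumptions: `fetch_report(digest)` stands in for the cache-or-Clair lookup, and `snapshot_table` for the proposed new PostgreSQL table keyed by org.

```python
from collections import Counter
from datetime import datetime, timezone

def refresh_org_snapshot(org, manifest_digests, fetch_report, snapshot_table):
    """Aggregate severity counts for one org and upsert a snapshot row (sketch).

    `fetch_report` returns a Clair-style vulnerability_report dict, or None for
    manifests that failed indexing, are unsupported, or were never scanned.
    """
    totals = Counter()
    scanned = unscanned = 0
    for digest in manifest_digests:
        report = fetch_report(digest)
        if report is None:
            unscanned += 1
            continue
        scanned += 1
        for vuln in report.get("vulnerabilities", {}).values():
            totals[vuln.get("normalized_severity", "Unknown")] += 1
    snapshot_table[org] = {
        "counts": dict(totals),
        "scanned_manifests": scanned,
        "unscanned_manifests": unscanned,
        "refreshed_at": datetime.now(timezone.utc),
    }
    return snapshot_table[org]
```

Tracking `unscanned_manifests` explicitly keeps the image-scope question (area 3 below) visible in the data model rather than silently dropping those manifests.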
3. Image scope ambiguity
The feature spec does not define which images are included in the report:
- All tagged manifests across all repos? Or only the latest tag per repo?
- Manifests with IndexStatus.FAILED or MANIFEST_UNSUPPORTED — shown as "unknown" or excluded?
- Unscanned images (security scanning disabled, Clair unavailable) — how represented?
This directly impacts the data model design and worker logic.
4. Permission model gap
The feature proposes summary = org:view, export = org:admin. However, Quay has no org:view scope. The closest equivalent is OrganizationMemberPermission (any team member: admin, creator, or member).
Open question for PM: Is "any org member can view aggregated vulnerability data across all repos" acceptable from a security standpoint? Aggregated data may be more sensitive than individual image scans.
5. Export callback mechanism
The existing log export system (workers/exportactionlogsworker.py) supports callback_url (webhook) and callback_email on completion. The PROJQUAY-10556 spec only mentions status polling.
Open question for PM: Should vulnerability exports support webhook/email callbacks? For CI/CD automation (Ford's use case), webhooks are significantly more practical than polling.
6. JSON export schema format
The spec says "structured format with nested objects" but does not define a schema. If customers feed this into SIEMs or security tooling, a standard format (e.g., CycloneDX SBOM, CSAF) would reduce integration effort.
Open question for PM: Quay-specific schema or industry standard?
7. Application-level rate limiting gap
Quay's rate limiting is nginx-based (three tiers: 5/50/60 r/s). The /api/ path inherits the dynamicauth_heavy zone (5 r/s). However, for per-org rate limits on summary refresh and export requests (Q4/Q5 in PROJQUAY-10556), there is no reusable application-level rate limiting abstraction in the codebase. One would need to be built, likely Redis-backed with per-org keys.
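A minimal sketch of what such an abstraction might look like: a fixed-window limiter with per-org keys, assuming a redis-py-style client (`incr`/`expire`). The class, key format, and the in-memory stand-in are all hypothetical, not existing Quay code.

```python
class OrgRateLimiter:
    """Fixed-window rate limiter with per-org Redis keys (illustrative sketch)."""

    def __init__(self, redis_client, limit, window_seconds):
        self.redis = redis_client
        self.limit = limit
        self.window = window_seconds

    def allow(self, org, action):
        key = f"ratelimit:{action}:{org}"
        count = self.redis.incr(key)             # atomic per-key counter
        if count == 1:
            self.redis.expire(key, self.window)  # start the window on first hit
        return count <= self.limit

# Tiny in-memory stand-in for redis.Redis, so the sketch runs without a server.
class FakeRedis:
    def __init__(self):
        self.data = {}
    def incr(self, key):
        self.data[key] = self.data.get(key, 0) + 1
        return self.data[key]
    def expire(self, key, seconds):
        pass  # TTL handling elided in this stub
```

A fixed window is the simplest option; if burst smoothing matters, a sliding-window or token-bucket variant over the same per-org keys would be the natural refinement.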
What the codebase investigation resolved
The following ambiguities from PROJQUAY-10556 are no longer blockers — existing patterns can be reused:
| Area | Resolution | Existing pattern to follow |
|---|---|---|
| Worker infrastructure | Use QueueWorker for exports, interval-based Worker for summary regeneration | ExportActionLogsWorker, SecurityWorker |
| Async export system | Chunked upload to distributed storage, pre-signed download URLs, adaptive batching | ExportActionLogsWorker (workers/exportactionlogsworker.py) |
| Cache architecture | Pluggable cache abstraction with per-key-type TTLs; add new org_vulnerability_summary_cache_ttl | data/cache/cache_key.py, data/cache/impl.py |
| File retention / download URL expiry | Use storage-level TTL and get_direct_download_url(expires_in=...) | Log export system |
| Permission enforcement | OrganizationMemberPermission for read, AdministerOrganizationPermission for export | endpoints/api/organization.py |
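Following the pluggable cache pattern referenced in the table, the new TTL could be wired in roughly as below. The `(key, expiration)` shape mirrors our reading of data/cache/cache_key.py; the helper name, key format, and default TTL are proposals, not existing code.

```python
from collections import namedtuple

# Assumed to mirror the cache-key shape in Quay's data/cache/cache_key.py.
CacheKey = namedtuple("CacheKey", ["key", "expiration"])

def for_org_vulnerability_summary(orgname, cache_config):
    """Proposed cache key for an org's vulnerability summary.

    `org_vulnerability_summary_cache_ttl` is the new config value suggested
    above; "1h" is an illustrative default, to be set well below the worker
    refresh interval so the cache never outlives a snapshot.
    """
    ttl = cache_config.get("org_vulnerability_summary_cache_ttl", "1h")
    return CacheKey(f"org_vulnerability_summary__{orgname}", ttl)
```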
Deliverables
- Written summary of findings (this ticket)
- Comment posted on PROJQUAY-10556 with PM clarification questions
- Once PM answers are received: input to PROJQUAY-10556 acceptance criteria and architecture design
Relates to: PROJQUAY-10556 Organization Vulnerability Report (status: New)