Project Quay / PROJQUAY-10702

Spike: Organization Vulnerability Report — data architecture, Clair scalability, and open product questions


    • Type: Spike
    • Resolution: Unresolved
    • Priority: Major

      Goal

      Investigate and document the technical feasibility, scalability constraints, and open product questions for the Organization Vulnerability Report feature (PROJQUAY-10556) — specifically around Clair API capacity, data architecture choices, and areas requiring PM clarification before engineering can begin.

      Background

      A codebase investigation was conducted across six areas: Clair integration, worker infrastructure, Redis caching, rate limiting, permission model, and async export patterns. Several technical ambiguities from PROJQUAY-10556 were resolved, but key questions remain that block architecture decisions.

      Research Areas

      1. Clair API scalability constraints

      Quay queries Clair per-manifest via GET /matcher/api/v1/vulnerability_report/{hash} with a 30-second timeout. There is no batch API. For an org with 5,000+ images, querying Clair at summary generation time is not viable.

      Key findings:

      • Quay already stores indexing status in PostgreSQL (ManifestSecurityStatus table), but not the vulnerability counts or CVE details — those are fetched on-demand from Clair
      • Vulnerability reports are cached in Redis with a 5-minute TTL (security_report_cache_ttl)
      • The security scanner worker (workers/securityworker/securityworker.py) processes manifests in configurable batches (SECURITY_SCANNER_V4_BATCH_SIZE) with a default 30-second indexing interval
      • There is no documented Clair-side rate limit, but throughput is bounded by the 30-second per-request timeout and Clair's matcher database performance

      Implication: The org vulnerability summary worker must pre-aggregate and persist vulnerability counts rather than querying Clair at request time. Summary data will therefore be eventually consistent, with staleness of up to one worker refresh interval. PM needs to confirm this is acceptable.
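To make the pre-aggregation step concrete, here is a minimal sketch of summing vulnerabilities by severity across per-manifest reports. It assumes each report is a dict shaped like a Clair v4 vulnerability report, with a "vulnerabilities" map whose entries carry a "normalized_severity" field. The function name, the severity list, and the choice to count per manifest without de-duplicating CVEs are illustrative assumptions, not existing Quay code.

```python
# Hypothetical severity buckets; the real set should follow whatever Quay/Clair report.
SEVERITIES = ("Critical", "High", "Medium", "Low", "Negligible", "Unknown")


def aggregate_severity_counts(reports):
    """Sum vulnerability counts by severity across per-manifest reports.

    Assumes each report looks like {"vulnerabilities": {id: {"normalized_severity": ...}}}.
    Unrecognized or missing severities are counted as "Unknown".
    """
    totals = {severity: 0 for severity in SEVERITIES}
    for report in reports:
        for vuln in report.get("vulnerabilities", {}).values():
            severity = vuln.get("normalized_severity", "Unknown")
            if severity not in totals:
                severity = "Unknown"
            totals[severity] += 1
    return totals


if __name__ == "__main__":
    sample_reports = [
        {"vulnerabilities": {"1": {"normalized_severity": "High"}}},
        {"vulnerabilities": {"2": {"normalized_severity": "High"},
                             "3": {"normalized_severity": "Low"}}},
    ]
    print(aggregate_severity_counts(sample_reports))
```

A worker would persist the resulting totals per organization and serve them until the next refresh. Whether the same CVE in several manifests counts once or several times is left open here.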

      2. Data architecture decision

      Two approaches were identified:

      (a) Periodic snapshot (recommended): A background worker iterates all manifests in an org, fetches vulnerability reports (from cache or Clair), aggregates counts by severity, and stores results in a new PostgreSQL table. Summaries are served from this table with Redis caching on top. This follows the existing pull-statistics pattern (workers/pullstatsredisflushworker.py).

      (b) On-demand aggregation: Query Clair for every manifest at summary request time. This gives real-time data but scales poorly: in the worst case, 5,000 sequential Clair calls that each run to the 30-second timeout would take ~42 hours (5,000 × 30 s = 150,000 s).
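The worst-case figure for option (b) can be reproduced with a short calculation. This is a back-of-the-envelope sketch: the helper name and the concurrency parameter are hypothetical, and it assumes every call runs to the full timeout, which real Clair latency will usually beat.

```python
def worst_case_hours(manifest_count, seconds_per_call=30, concurrency=1):
    """Upper bound on wall-clock hours if every Clair call hits the timeout."""
    total_seconds = manifest_count * seconds_per_call / concurrency
    return total_seconds / 3600


if __name__ == "__main__":
    # 5,000 manifests, one call at a time, 30 s each
    print(round(worst_case_hours(5000), 1))  # 41.7
```

Running calls in parallel divides the bound by the concurrency level, but it also shifts the load onto Clair's matcher database, so the estimate is only a starting point for sizing.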

      Open question for PM: Is periodic snapshot freshness (e.g., refreshed every few hours or nightly) acceptable, or do customers expect real-time data?

      3. Image scope ambiguity

      The feature spec does not define which images are included in the report:

      • All tagged manifests across all repos? Or only the latest tag per repo?
      • Manifests with IndexStatus.FAILED or MANIFEST_UNSUPPORTED — shown as "unknown" or excluded?
      • Unscanned images (security scanning disabled, Clair unavailable) — how represented?

      This directly impacts the data model design and worker logic.

      4. Permission model gap

      The feature proposes summary = org:view, export = org:admin. However, Quay has no org:view scope. The closest equivalent is OrganizationMemberPermission (any team member: admin, creator, or member).

      Open question for PM: Is "any org member can view aggregated vulnerability data across all repos" acceptable from a security standpoint? Aggregated data may be more sensitive than individual image scans.

      5. Export callback mechanism

      The existing log export system (workers/exportactionlogsworker.py) supports callback_url (webhook) and callback_email on completion. The PROJQUAY-10556 spec only mentions status polling.

      Open question for PM: Should vulnerability exports support webhook/email callbacks? For CI/CD automation (Ford's use case), webhooks are significantly more practical than polling.

      6. JSON export schema format

      The spec says "structured format with nested objects" but does not define a schema. If customers feed this into SIEMs or security tooling, a standard format (e.g., CycloneDX SBOM, CSAF) would reduce integration effort.

      Open question for PM: Quay-specific schema or industry standard?

      7. Application-level rate limiting gap

      Quay's rate limiting is nginx-based (three tiers: 5/50/60 r/s). The /api/ path inherits the dynamicauth_heavy zone (5 r/s). However, for per-org rate limits on summary refresh and export requests (Q4/Q5 in PROJQUAY-10556), there is no reusable application-level rate limiting abstraction in the codebase. This would need to be built, likely Redis-backed with per-org keys.
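One possible shape for such an abstraction is a fixed-window counter keyed per organization and action. The sketch below is an illustration under assumptions, not existing Quay code: the key layout, function name, and limits are hypothetical. It relies only on the incr and expire operations that a Redis client provides, and uses a small in-memory stand-in so the example runs without a Redis server.

```python
import time


class InMemoryCounterStore:
    """Stand-in exposing the incr/expire subset of a Redis client (for local runs only)."""

    def __init__(self):
        self._values = {}

    def incr(self, key):
        self._values[key] = self._values.get(key, 0) + 1
        return self._values[key]

    def expire(self, key, seconds):
        # A real Redis client would drop the key after `seconds`; the stand-in keeps it.
        return True


def allow_request(store, org_name, action, limit, window_seconds, now=None):
    """Return True if this request fits within the org's limit for the current window."""
    now = time.time() if now is None else now
    window = int(now // window_seconds)
    # Hypothetical key layout: one counter per action, org, and time window.
    key = f"ratelimit/{action}/{org_name}/{window}"
    count = store.incr(key)
    if count == 1:
        # First hit in this window: let the counter expire once the window has passed.
        store.expire(key, window_seconds * 2)
    return count <= limit


if __name__ == "__main__":
    store = InMemoryCounterStore()
    for attempt in range(3):
        print(attempt, allow_request(store, "acme", "export", limit=2, window_seconds=60))
```

A fixed window is the simplest option and allows short bursts at window boundaries. A sliding window or token bucket would smooth that out at the cost of more state per key.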

      What the codebase investigation resolved

      The following ambiguities from PROJQUAY-10556 are no longer blockers — existing patterns can be reused:

      Area | Resolution | Existing pattern to follow
      Worker infrastructure | Use QueueWorker for exports, interval-based Worker for summary regeneration | ExportActionLogsWorker, SecurityWorker
      Async export system | Chunked upload to distributed storage, pre-signed download URLs, adaptive batching | ExportActionLogsWorker (workers/exportactionlogsworker.py)
      Cache architecture | Pluggable cache abstraction with per-key-type TTLs; add new org_vulnerability_summary_cache_ttl | data/cache/cache_key.py, data/cache/impl.py
      File retention / download URL expiry | Use storage-level TTL and get_direct_download_url(expires_in=...) | Log export system
      Permission enforcement | OrganizationMemberPermission for read, AdministerOrganizationPermission for export | endpoints/api/organization.py

      Deliverables

      • Written summary of findings (this ticket)
      • Comment posted on PROJQUAY-10556 with PM clarification questions
      • Once PM answers are received: input to PROJQUAY-10556 acceptance criteria and architecture design

              rh-ee-srmisra Sridipta Misra
              marckok Marcus Kok