Story
Resolution: Unresolved
Maintainers of the quay.io Clair instance would like to continuously deploy Clair's main branch in a way that does not involve running two versions simultaneously.
Background
Currently our workflow is as follows:
- Code merged into main is deployed to stage.
- Once functionality is verified in stage, we make a PR to app-interface to promote that specific image to production.
- Check error rate and deploy fixes as appropriate.
Problem
The above workflow has a couple of issues:
- There is very little traffic to stage, so catching errors or verifying results there is difficult; the traffic is not representative of production.
- Because our input is arbitrary (any container image, valid or not, that someone chooses to push up to the internet), it is difficult to test code resilience before it reaches production, so production deployments are riskier than desired.
Proposed solution
We would like to deploy main to a separate canary cluster that receives a sampled copy of production traffic and on which we can run error-rate and result-comparison analysis. One way to do this would be to leverage a workflow similar to the Clair backfill worker that was used to backfill the ClairV4 database before it went into production. The backfill/siphon worker would read from the production Quay DB (a read replica would most likely suffice); the secscan worker code would most likely need to be modified to no-op the write path, since writes would fail on a read replica anyway. The siphon worker would then fire n requests at the canary cluster, which would expose metrics we could use to analyze error rates.
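As a rough illustration, here is a minimal Go sketch of what the siphon worker could look like. Nearly everything in it is an assumption for illustration only: the replica connection string, the manifest table and digest column on the Quay replica, the canary host, and the buildManifest helper that would resolve layer digests into fetchable URIs (for example, presigned S3 URLs, which is where the production S3 access question below comes from). The endpoint path is Clair's indexer API; note the worker only ever reads from the replica, so its write path is a no-op by construction.

```go
// siphon: sketch of a worker that samples manifests from a Quay read
// replica and fires index requests at the canary Clair.
package main

import (
	"bytes"
	"context"
	"database/sql"
	"encoding/json"
	"log"
	"net/http"

	_ "github.com/lib/pq" // Postgres driver for the Quay replica
)

// Manifest is the body Clair's POST /indexer/api/v1/index_report expects.
type Manifest struct {
	Hash   string  `json:"hash"`
	Layers []Layer `json:"layers"`
}

// Layer points the indexer at fetchable layer content.
type Layer struct {
	Hash string `json:"hash"`
	URI  string `json:"uri"`
}

const canaryURL = "https://clair-canary.example.com" // hypothetical host

// buildManifest is a hypothetical stand-in for the logic that turns a Quay
// manifest row into a Clair manifest (layer digests plus presigned S3 URIs).
func buildManifest(ctx context.Context, db *sql.DB, digest string) (*Manifest, error) {
	// ... resolve layers from the replica and presign their S3 URLs ...
	return &Manifest{Hash: digest}, nil
}

func main() {
	ctx := context.Background()

	// Read-only connection to the Quay replica; the worker only SELECTs.
	db, err := sql.Open("postgres", "postgres://quay_ro@replica.example.com/quay?sslmode=require")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Sample the n most recently pushed manifests (n=100 here).
	rows, err := db.QueryContext(ctx, `SELECT digest FROM manifest ORDER BY id DESC LIMIT 100`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var digest string
		if err := rows.Scan(&digest); err != nil {
			log.Fatal(err)
		}
		m, err := buildManifest(ctx, db, digest)
		if err != nil {
			log.Printf("skip %s: %v", digest, err)
			continue
		}
		body, err := json.Marshal(m)
		if err != nil {
			log.Printf("marshal %s: %v", digest, err)
			continue
		}
		// Fire the index request at the canary; error rates surface in the
		// canary's own metrics, so we only log failures here.
		resp, err := http.Post(canaryURL+"/indexer/api/v1/index_report", "application/json", bytes.NewReader(body))
		if err != nil {
			log.Printf("index %s: %v", digest, err)
			continue
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusCreated {
			log.Printf("index %s: unexpected status %s", digest, resp.Status)
		}
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```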
A post-process would take the results from the manifests that have been indexed and compare them against production index reports to ensure we have not regressed. The results of these comparisons should be easily consumable by the Clair team.
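One possible shape for that post-process, again as a sketch: fetch the index report for each indexed digest from both clusters over Clair's indexer API and diff them. The prod and canary hosts are hypothetical, and the comparison rule used here (set equality on package name+version) is deliberately the simplest one; where the real comparison rules live is an open question below.

```go
// compare: sketch of a post-process that diffs canary index reports
// against production index reports for the same manifest digests.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// IndexReport is a minimal slice of Clair's index report: packages keyed
// by an internal id.
type IndexReport struct {
	Success  bool               `json:"success"`
	Packages map[string]Package `json:"packages"`
}

type Package struct {
	Name    string `json:"name"`
	Version string `json:"version"`
}

const (
	prodURL   = "https://clair-prod.example.com"   // hypothetical
	canaryURL = "https://clair-canary.example.com" // hypothetical
)

func fetchReport(host, digest string) (*IndexReport, error) {
	resp, err := http.Get(host + "/indexer/api/v1/index_report/" + digest)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("%s: status %s", host, resp.Status)
	}
	var ir IndexReport
	if err := json.NewDecoder(resp.Body).Decode(&ir); err != nil {
		return nil, err
	}
	return &ir, nil
}

// packageSet flattens a report into comparable "name version" keys.
func packageSet(ir *IndexReport) map[string]bool {
	set := make(map[string]bool, len(ir.Packages))
	for _, p := range ir.Packages {
		set[p.Name+" "+p.Version] = true
	}
	return set
}

// compare prints packages present in one report but not the other; any
// asymmetry is a candidate regression to surface to the Clair team.
func compare(digest string) error {
	prod, err := fetchReport(prodURL, digest)
	if err != nil {
		return err
	}
	canary, err := fetchReport(canaryURL, digest)
	if err != nil {
		return err
	}
	p, c := packageSet(prod), packageSet(canary)
	for k := range p {
		if !c[k] {
			fmt.Printf("%s: missing in canary: %s\n", digest, k)
		}
	}
	for k := range c {
		if !p[k] {
			fmt.Printf("%s: new in canary: %s\n", digest, k)
		}
	}
	return nil
}

func main() {
	// Digests would come from the set of manifests the siphon worker indexed.
	for _, d := range []string{"sha256:..."} {
		if err := compare(d); err != nil {
			fmt.Println("compare:", err)
		}
	}
}
```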
Considerations
- Which Quay DB should the siphon worker point to? Preferably a read replica.
- How do we keep the secscan worker and the siphon worker in sync? Import the secscan model from Quay and update it as a normal dependency?
- Would the new canary cluster be run by app-sre?
- Could we automatically hook into the app-sre grafana instance?
- How/where would we surface comparison results?
- How/where to define comparison rules?
- The cost of the additional production traffic generated by the comparison step
- Production S3 access
- Triggers (nightly? on push to main?)