Story
Resolution: Unresolved
Maintainers of the quay.io Clair instance would like to continuously deploy Clair's main branch in a way that does not involve running two versions simultaneously.
Background
Currently our workflow is as follows:
- Code merged into main is deployed to stage.
- Once functionality is verified in stage, we make a PR to app-interface to promote that specific image to production.
- Check error rate and deploy fixes as appropriate.
Problem
The above workflow has a couple of issues:
- There is very little traffic to stage, so catching errors or verifying results there is difficult; the traffic is not representative of production.
- Because our input is arbitrary (any container image, valid or not, that someone chooses to push up to the internet), it is difficult to test code resilience before it reaches production, so production deployments are riskier than desired.
Proposed solution
We would like to deploy main to a separate canary cluster that receives a sampled copy of production traffic and on which we can run error-rate and result-comparison analysis. One way to do this would be to leverage a workflow similar to the Clair backfill worker that was used to backfill the ClairV4 database before it went into production. The backfill/siphon worker would read from the production Quay DB (a read replica would most likely suffice); the secscan worker code would most likely need to be modified to no-op the write path, since writes would fail on a read replica anyway. The siphon worker would then fire n requests at the canary cluster, which would expose metrics we could use to analyze error rates.
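As a rough illustration, here is a minimal Go sketch of what the siphon worker could look like. Nearly everything in it is an assumption for illustration only: the replica connection string, the manifest table and digest column on the Quay replica, the canary host, and the buildManifest helper that would resolve layer digests into fetchable URIs (for example, presigned S3 URLs, which is where the production S3 access question below comes from). The endpoint path is Clair's indexer API; note the worker only ever reads from the replica, so its write path is a no-op by construction.

```go
// siphon: sketch of a worker that samples manifests from a Quay read
// replica and fires index requests at the canary Clair.
package main

import (
	"bytes"
	"context"
	"database/sql"
	"encoding/json"
	"log"
	"net/http"

	_ "github.com/lib/pq" // Postgres driver for the Quay replica
)

// Manifest is the body Clair's POST /indexer/api/v1/index_report expects.
type Manifest struct {
	Hash   string  `json:"hash"`
	Layers []Layer `json:"layers"`
}

// Layer points the indexer at fetchable layer content.
type Layer struct {
	Hash string `json:"hash"`
	URI  string `json:"uri"`
}

const canaryURL = "https://clair-canary.example.com" // hypothetical host

// buildManifest is a hypothetical stand-in for the logic that turns a Quay
// manifest row into a Clair manifest (layer digests plus presigned S3 URIs).
func buildManifest(ctx context.Context, db *sql.DB, digest string) (*Manifest, error) {
	// ... resolve layers from the replica and presign their S3 URLs ...
	return &Manifest{Hash: digest}, nil
}

func main() {
	ctx := context.Background()

	// Read-only connection to the Quay replica; the worker only SELECTs.
	db, err := sql.Open("postgres", "postgres://quay_ro@replica.example.com/quay?sslmode=require")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Sample the n most recently pushed manifests (n=100 here).
	rows, err := db.QueryContext(ctx, `SELECT digest FROM manifest ORDER BY id DESC LIMIT 100`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var digest string
		if err := rows.Scan(&digest); err != nil {
			log.Fatal(err)
		}
		m, err := buildManifest(ctx, db, digest)
		if err != nil {
			log.Printf("skip %s: %v", digest, err)
			continue
		}
		body, err := json.Marshal(m)
		if err != nil {
			log.Printf("marshal %s: %v", digest, err)
			continue
		}
		// Fire the index request at the canary; error rates surface in the
		// canary's own metrics, so we only log failures here.
		resp, err := http.Post(canaryURL+"/indexer/api/v1/index_report", "application/json", bytes.NewReader(body))
		if err != nil {
			log.Printf("index %s: %v", digest, err)
			continue
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusCreated {
			log.Printf("index %s: unexpected status %s", digest, resp.Status)
		}
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```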
A post-process would take the results from the manifests that have been indexed and compare them against production index reports to ensure we have not regressed. The results of these comparisons should be easily consumable by the Clair team.
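One possible shape for that post-process, again as a sketch: fetch the index report for each indexed digest from both clusters over Clair's indexer API and diff them. The prod and canary hosts are hypothetical, and the comparison rule used here (set equality on package name+version) is deliberately the simplest one; where the real comparison rules live is an open question below.

```go
// compare: sketch of a post-process that diffs canary index reports
// against production index reports for the same manifest digests.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// IndexReport is a minimal slice of Clair's index report: packages keyed
// by an internal id.
type IndexReport struct {
	Success  bool               `json:"success"`
	Packages map[string]Package `json:"packages"`
}

type Package struct {
	Name    string `json:"name"`
	Version string `json:"version"`
}

const (
	prodURL   = "https://clair-prod.example.com"   // hypothetical
	canaryURL = "https://clair-canary.example.com" // hypothetical
)

func fetchReport(host, digest string) (*IndexReport, error) {
	resp, err := http.Get(host + "/indexer/api/v1/index_report/" + digest)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("%s: status %s", host, resp.Status)
	}
	var ir IndexReport
	if err := json.NewDecoder(resp.Body).Decode(&ir); err != nil {
		return nil, err
	}
	return &ir, nil
}

// packageSet flattens a report into comparable "name version" keys.
func packageSet(ir *IndexReport) map[string]bool {
	set := make(map[string]bool, len(ir.Packages))
	for _, p := range ir.Packages {
		set[p.Name+" "+p.Version] = true
	}
	return set
}

// compare prints packages present in one report but not the other; any
// asymmetry is a candidate regression to surface to the Clair team.
func compare(digest string) error {
	prod, err := fetchReport(prodURL, digest)
	if err != nil {
		return err
	}
	canary, err := fetchReport(canaryURL, digest)
	if err != nil {
		return err
	}
	p, c := packageSet(prod), packageSet(canary)
	for k := range p {
		if !c[k] {
			fmt.Printf("%s: missing in canary: %s\n", digest, k)
		}
	}
	for k := range c {
		if !p[k] {
			fmt.Printf("%s: new in canary: %s\n", digest, k)
		}
	}
	return nil
}

func main() {
	// Digests would come from the set of manifests the siphon worker indexed.
	for _, d := range []string{"sha256:..."} {
		if err := compare(d); err != nil {
			fmt.Println("compare:", err)
		}
	}
}
```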
Considerations
- Which Quay DB should the siphon worker point to? Preferably a read replica.
- How do we keep the secscan worker and the siphon worker in sync? Import the secscan model from Quay and update it as a normal dependency?
- Would the new canary cluster be run by app-sre?
- Could we automatically hook into the app-sre grafana instance?
- How/where would we surface comparison results?
- How/where to define comparison rules?
- The cost of the additional production traffic generated by the comparison step
- Production S3 access
- Triggers (nightly? on push to main?)