  1. Project Quay
  2. PROJQUAY-3716

Add SLI (and SLO) for quay.io vulnerabilities API scan latency

Details

    • Spike
    • Resolution: Done
    • Undefined
    • None
    • None
    • quay.io
    • 0

    Description

      In HACBS, we intend to submit requests to the Clair instance running as part of quay.io to scan images and retrieve results about CVEs in those images.

      We found that these scan results frequently take longer than we expected, and that the latency may be getting worse over time.

      From kpavic@redhat.com and jsztuka's investigations:

      The issue with getting the vulnerabilities from Quay was that the HTTP response would always be 200, but the response contents would indicate that the Clair scan was queued (and it would never finish). Here's an example of such a request for an image that was pushed during the outage 2 weeks ago and still hasn't been scanned:

      curl -H "Content-type: application/json" -XGET "https://quay.io/api/v1/repository/jsztuka/tester-s/manifest/sha256%3A07c2a8db4ecff9924b67200180f70aaead669f57b51af708bb9d87357aa1687a/security?vulnerabilities=true"

      Compare that with a request for an image that was scanned:

      curl -H "Content-type: application/json" -XGET "https://quay.io/api/v1/repository/jsztuka/tester-s/manifest/sha256%3A24c0a33d297f04022bf603ad887a7466cdff908caf67794d7acc2e405650e796/security?vulnerabilities=true"

      Furthermore, about a month ago a new queue-jumping capability was implemented that should have made things faster, but our anecdotal user experience indicates that things have actually gotten slower in the last month.

      I think it makes sense to add a real SLI for the quay.io vulnerabilities API that tracks the latency in producing Clair scans: not HTTP response times, but queued scan times. Having an SLI for this would give Quay/Clair PM and Engineering data about how this is behaving. Setting an SLO for this would help teams like ours decide whether or not we should depend on it, and help us understand when we should or shouldn't escalate.
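
      As a rough illustration of the difference between HTTP response time and queued scan time, the sketch below polls the same endpoint used in the curl examples and measures how long the scan stays queued. It is a client-side approximation under stated assumptions, not the proposed SLI itself: the top-level "status" field and its "queued" and "scanned" values are assumptions about the response body, the function names are hypothetical, and the clock starts when polling starts rather than when the image was pushed.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, Optional

QUAY_API = "https://quay.io/api/v1"


def security_url(repository: str, digest: str) -> str:
    """Build the manifest security URL in the same shape as the curl examples."""
    # Percent-encode the digest so "sha256:..." becomes "sha256%3A...".
    manifest_ref = urllib.parse.quote(digest, safe="")
    return (
        f"{QUAY_API}/repository/{repository}/manifest/"
        f"{manifest_ref}/security?vulnerabilities=true"
    )


def fetch_status(url: str) -> str:
    """Return the scan status reported in the response body.

    Assumption: the JSON body carries a top-level "status" field.
    The HTTP status code is deliberately ignored, since it is 200
    even while the scan is still queued.
    """
    with urllib.request.urlopen(url, timeout=30) as response:
        body = json.load(response)
    return body.get("status", "unknown")


def seconds_until_scanned(
    url: str,
    fetch: Callable[[str], str] = fetch_status,
    poll_interval: float = 15.0,
    timeout: float = 1800.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until the scan is reported as scanned.

    Returns the elapsed seconds, or None if the timeout is reached first.
    Assumption: "scanned" is the value that marks a finished scan.
    """
    start = clock()
    while True:
        if fetch(url) == "scanned":
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(poll_interval)
```

      A caller would pass the repository and digest of a freshly pushed image, for example seconds_until_scanned(security_url("jsztuka/tester-s", digest)), and record the result as one latency sample. The fetch, clock, and sleep parameters exist only so the loop can be exercised without network access.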

      An SLO that reads like "95% of quay vulnerabilities API scans will be completed in under 7 minutes" would be a good structure. We can play with those numbers and thresholds to find something right, as determined by Quay/Clair PM and Eng.
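
      To make the shape of that SLO concrete, here is a minimal sketch of how it could be evaluated over a set of scan latency samples. The function names are hypothetical, the 7 minute threshold and 95% target are just the placeholder numbers from the sentence above, and the sample values in the usage lines are made up for illustration, not measurements.

```python
from typing import Sequence


def fraction_within(latencies_seconds: Sequence[float], threshold_seconds: float) -> float:
    """Return the fraction of scans that completed in under the threshold."""
    if not latencies_seconds:
        raise ValueError("at least one latency sample is required")
    within = sum(1 for latency in latencies_seconds if latency < threshold_seconds)
    return within / len(latencies_seconds)


def meets_slo(
    latencies_seconds: Sequence[float],
    threshold_seconds: float = 7 * 60,  # placeholder: "under 7 minutes"
    target: float = 0.95,               # placeholder: "95% of scans"
) -> bool:
    """Check whether the observed samples satisfy the SLO."""
    return fraction_within(latencies_seconds, threshold_seconds) >= target


# Illustrative samples only: 19 scans at one minute, 1 scan at fifteen minutes.
samples = [60.0] * 19 + [900.0]
print(fraction_within(samples, 7 * 60))
print(meets_slo(samples))
```

      Keeping the threshold and the target as parameters matches the intent that Quay/Clair PM and Eng can adjust both numbers without changing how the indicator is computed.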

      The Clair SLIs and SLOs are defined here: https://gitlab.cee.redhat.com/service/app-interface/-/blob/master/data/services/clair/slo-documents/clair.yaml, but I cannot find a corresponding SLO document for quay.io.

      I hear that quay.io is now in the process of updating its SLIs and SLOs. I would like to provide this request as input to that process, and track the output.

            People

              marckok Marcus Kok
              rbean@redhat.com Ralph Bean