Project Quay / PROJQUAY-851

Investigate Quay's usage of gunicorn and identify obvious issues/improvements


    • Type: Epic
    • Resolution: Done
    • Priority: Major
    • quay
    • Gunicorn Investigation
    • Improvement
    • Quay
    • Done
    • 0% To Do, 0% In Progress, 100% Done

      Description copied from PROJQUAY-729

      I have been thinking that the Quay database exhausting its connection limit during an outage is actually a cascading failure originating from the application servers. By reviewing the gunicorn config, we should be able to tighten control over some variables and prevent these failures from happening in the future.

      Here are some notes that I compiled in a brief spike on the idea:

      * Configure the worker temp directory to use memory
        * What
          * This avoids the filesystem overhead incurred when gunicorn workers write their heartbeat files
        * Why
          * This will make Quay less susceptible to noisy neighbors or changes to container engine drivers.
          * Reduces gunicorn coordination time, which could lead to faster startups.
        * Documentation
          * https://docs.gunicorn.org/en/stable/settings.html?highlight=bind#worker-tmp-dir
          * https://pythonspeed.com/articles/gunicorn-in-docker
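      A minimal sketch of what this could look like in a gunicorn config file (the filename and path are illustrative; `/dev/shm` is a tmpfs mount on most Linux distributions, so heartbeat writes never touch a real disk):

```python
# gunicorn config fragment (e.g. conf/gunicorn_web.py) -- illustrative.
# Point the worker heartbeat temp directory at tmpfs so workers are not
# blocked by slow or contended disk I/O (the "noisy neighbor" problem).
worker_tmp_dir = "/dev/shm"
```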
      
      * Configure the maximum number of worker connections per gunicorn worker
        * What
          * This is the resource boundary for greenlet concurrency
          * We're using the default limit (1000).
        * Why
          * By setting this limit explicitly, we can control the maximum number of resources (open FDs, DB connections, etc.) each Quay instance can create.
        * Follow ups
          * Review DB connection pooling maximum connections and stale timeouts.
          * greenlets * connections per pool * gunicorn workers * pods = max db connections
        * Documentation
          * https://docs.gunicorn.org/en/stable/settings.html?highlight=bind#worker-connections
          * https://medium.com/building-the-system/gunicorn-3-means-of-concurrency-efbb547674b7#a258
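      The capacity formula in the follow-ups can be sketched as a back-of-the-envelope check. All of the numbers below are illustrative assumptions, not Quay's actual settings, and the function name is hypothetical:

```python
def max_db_connections(greenlets, conns_per_pool, workers, pods):
    """Worst-case DB connections a deployment can open, per the formula:
    greenlets * connections per pool * gunicorn workers * pods."""
    return greenlets * conns_per_pool * workers * pods

# Example: default worker_connections (1000 greenlets), one pooled
# connection per greenlet, (2*4)+1 workers on a 4-CPU node, 3 pods.
print(max_db_connections(1000, 1, 9, 3))  # 27000
```

      Even with modest per-worker settings, the multiplication makes it clear why an unbounded worker_connections value can overwhelm the database's connection limit.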
      
      * Only ever use the recommended gunicorn worker count
        * What
          * This is the number of gunicorn worker processes that are forked.
          * The recommended value is (2*CPU)+1 or (4*CPU)+1.
        * Why
          * The number of workers is generally not the knob you're supposed to tune for concurrency.
          * To profile performance more rigorously, we should eliminate unnecessary sources of variation.
        * Existing code
          * https://github.com/quay/quay/blob/fd9975d20f59a69789fef992bb09ae6d24f7fbbd/util/workers.py
            * Gunicorn already has an environment variable to override the default, `WEB_CONCURRENCY`.
        * Documentation
          * https://docs.gunicorn.org/en/stable/settings.html?highlight=bind#worker-processes
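      A minimal sketch of the recommended default with the override, assuming the (2*CPU)+1 formula (gunicorn itself honors `WEB_CONCURRENCY` when no worker count is given; the helper name here is hypothetical and simply mirrors that behavior):

```python
import multiprocessing
import os


def default_worker_count():
    """Recommended gunicorn worker count, (2*CPU)+1, with the same
    WEB_CONCURRENCY environment-variable override gunicorn supports."""
    override = os.environ.get("WEB_CONCURRENCY")
    if override:
        return int(override)
    return 2 * multiprocessing.cpu_count() + 1
```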
      
      * Create a Prometheus Gauge for number of gunicorn workers
        * What
          * This is a metric we could add to Quay to track the number of running gunicorn workers.
        * Why
          * This will allow us to see if gunicorn processes are exiting unexpectedly.
        * Documentation
          * https://docs.gunicorn.org/en/stable/settings.html?highlight=bind#nworkers-changed
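      A sketch of how this could be wired up using gunicorn's `nworkers_changed` server hook and a `prometheus_client` Gauge (the metric name is illustrative; Quay's actual metric naming conventions may differ):

```python
from prometheus_client import Gauge

# Illustrative metric name -- not necessarily what Quay would ship.
WORKER_COUNT = Gauge(
    "quay_gunicorn_workers",
    "Number of running gunicorn worker processes",
)


# gunicorn server hook; placed in the gunicorn config file. The master
# invokes it whenever the worker count changes, including at startup,
# so unexpected worker exits show up as dips in the gauge.
def nworkers_changed(server, new_value, old_value):
    WORKER_COUNT.set(new_value)
```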
      

      Assignee: Unassigned
      Reporter: Kurtis Mullins (kmullins@redhat.com)