Type: Epic
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: quay
Labels: None

Epic Name: Gunicorn Investigation
Work Type: Improvement
Backlogs: Quay
Epic Status: Done
Hierarchy Progress Bar: 0% To Do, 0% In Progress, 100% Done

Description copied from ~~PROJQUAY-729~~

I suspect that the Quay database exhausting its connections during an outage is actually a cascading failure that starts in the application servers. By reviewing the gunicorn config, we should be able to tighten control over some variables and prevent these failures from happening in the future.

Here are some notes that I compiled in a brief spike on the idea:

* Configure the worker temp directory to use memory
  * What
    * This avoids filesystem overhead when gunicorn workers write their heartbeat.
  * Why
    * This will make Quay less susceptible to noisy neighbors or changes to container engine drivers.
    * Reduces gunicorn coordination time, which could speed up startup.
  * Documentation
    * https://docs.gunicorn.org/en/stable/settings.html?highlight=bind#worker-tmp-dir
    * https://pythonspeed.com/articles/gunicorn-in-docker
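A minimal sketch of what this could look like in a gunicorn config file. It assumes a Linux container where `/dev/shm` is a memory-backed tmpfs (the location suggested in the pythonspeed article above); the helper `pick_worker_tmp_dir` is hypothetical and not part of gunicorn or Quay.

```python
import os


def pick_worker_tmp_dir(candidates=("/dev/shm",)):
    """Return the first candidate that is an existing, writable directory.

    Returns None when no candidate qualifies, which leaves gunicorn on its
    default temp directory.
    """
    for path in candidates:
        if os.path.isdir(path) and os.access(path, os.W_OK):
            return path
    return None


# gunicorn reads this module-level name from its config file.
# Assumption: /dev/shm is memory-backed on the target host.
worker_tmp_dir = pick_worker_tmp_dir()
```

Falling back to `None` rather than failing keeps the config usable on hosts without `/dev/shm`, at the cost of silently losing the benefit there.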

* Configure the maximum number of worker connections per gunicorn worker
  * What
    * This is the resource boundary for greenlet concurrency
    * We're using the default limit (1000).
  * Why
    * By setting this limit explicitly, we can control the maximum number of resources (open FDs, DB connections, etc.) each Quay instance can create.
  * Follow ups
    * Review DB connection pooling maximum connections and stale timeouts.
    * greenlets * connections per pool * gunicorn workers * pods = max db connections
  * Documentation
    * https://docs.gunicorn.org/en/stable/settings.html?highlight=bind#worker-connections
    * https://medium.com/building-the-system/gunicorn-3-means-of-concurrency-efbb547674b7#a258
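To make the follow-up formula concrete, here is a small sketch that computes the worst-case connection count exactly as written in the note. The function name and all example numbers other than the default of 1000 are hypothetical; real values would come from the Quay deployment and its DB pool settings.

```python
def max_db_connections(greenlets, connections_per_pool, gunicorn_workers, pods):
    """Upper bound from the note:
    greenlets * connections per pool * gunicorn workers * pods.
    """
    return greenlets * connections_per_pool * gunicorn_workers * pods


# In a gunicorn config file the per-worker limit is a module-level name.
# 100 is a hypothetical value chosen only for illustration.
worker_connections = 100

# With the default limit of 1000, 1 connection per pool, 5 workers, 10 pods:
default_bound = max_db_connections(1000, 1, 5, 10)

# Same deployment with the hypothetical tighter limit:
tightened_bound = max_db_connections(worker_connections, 1, 5, 10)
```

Comparing the two bounds against the database's configured connection limit would show whether the default leaves room for the cascading failure described above.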

* Only ever use the recommended gunicorn worker count
  * What
    * This is the number of gunicorn worker processes that are forked.
    * The recommended value is (2*CPU)+1 or (4*CPU)+1.
  * Why
    * The number of workers is generally not the knob you're supposed to tune for concurrency.
    * To apply the scientific method more effectively when profiling performance, we should remove variables that do not need to vary.
  * Existing code
    * https://github.com/quay/quay/blob/fd9975d20f59a69789fef992bb09ae6d24f7fbbd/util/workers.py
      * Gunicorn already has an environment variable to override the default, `WEB_CONCURRENCY`.
  * Documentation
    * https://docs.gunicorn.org/en/stable/settings.html?highlight=bind#worker-processes
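The recommended count can be sketched as a small helper. This is an illustration under assumptions, not the logic in Quay's `util/workers.py`: the function name, the `multiplier` parameter, and the injectable `environ` argument are all hypothetical. gunicorn already honors `WEB_CONCURRENCY` on its own; the override is mirrored here only to show the precedence.

```python
import os


def recommended_workers(cpu_count, multiplier=2, environ=None):
    """Return (multiplier * cpu_count) + 1 unless WEB_CONCURRENCY is set."""
    environ = os.environ if environ is None else environ
    override = environ.get("WEB_CONCURRENCY")
    if override:
        return int(override)
    return multiplier * cpu_count + 1


# gunicorn reads this module-level name from its config file.
# os.cpu_count() can return None, so fall back to a single CPU.
workers = recommended_workers(os.cpu_count() or 1)
```

For example, `recommended_workers(4, environ={})` gives 9, and passing `multiplier=4` gives 17.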

* Create a Prometheus Gauge for number of gunicorn workers
  * What
    * This is a metric we could add to Quay that would track the number of running gunicorn workers.
  * Why
    * This will allow us to see if gunicorn processes are exiting unexpectedly.
  * Documentation
    * https://docs.gunicorn.org/en/stable/settings.html?highlight=bind#nworkers-changed
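One possible shape for the gauge, sketched around gunicorn's `nworkers_changed` server hook, which receives the arbiter plus the new and old worker counts. The factory `make_nworkers_changed` and the metric name `quay_gunicorn_workers` are hypothetical. The gauge is passed in so the hook only needs an object with a `set()` method, such as a `prometheus_client.Gauge`.

```python
def make_nworkers_changed(gauge):
    """Build a gunicorn nworkers_changed hook that records the worker count.

    `gauge` is any object with a set(value) method.
    """

    def nworkers_changed(server, new_value, old_value):
        # old_value is None the first time the worker count is set.
        gauge.set(new_value)

    return nworkers_changed


# In a gunicorn config file this might be wired up as follows
# (assumes prometheus_client is installed; the metric name is hypothetical):
#
#   from prometheus_client import Gauge
#   nworkers_changed = make_nworkers_changed(
#       Gauge("quay_gunicorn_workers", "Number of gunicorn workers")
#   )
```

The hook runs in the gunicorn master process and fires when the configured worker count changes, so how the value reaches the metrics endpoint, and whether crashes need a separate hook such as `child_exit`, would still have to be worked out.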

Assignee: Unassigned
Reporter: Kurtis Mullins (Inactive)
Votes: 0
Watchers: 2
Created: 2020/07/13 12:03 PM
Updated: 2024/07/29 10:07 PM
Resolved: 2021/01/18 2:18 PM
