It seems that Quay is not releasing or closing connections properly when database activity has ceased. This happens in all modes, regardless of whether we use or don't use connection pooling. As far as I understand, with connection pooling turned off, when db.close() is issued by Peewee/Psycopg2, the database connection should be closed completely and a new connection should be opened in instead, for a new transaction. This is not what I'm seeing locally and not what clients are seeing either:
postgres=# select * from pg_stat_activity where datname = 'quayclone'; datid | datname | pid | leader_pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | state_change | wait_event_type | wait_event | state | backend_xid | backend_xmin | query_id | query | backend_type ---------+-----------+--------+------------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+------------+-------------------------------+-------------------------------+-----------------+------------+-------+-------------+--------------+----------+--------+---------------- 3698451 | quayclone | 131351 | | 16388 | testuser | | 172.17.0.4 | | 37602 | 2024-11-26 12:32:00.117419-05 | | 2024-11-26 13:14:00.118267-05 | 2024-11-26 13:14:00.118304-05 | Client | ClientRead | idle | | | | COMMIT | client backend 3698451 | quayclone | 131429 | | 16388 | testuser | | 172.17.0.4 | | 37610 | 2024-11-26 12:32:05.320223-05 | | 2024-11-26 13:14:15.31884-05 | 2024-11-26 13:14:15.318873-05 | Client | ClientRead | idle | | | | COMMIT | client backend 3698451 | quayclone | 131438 | | 16388 | testuser | | 172.17.0.4 | | 42892 | 2024-11-26 12:32:09.895336-05 | | 2024-11-26 13:14:09.893118-05 | 2024-11-26 13:14:09.893156-05 | Client | ClientRead | idle | | | | COMMIT | client backend 3698451 | quayclone | 131443 | | 16388 | testuser | | 172.17.0.4 | | 42906 | 2024-11-26 12:32:14.206605-05 | | 2024-11-26 13:14:14.204054-05 | 2024-11-26 13:14:14.204096-05 | Client | ClientRead | idle | | | | COMMIT | client backend 3698451 | quayclone | 131452 | | 16388 | testuser | | 172.17.0.4 | | 37664 | 2024-11-26 12:32:22.165531-05 | | 2024-11-26 13:13:22.164289-05 | 2024-11-26 13:13:22.164336-05 | Client | ClientRead | idle | | | | COMMIT | client backend 3698451 | quayclone | 131460 | | 16388 | testuser | | 172.17.0.4 | | 37668 | 2024-11-26 12:32:23.27489-05 | | 2024-11-26 13:13:23.273631-05 | 2024-11-26 13:13:23.273666-05 | Client | ClientRead | idle | | | | COMMIT | client backend 3698451 | quayclone | 131685 | | 16388 | testuser | | 172.17.0.4 | | 33760 | 2024-11-26 12:32:37.380294-05 | | 2024-11-26 13:14:12.382933-05 | 2024-11-26 13:14:12.382948-05 | Client | ClientRead | idle | | | | COMMIT | client backend 3698451 | quayclone | 131703 | | 16388 | testuser | | 172.17.0.4 | | 53326 | 2024-11-26 12:32:55.321139-05 | | 2024-11-26 13:13:55.318663-05 | 2024-11-26 13:13:55.3187-05 | Client | ClientRead | idle | | | | COMMIT | client backend 3698451 | quayclone | 136791 | | 16388 | testuser | | 172.17.0.4 | | 39698 | 2024-11-26 12:47:08.345172-05 | | 2024-11-26 12:47:08.359279-05 | 2024-11-26 12:47:08.359291-05 | Client | ClientRead | idle | | | | COMMIT | client backend 3698451 | quayclone | 138678 | | 16388 | testuser | | 172.17.0.4 | | 42722 | 2024-11-26 12:54:04.129339-05 | | 2024-11-26 12:54:04.140437-05 | 2024-11-26 12:54:04.140454-05 | Client | ClientRead | idle | | | | COMMIT | client backend 3698451 | quayclone | 141139 | | 16388 | testuser | | 172.17.0.4 | | 40136 | 2024-11-26 13:02:00.40037-05 | | 2024-11-26 13:14:15.399526-05 | 2024-11-26 13:14:15.399568-05 | Client | ClientRead | idle | | | | COMMIT | client backend 3698451 | quayclone | 142102 | | 16388 | testuser | | 172.17.0.4 | | 44952 | 2024-11-26 13:04:20.287225-05 | | 2024-11-26 13:04:20.299892-05 | 2024-11-26 13:04:20.30514-05 | Client | ClientRead | idle | | | | COMMIT | client backend (12 rows) postgres=#
12 connections in idle]] status. With {{DB_CONNECTION_POOLING parameter removed as an env. variable, the only worker that pools connections is gunicorn-registry. Again, we see a large amount of idle connections. If stale time is set in db connection parameters, Peewee does seem to honor that and closes stale connections and returns them to pool:
gunicorn-registry stdout | 2024-11-26 16:02:29,123 [375] [DEBUG] [peewee.pool] No connection available in pool.
gunicorn-registry stdout | 2024-11-26 16:02:29,137 [375] [DEBUG] [peewee.pool] Created new connection 126199438183616.
...
gunicorn-registry stdout | 2024-11-26 16:02:29,386 [375] [DEBUG] [data.database] Disconnecting from database.
gunicorn-registry stdout | 2024-11-26 16:02:29,386 [375] [DEBUG] [peewee.pool] Returning 126199438183616 to pool.
...
gunicorn-registry stdout | 2024-11-26 16:02:29,401 [375] [DEBUG] [data.database] Disconnecting from database.
gunicorn-registry stdout | 2024-11-26 16:02:29,401 [375] [DEBUG] [peewee.pool] Returning 126199438183616 to pool.
However, again, this does not lower the number of idle connections. Again, when Peewee closes them, it seems as though these connections are set to idle state and just left as it is. Which is strange, again, on explicit closing of thee db connection, the idle connection should be removed.
If DB_CONNECTION_POOLING=true, then pool seems to be active for all workers, even though on initialization only the three main workers show a pool being initialized:
gunicorn-secscan stdout | 2024-11-26 14:07:18,018 [79] [INFO] [data.database] Connection pooling enabled for postgresql; stale timeout: 60; max connection count: 5 gunicorn-web stdout | 2024-11-26 14:07:19,037 [80] [INFO] [data.database] Connection pooling enabled for postgresql; stale timeout: 60; max connection count: 5 gunicorn-registry stdout | 2024-11-26 14:07:20,236 [78] [INFO] [data.database] Connection pooling enabled for postgresql; stale timeout: 60; max connection count: 5
However, logs show a different thing:
gunicorn-registry stdout | 2024-11-26 15:18:09,282 [79] [INFO] [data.database] Connection pooling enabled for postgresql; stale timeout: 10; max connection count: 5 gunicorn-secscan stdout | 2024-11-26 15:18:09,571 [80] [INFO] [data.database] Connection pooling enabled for postgresql; stale timeout: 10; max connection count: 5 gunicorn-web stdout | 2024-11-26 15:18:09,584 [81] [INFO] [data.database] Connection pooling enabled for postgresql; stale timeout: 10; max connection count: 5 buildlogsarchiver stdout | 2024-11-26 15:18:12,893 [72] [DEBUG] [peewee.pool] No connection available in pool. buildlogsarchiver stdout | 2024-11-26 15:18:12,906 [72] [DEBUG] [peewee.pool] Created new connection 140352290218688. gcworker stdout | 2024-11-26 15:18:15,584 [77] [DEBUG] [peewee.pool] No connection available in pool. gcworker stdout | 2024-11-26 15:18:15,597 [77] [DEBUG] [peewee.pool] Created new connection 123804494868864. repositorygcworker stdout | 2024-11-26 15:18:16,936 [131] [DEBUG] [peewee.pool] No connection available in pool. repositorygcworker stdout | 2024-11-26 15:18:16,949 [131] [DEBUG] [peewee.pool] Created new connection 132507627151744. notificationworker stdout | 2024-11-26 15:18:19,178 [88] [DEBUG] [peewee.pool] No connection available in pool. notificationworker stdout | 2024-11-26 15:18:19,190 [88] [DEBUG] [peewee.pool] Created new connection 125431485128192. namespacegcworker stdout | 2024-11-26 15:18:20,066 [86] [DEBUG] [peewee.pool] No connection available in pool. namespacegcworker stdout | 2024-11-26 15:18:20,078 [86] [DEBUG] [peewee.pool] Created new connection 136450510438784. quotaregistrysizeworker stdout | 2024-11-26 15:18:34,017 [100] [DEBUG] [peewee.pool] No connection available in pool. quotaregistrysizeworker stdout | 2024-11-26 15:18:34,030 [100] [DEBUG] [peewee.pool] Created new connection 140598551536000.
which indicates that all workers, in fact, are using pooling. So which one is it? The number of idle connections seem to skyrocket here, as connection are kept in idle state.
I have also tried to figure out what's hapenning from PostgreSQL side, for this reason I enabled logging of all statements and log connections/disconnections. I do see both in Quay but their number doesn't appear to be the same:
root@cyberdyne:/var/log/postgresql# cat postgresql-16-main.log | grep -i "connection authorized" | wc -l 37 root@cyberdyne:/var/log/postgresql# cat postgresql-16-main.log | grep -i "disconnection" | wc -l 26
So more connections to the database than disconnections to the db, which is not something I'd expect if pooling is explicitly turned off. So from this perspective, it looks like Peewee does not close connections for certain workers. For instance:
2024-11-26 14:12:57.555 EST [164175] [unknown]@[unknown] LOG: connection received: host=172.17.0.4 port=55738 2024-11-26 14:12:57.564 EST [164175] testuser@quayclone LOG: connection authenticated: identity="testuser" method=md5 (/etc/postgresql/16/main/pg_hba.conf:135) 2024-11-26 14:12:57.564 EST [164175] testuser@quayclone LOG: connection authorized: user=testuser database=quayclone SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256) 2024-11-26 14:12:57.566 EST [164175] testuser@quayclone LOG: duration: 0.052 ms statement: BEGIN 2024-11-26 14:12:57.567 EST [164175] testuser@quayclone LOG: duration: 1.023 ms statement: SELECT DISTINCT "t1"."removed_tag_expiration_s" FROM "user" AS "t1" LIMIT 100 2024-11-26 14:12:57.567 EST [164175] testuser@quayclone LOG: duration: 0.011 ms statement: COMMIT 2024-11-26 14:12:57.568 EST [164175] testuser@quayclone LOG: duration: 0.004 ms statement: BEGIN 2024-11-26 14:12:57.570 EST [164175] testuser@quayclone LOG: duration: 1.820 ms statement: SELECT "candidates"."repository_id" FROM (SELECT DISTINCT "t1"."repository_id" FROM "tag" AS "t1" INNER JOIN "repository" AS "t2" ON ("t1"."repository_id" = "t2"."id") INNER JOIN "user" AS "t3" ON ("t2"."namespace_user_id" = "t3"."id") WHERE ((((NOT ("t1"."lifetime_end_ms" IS NULL) AND ("t1"."lifetime_end_ms" <= 1731438777567)) AND ("t3"."removed_tag_expiration_s" = 1209600)) AND ("t3"."enabled" = true)) AND ("t2"."state" != 3)) LIMIT 500) AS "candidates" ORDER BY Random() LIMIT 1 OFFSET 0 2024-11-26 14:12:57.570 EST [164175] testuser@quayclone LOG: duration: 0.014 ms statement: COMMIT
seems to correspond to this Quay log:
gcworker stdout | 2024-11-26 19:12:57,551 [76] [DEBUG] [apscheduler.scheduler] Looking for jobs to run gcworker stdout | 2024-11-26 19:12:57,552 [76] [INFO] [apscheduler.executors.default] Running job "GarbageCollectionWorker._garbage_collection_repos (trigger: interval[0:00:30], next run at: 2024-11-26 19:12:57 UTC)" (scheduled at 2024-11-26 19:12:57.551312+00:00) gcworker stdout | 2024-11-26 19:12:57,553 [76] [DEBUG] [peewee] ('SELECT DISTINCT "t1"."removed_tag_expiration_s" FROM "user" AS "t1" LIMIT %s', [100]) gcworker stdout | 2024-11-26 19:12:57,553 [76] [DEBUG] [apscheduler.scheduler] Next wakeup is due at 2024-11-26 19:13:27.551312+00:00 (in 29.997688 seconds) gcworker stdout | 2024-11-26 19:12:57,568 [76] [DEBUG] [peewee] ('SELECT "candidates"."repository_id" FROM (SELECT DISTINCT "t1"."repository_id" FROM "tag" AS "t1" INNER JOIN "repository" AS "t2" ON ("t1"."repository_id" = "t2"."id") INNER JOIN "user" AS "t3" ON ("t2"."namespace_user_id" = "t3"."id") WHERE ((((NOT ("t1"."lifetime_end_ms" IS %s) AND ("t1"."lifetime_end_ms" <= %s)) AND ("t3"."removed_tag_expiration_s" = %s)) AND ("t3"."enabled" = %s)) AND ("t2"."state" != %s)) LIMIT %s) AS "candidates" ORDER BY Random() LIMIT %s OFFSET %s', [None, 1731438777567, 1209600, True, 3, 500, 1, 0]) gcworker stdout | 2024-11-26 19:12:57,571 [76] [DEBUG] [__main__] No repository with garbage found
The timezone in Quay is set to GMT, the PostgreSQL timezone is set to America/Toronto or EST. So these two are the same. There is no closure of the db connection from gcworker side, even though the transaction is done and was committed. If pooling is turned off, shouldn't there be a disconnection as well?
Can someone please explain this behaviour?