Bug
Resolution: Won't Do
Major
None
We have noticed that our Quay environment was getting slower and slower again.
Browsing the UI even produced lots of "500 Internal Server Error" responses.
Upon further investigation it turned out that the database (Azure Postgres, 4 cores) was under a constant load of 100%.
Here is what we think happened and what we have done so far:
- On April 2nd we changed the value of "SERVER_HOSTNAME" in the Quay config file.
Reason: Until then we had used quay-azr.cloud.internal as the hostname, to indicate that this Quay instance runs on Azure.
Since we plan to use only this environment for now, it was decided to rename it to the more generic "quay.cloud.internal".
So I generated a new certificate (which covers both the old and the new name) and changed said parameter.
- On Friday we rolled out a change to the Quay infrastructure and, due to a mistake, also enabled the PROXY_STORAGE feature again.
- On Sunday we noticed that Quay was nearly unusable.
- We analyzed what was going on in the DB to figure out what was causing the high load, and it turned out that the queueitem table contained 1.2 million rows that look like they belong to the storage replication feature.
Example (rows from the queueitem table, one record per line):

2020-04-17 00:46:36.118757 | t | ||| 5 | cb30ec1f-3b1f-4000-b4cd-c311aee1df46 9404907 | imagestoragereplication/c2969e2d-2b49-417c-868a-cda2d9751456/ | {"namespace_user_id": 9, "storage_id": "c2969e2d-2b49-417c-868a-cda2d9751456"} | |
2020-04-17 00:46:50.402728 | t | ||| 5 | d76b426d-40de-44e4-8b20-6c3113998077 9404908 | imagestoragereplication/f6eae9ec-af0a-486f-ab33-bc3f84c95d11/ | {"namespace_user_id": 9, "storage_id": "f6eae9ec-af0a-486f-ab33-bc3f84c95d11"} | |
2020-04-17 00:46:50.402756 | t | ||| 5 | ab6e2d92-4b92-491a-93ad-881a50d9cf7e 9404909 | imagestoragereplication/32402315-1a79-427b-8335-ff7f4affa35d/ | {"namespace_user_id": 9, "storage_id": "32402315-1a79-427b-8335-ff7f4affa35d"} | |
2020-04-17 00:46:50.402772 | t | ||| 5 | 7052f456-5031-4d01-9449-3511beff669a 9404910 | imagestoragereplication/8c67cfbf-3cb1-457e-99f2-240f4329b343/ | {"namespace_user_id": 9, "storage_id": "8c67cfbf-3cb1-457e-99f2-240f4329b343"} | |
2020-04-17 00:46:50.402788 | t | ||| 5 | 0ae07b38-24c0-433d-9dc4-e09fcf3d290b 9405002 | imagestoragereplication/8b8023ff-3d49-45d7-83b4-4e51b6bd467e/ | {"namespace_user_id": 9, "storage_id": "8b8023ff-3d49-45d7-83b4-4e51b6bd467e"} | |
2020-04-17 00:46:59.231941 | t | ||| 5 | 4a45a00f-9489-4757-a8da-7b3ba4f03503 9405003 | imagestoragereplication/3f826614-7083-4946-aeea-1d4693e842b4/ | {"namespace_user_id": 9, "storage_id": "3f826614-7083-4946-aeea-1d4693e842b4"} | |
2020-04-17 00:46:59.23199 | t | ||| 5 | 91f9a43a-b1c9-40b9-b097-27114d85bbb8 9405004 | imagestoragereplication/dbdb094c-cdfa-48ed-b74e-7ef933238765/ | {"namespace_user_id": 9, "storage_id": "dbdb094c-cdfa-48ed-b74e-7ef933238765"} |
We have always had the flag FEATURE_STORAGE_REPLICATION set to true, as preparation in case we would add more backends later, but so far we have only ever used one Azure Storage Account as the default backend.
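For context, the storage-related part of our config.yaml looked roughly like this before the fix. This is a sketch from memory, not a verbatim copy: the location name azure_default, the Azure driver parameters and the flag name FEATURE_PROXY_STORAGE are our best recollection, so please correct us if those options are actually named differently.

SERVER_HOSTNAME: quay.cloud.internal        # changed from quay-azr.cloud.internal on April 2nd
FEATURE_STORAGE_REPLICATION: true           # set "just in case", although there is only one backend
FEATURE_PROXY_STORAGE: true                 # re-enabled by mistake with Friday's infra change
DISTRIBUTED_STORAGE_CONFIG:
  azure_default:
    - AzureStorage
    - azure_container: quay
      azure_account_name: <storage account name>
      azure_account_key: <storage account key>
      storage_path: /datastorage/registry
DISTRIBUTED_STORAGE_PREFERENCE:
  - azure_default
DISTRIBUTED_STORAGE_DEFAULT_LOCATIONS:
  - azure_default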
In order to get Quay back alive we did the following (the DB part is sketched right after this list):
- Stop all containers
- Dump the queueitem table to a file
- Set FEATURE_STORAGE_REPLICATION=false
- Delete all storagereplication rows from the queueitem table
- Start the containers again
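Roughly, the DB steps looked like the psql session below. We are writing this from memory and are assuming the queue-name column of queueitem is called queue_name (which is what the dump above suggests), so the exact statements may differ slightly.

-- back up the queue contents to a file before touching anything
\copy queueitem TO '/tmp/queueitem_backup.csv' CSV HEADER

-- delete only the storage-replication work items
DELETE FROM queueitem
WHERE queue_name LIKE 'imagestoragereplication/%';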
This has helped a lot, as you can see in the attached graph.
The graph covers the last 30 days and shows nicely how the load changed.
So please advise:
- Why was the storage replication process triggered by the changes we made?
- What did the replication actually try to do?
- Why would enabling the PROXY_STORAGE feature cause such a high load?
- Were the fixes we applied the other day good, or did we cause more trouble?
- Is there more data in the DB that now needs cleanup (entries for replication, storagelocation, imagestorage, etc.)? See the checks sketched below for what we could verify ourselves.
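For that last point, these are the kinds of checks we could run ourselves in the meantime. The table and column names (queue_name, imagestoragelocation) are assumptions based on the dump above and our reading of the schema, so please correct us if they are wrong.

-- are replication work items still present (or being created again) in the queue?
SELECT COUNT(*) FROM queueitem
WHERE queue_name LIKE 'imagestoragereplication/%';

-- which storage locations does Quay currently know about?
SELECT * FROM imagestoragelocation;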
Where are you experiencing the behavior? What environment?
Production