Project Quay / PROJQUAY-607

changing SERVER_HOSTNAME triggers storage replication and 100% database CPU


      We noticed that our Quay environment was getting slower and slower again.
      Browsing the UI even returned a lot of "500 Internal Server Error" responses.

      Upon further investigation it turned out that the database (Azure PostgreSQL, 4 cores) was running at a constant 100% CPU load.

      Here is what we think happened and what we have done so far:

      • On April 2nd we changed the value of "SERVER_HOSTNAME" in the Quay config file.
        Reason: Up to now we had used quay-azr.cloud.internal as the hostname, to indicate that this Quay instance runs on Azure.
        Since we plan to use only this environment for now, it was decided to rename it to the more generic name "quay.cloud.internal".
        So I generated a new certificate (which covers both the old and the new name) and changed that parameter.
      • On Friday we rolled out a change to the Quay infrastructure and, due to a mistake, also enabled the PROXY_STORAGE feature again.
      • On Sunday we noticed that Quay was nearly unusable.
      • We analyzed what was going on in the DB to figure out what was causing the high load, and it turned out that the queueitem table contained 1.2 million rows that appear to belong to the storage replication feature.

      Example:

      2020-04-17 00:46:36.118757  t  5  cb30ec1f-3b1f-4000-b4cd-c311aee1df46  9404907  imagestoragereplication/c2969e2d-2b49-417c-868a-cda2d9751456/  {"namespace_user_id": 9, "storage_id": "c2969e2d-2b49-417c-868a-cda2d9751456"}
      2020-04-17 00:46:50.402728  t  5  d76b426d-40de-44e4-8b20-6c3113998077  9404908  imagestoragereplication/f6eae9ec-af0a-486f-ab33-bc3f84c95d11/  {"namespace_user_id": 9, "storage_id": "f6eae9ec-af0a-486f-ab33-bc3f84c95d11"}
      2020-04-17 00:46:50.402756  t  5  ab6e2d92-4b92-491a-93ad-881a50d9cf7e  9404909  imagestoragereplication/32402315-1a79-427b-8335-ff7f4affa35d/  {"namespace_user_id": 9, "storage_id": "32402315-1a79-427b-8335-ff7f4affa35d"}
      2020-04-17 00:46:50.402772  t  5  7052f456-5031-4d01-9449-3511beff669a  9404910  imagestoragereplication/8c67cfbf-3cb1-457e-99f2-240f4329b343/  {"namespace_user_id": 9, "storage_id": "8c67cfbf-3cb1-457e-99f2-240f4329b343"}
      2020-04-17 00:46:50.402788  t  5  0ae07b38-24c0-433d-9dc4-e09fcf3d290b  9405002  imagestoragereplication/8b8023ff-3d49-45d7-83b4-4e51b6bd467e/  {"namespace_user_id": 9, "storage_id": "8b8023ff-3d49-45d7-83b4-4e51b6bd467e"}
      2020-04-17 00:46:59.231941  t  5  4a45a00f-9489-4757-a8da-7b3ba4f03503  9405003  imagestoragereplication/3f826614-7083-4946-aeea-1d4693e842b4/  {"namespace_user_id": 9, "storage_id": "3f826614-7083-4946-aeea-1d4693e842b4"}
      2020-04-17 00:46:59.23199   t  5  91f9a43a-b1c9-40b9-b097-27114d85bbb8  9405004  imagestoragereplication/dbdb094c-cdfa-48ed-b74e-7ef933238765/  {"namespace_user_id": 9, "storage_id": "dbdb094c-cdfa-48ed-b74e-7ef933238765"}
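
      For reference, queries along the following lines show where the database load is coming from and how many replication items are queued (a sketch only, assuming PostgreSQL's pg_stat_activity view and the stock Quay queueitem schema):

        -- Show which statements are keeping the database busy:
        SELECT pid, state, now() - query_start AS runtime, query
        FROM pg_stat_activity
        WHERE state <> 'idle'
        ORDER BY runtime DESC;

        -- Count the queued storage replication work items:
        SELECT count(*)
        FROM queueitem
        WHERE queue_name LIKE 'imagestoragereplication/%';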

      We have always had the flag FEATURE_STORAGE_REPLICATION set to true, as preparation in case we added more backends later, but so far we have only ever used one Azure Storage Account as the default backend.

      In order to get Quay working again we did the following (see the sketch after this list):

      • Stop all containers
      • Dump the queueitem table to a file
      • Set FEATURE_STORAGE_REPLICATION=false
      • Delete all storage replication rows from the queueitem table
      • Start the containers again
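
      For illustration, the dump and delete steps correspond roughly to the following (a sketch assuming psql against the Azure Postgres database and the stock queueitem schema, not the exact commands that were run):

        -- Back up the affected rows before deleting them (psql client-side copy, single line):
        \copy (SELECT * FROM queueitem WHERE queue_name LIKE 'imagestoragereplication/%') TO 'queueitem_replication.csv' CSV HEADER

        -- Remove the queued storage replication work items:
        DELETE FROM queueitem
        WHERE queue_name LIKE 'imagestoragereplication/%';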

      This has helped a lot, as you can see in the attached graph.
      The graph covers the last 30 days and shows nicely how the load changed.

      So please advise:

      • Why was the storage replication process triggered by the changes we made?
      • What did the replication actually try to do?
      • Why would enabling the PROXY_STORAGE feature cause such a high load?
      • Were the fixes we applied the other day good, or did we cause more trouble?
      • Is there more stuff in the DB that now needs cleanup? (Entries for replication, storagelocation, imagestorage, etc.; see the checks sketched below.)
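
      For context, these are the kinds of read-only checks we could run to see what replication-related state is left, if you can confirm which tables matter (table names below are assumed from the stock Quay schema):

        -- Leftover replication queue items:
        SELECT count(*) FROM queueitem WHERE queue_name LIKE 'imagestoragereplication/%';

        -- Configured storage locations:
        SELECT * FROM imagestoragelocation;

        -- Image placements per storage location:
        SELECT location_id, count(*)
        FROM imagestorageplacement
        GROUP BY location_id;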

      Where are you experiencing the behavior? What environment?
      Production
