Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: quay-v3.8.6, quay-v3.9.0
Affects Version/s: quay-3.7
Component/s: -area/georep, quay
Labels:

Blocked:
False
Blocked Reason:
None
Ready:
False
Product:
Quay Enterprise
Intelligence Requested:
Market:

Target Version:

quay-v3.9.0

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

Currently, we only capture IOErrors in the storage replication worker:

https://github.com/quay/quay/blob/1a60cbe7fbdd391f073035d94596bbad8f8c0842/workers/storagereplication.py#L122

For any other exception, the worker errors out with no possibility of a restart. While this is acceptable in certain situations, it becomes a problem for intermittent networking issues, since the instance no longer participates in blob geo-replication. This is one such example:

storagereplication stdout | 2022-11-28 23:06:23,478 [95] [ERROR] [__main__] Unknown exception when copying path sha256/4b/4b47a2dec9fd7d24dafb63d6a612b8eeb26985316c34637523ef383504da3e15 of image storage 0ee91d6d-4b71-4ce3-a865-fc982bd323fd to loc LOCATION
storagereplication stdout | Traceback (most recent call last):
storagereplication stdout |   File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 441, in _error_catcher
...
packages/botocore/response.py", line 103, in read
storagereplication stdout |     raise ResponseStreamingError(error=e)
storagereplication stdout | botocore.exceptions.ResponseStreamingError: An error occurred while reading from response stream: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
storagereplication stdout | 2022-11-28 23:06:23,491 [95] [ERROR] [workers.queueworker] The worker has encountered an error via the job and will not take new jobs

We propose that this exception also be caught by the code and that a retry be issued.
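As a rough illustration of the proposal, the sketch below catches unexpected exceptions in addition to IOError, pauses briefly, and then asks for the job to be retried. It is not the actual Quay code: `JobRetryRequested`, `replicate_blob`, `copy_fn`, and `backoff_seconds` are hypothetical names chosen for this example, and the real worker's queue and storage interfaces are not reproduced here.

```python
import logging
import time

logger = logging.getLogger(__name__)


class JobRetryRequested(Exception):
    """Hypothetical signal asking the queue to re-run the job later."""


def replicate_blob(copy_fn, path, location, backoff_seconds=0.0, sleep=time.sleep):
    """Copy one blob and turn any failure into a retry request.

    copy_fn is a stand-in for whatever performs the storage copy.
    """
    try:
        copy_fn(path, location)
    except IOError as exc:
        # The case that was already handled: log it and request a retry.
        logger.exception("Failed to copy path %s to %s", path, location)
        raise JobRetryRequested() from exc
    except Exception as exc:
        # Any other error (for example a dropped connection surfaced by the
        # storage client): wait, then request a retry so the worker keeps
        # taking jobs instead of shutting down.
        logger.exception("Unknown exception when copying path %s to %s", path, location)
        sleep(backoff_seconds)
        raise JobRetryRequested() from exc
```

The pause before the retry is meant to give a transient network fault time to clear. The `sleep` parameter exists only so the delay can be replaced in tests.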

links to

quay/quay#1792: storagereplication: sleep on unexpected exception for retry (PROJQUAY-4792)

quay/quay#1810: [redhat-3.8] storagereplication: sleep on unexpected exception for retry (PROJQUAY-4792)

mentioned on

Merge request - Updated US source to: 2e5f257 storagereplication: sleep on unexpected exception for retry (PROJQUAY-4792) (#1792)

Merge request - Updated US source to: 12b0bc2 chore: v3.8.6 changelog bump (PROJQUAY-5279) (#1824)

Merge request - Updated US source to: 3809d7f storagereplication: sleep on unexpected exception for retry (PROJQUAY-4792) (#1810)

Merge request - Updated US source to: b2a5b3a ldap: Don't convert dashes to underscores in usernames (PROJQUAY-5253) (#1808)
