- Bug
- Resolution: Done
- Critical
- quay-3.7
Currently, we only catch IOError exceptions in the storage replication worker.
For any other exception, the worker errors out with no possibility of a restart. This is acceptable in some situations, but for intermittent networking issues it becomes a problem, because the instance no longer participates in blob geo-replication. Here is one such example:
storagereplication stdout | 2022-11-28 23:06:23,478 [95] [ERROR] [__main__] Unknown exception when copying path sha256/4b/4b47a2dec9fd7d24dafb63d6a612b8eeb26985316c34637523ef383504da3e15 of image storage 0ee91d6d-4b71-4ce3-a865-fc982bd323fd to loc LOCATION
storagereplication stdout | Traceback (most recent call last):
storagereplication stdout | File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 441, in _error_catcher
... packages/botocore/response.py", line 103, in read
storagereplication stdout | raise ResponseStreamingError(error=e)
storagereplication stdout | botocore.exceptions.ResponseStreamingError: An error occurred while reading from response stream: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
storagereplication stdout | 2022-11-28 23:06:23,491 [95] [ERROR] [workers.queueworker] The worker has encountered an error via the job and will not take new jobs
We propose that this exception also be caught by the code and that a retry be issued.
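The proposal could be sketched roughly as follows. This is a minimal illustration, not the actual Quay worker code: `copy_with_retry`, `RETRYABLE`, and `flaky_copy` are hypothetical names, the retry count and backoff are assumptions, and the local `ResponseStreamingError` class is only a stand-in for `botocore.exceptions.ResponseStreamingError` so that the sketch runs without botocore installed.

```python
import time


class ResponseStreamingError(Exception):
    """Stand-in for botocore.exceptions.ResponseStreamingError (assumption:
    defined locally so this sketch runs without botocore)."""


# Exception types treated as transient. IOError is what is caught today;
# the streaming error is the proposed addition. The exact set is an assumption.
RETRYABLE = (IOError, ResponseStreamingError)


def copy_with_retry(copy_fn, attempts=3, delay=0.0):
    """Call copy_fn(), retrying on transient errors.

    Re-raises the last transient error once all attempts are used up.
    Any exception not listed in RETRYABLE propagates immediately.
    """
    for attempt in range(1, attempts + 1):
        try:
            return copy_fn()
        except RETRYABLE:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)  # simple linear backoff between tries


# Hypothetical copy operation: fails twice with a streaming error, then succeeds.
calls = {"n": 0}


def flaky_copy():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ResponseStreamingError("Connection reset by peer")
    return "copied"


result = copy_with_retry(flaky_copy)
```

With this shape, an intermittent connection reset costs a few extra attempts instead of taking the worker out of geo-replication, while unexpected exception types still surface as before.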