Project Quay / PROJQUAY-4792

Allow storage replication worker to continue functioning on certain exceptions

    • Quay Enterprise

      Currently, the storage replication worker only catches IOError:

      https://github.com/quay/quay/blob/1a60cbe7fbdd391f073035d94596bbad8f8c0842/workers/storagereplication.py#L122

      For any other exception, the worker errors out and stops taking new jobs, without any possibility of a restart. While this is acceptable in certain situations, it becomes a problem for intermittent networking issues, because the instance no longer participates in blob geo-replication. This is one such example:

      storagereplication stdout | 2022-11-28 23:06:23,478 [95] [ERROR] [__main__] Unknown exception when copying path sha256/4b/4b47a2dec9fd7d24dafb63d6a612b8eeb26985316c34637523ef383504da3e15 of image storage 0ee91d6d-4b71-4ce3-a865-fc982bd323fd to loc LOCATION
      storagereplication stdout | Traceback (most recent call last):
      storagereplication stdout |   File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 441, in _error_catcher
      ...
      packages/botocore/response.py", line 103, in read
      storagereplication stdout |     raise ResponseStreamingError(error=e)
      storagereplication stdout | botocore.exceptions.ResponseStreamingError: An error occurred while reading from response stream: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
      storagereplication stdout | 2022-11-28 23:06:23,491 [95] [ERROR] [workers.queueworker] The worker has encountered an error via the job and will not take new jobs
      

      We propose that this exception is also caught by the code and that a retry is issued.
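
      A minimal sketch of the proposed handling, not the actual Quay code. RetryableJobError, WorkerFatalError, TRANSIENT_ERRORS and replicate_blob are hypothetical names used only for illustration; they stand in for whatever the queue worker uses to signal "retry this job" versus "stop the worker". The only real identifier is botocore.exceptions.ResponseStreamingError, taken from the traceback above. Note that ConnectionResetError is a subclass of OSError (and IOError is an alias of OSError in Python 3), so it would already be caught if raised directly; the failure here arises because botocore wraps it in ResponseStreamingError, which is not an IOError.

```python
import logging

logger = logging.getLogger(__name__)

try:
    # Real exception type seen in the traceback.
    from botocore.exceptions import ResponseStreamingError
except ImportError:
    # botocore is not installed: define a stand-in so the sketch still runs.
    class ResponseStreamingError(Exception):
        def __init__(self, error=None):
            super().__init__(
                "An error occurred while reading from response stream: %s" % (error,)
            )


class RetryableJobError(Exception):
    """Hypothetical signal: the job failed and should be retried later."""


class WorkerFatalError(Exception):
    """Hypothetical signal: the worker should stop taking new jobs."""


# Exceptions treated as transient. IOError is what is caught today;
# ResponseStreamingError is the proposed addition.
TRANSIENT_ERRORS = (IOError, ResponseStreamingError)


def replicate_blob(copy_fn, path, storage_uuid, location):
    """Run copy_fn() and map failures to a retry signal or a fatal signal.

    copy_fn is a hypothetical zero-argument callable that performs the copy.
    """
    try:
        copy_fn()
    except TRANSIENT_ERRORS as exc:
        # Likely a network hiccup: log it and ask for the job to be retried.
        logger.warning(
            "Transient error when copying path %s of image storage %s to loc %s: %s",
            path, storage_uuid, location, exc,
        )
        raise RetryableJobError() from exc
    except Exception as exc:
        # Anything else keeps the current behaviour: the worker gives up.
        logger.exception(
            "Unknown exception when copying path %s of image storage %s to loc %s",
            path, storage_uuid, location,
        )
        raise WorkerFatalError() from exc
```

      Keeping the transient types in one tuple makes it straightforward to add further exceptions later if other intermittent failures show up, while unknown exceptions still stop the worker as they do now.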

              sleesinc Kenny Lee Sin Cheong
              rhn-support-ibazulic Ivan Bazulic