Details
Description
Issue: The customer has two buckets, default (emea) and apacstorage, and apacstorage needs to be decommissioned. However, util/backfillreplication.py is unable to queue jobs for replication and gets killed after a while.
sh-4.2$ scl enable python27 bash
bash-4.2$ python -m util.backfillreplication
No handlers could be found for logger "util.config.provider.baseprovider"
data/secscan_model/__init__.py:22: DeprecationWarning: Call to deprecated class V2SecurityScanner. (Will be replaced by a V4 API security scanner soon)
  self._model = V2SecurityScanner(app, instance_keys, storage)
Enqueueing image storage fa0dee00-87e1-4148-8a71-4ccc59cde327 to be replicated
Enqueueing image storage e6c60f41-0aa5-4c59-9044-f67311d194a9 to be replicated
Enqueueing image storage 23e199b2-1eb5-4e74-9bfc-675c571d4bb2 to be replicated
Enqueueing image storage 9bc4c59a-646c-4180-8f8b-f605d87a4fd2 to be replicated
Killed
We added more debug steps and could see that it fails at a different repository every time, and that it tries to replicate layers from non-existent storage buckets:
image vehicle-service-admin-center-gateway
namespace %s 461
locations %s set([])
default location ['default', 'apacstorage']
locations_required %s set(['default', 'apacstorage'])
existing_locations set([u'default', u'europestorage', u'ndcstorage', u'cloudian', u'apacstorage'])
locations_missing################# set([])
image legal-structure
Killed
Root cause: the customer added and removed multiple storage locations in the past, some of the blobs were not copied over correctly, and images still point to the older S3 buckets.
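The debug output above can be read as two set differences. The sketch below uses the values printed for vehicle-service-admin-center-gateway; that the script computes the missing locations as "required minus existing" is an assumption inferred from the variable names, and `stale_locations` is a hypothetical name introduced here for illustration.

```python
# Values copied from the debug output for one image storage.
locations_required = {"default", "apacstorage"}
existing_locations = {
    "default", "europestorage", "ndcstorage", "cloudian", "apacstorage",
}

# Assumption: "missing" means required locations that have no placement yet.
locations_missing = locations_required - existing_locations

# Hypothetical helper value: placements that reference buckets which are
# no longer part of the required (configured) set.
stale_locations = existing_locations - locations_required

print("locations_missing:", sorted(locations_missing))
print("stale_locations:", sorted(stale_locations))
```

Under that assumption nothing is missing for this storage, yet three of its placements still reference buckets that have been removed, which matches the root cause described above.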
qedb=# select * from imagestoragelocation;
 id |       name
----+-------------------
  1 | s3_us_east_1
  2 | s3_eu_west_1
  3 | s3_ap_southeast_1
  4 | s3_ap_southeast_2
  5 | s3_ap_northeast_1
  6 | s3_sa_east_1
  7 | local
  8 | s3_us_west_1
  9 | default
 10 | apacstorage
 11 | europestorage
 44 | ndcstorage
 45 | cloudian
 46 | oldbuck
 47 | ecs
(15 rows)
The following numbers of records from removed buckets are still present in the DB:
#Cloudian
select count(*) from imagestorageplacement where storage_id in (select storage_id from imagestorageplacement group by storage_id having count(*)>1) and location_id=45;
 count
--------
 443233
(1 row)

#europestorage
qedb=# select count(*) from imagestorageplacement where storage_id in (select storage_id from imagestorageplacement group by storage_id having count(*)>1) and location_id=11;
 count
-------
   290

#apacstorage
qedb=# select count(*) from imagestorageplacement where storage_id in (select storage_id from imagestorageplacement group by storage_id having count(*)>1) and location_id=10;
  count
---------
 1082259
(1 row)

#ndcstorage
qedb=# select count(*) from imagestorageplacement where storage_id in (select storage_id from imagestorageplacement group by storage_id having count(*)>1) and location_id=44;
 count
--------
 291433
(1 row)
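The per-location counts above could also be collected in one grouped query. The sketch below shows the query shape against an in-memory SQLite database with made-up rows; only the table and column names come from the ticket, and the production database would need the same SQL run through its own client.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE imagestoragelocation (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE imagestorageplacement (
    id INTEGER PRIMARY KEY, storage_id INTEGER, location_id INTEGER);
INSERT INTO imagestoragelocation VALUES
    (9, 'default'), (10, 'apacstorage'), (45, 'cloudian');
-- Toy rows (not customer data): storage 1 is in default + apacstorage,
-- storage 2 in apacstorage + cloudian, storage 3 only in apacstorage.
INSERT INTO imagestorageplacement (storage_id, location_id) VALUES
    (1, 9), (1, 10), (2, 10), (2, 45), (3, 10);
""")

# Same filter as the ticket's queries (storages with more than one
# placement), grouped by location name instead of one query per location.
QUERY = """
SELECT l.name, COUNT(*)
FROM imagestorageplacement p
JOIN imagestoragelocation l ON l.id = p.location_id
WHERE p.storage_id IN (
    SELECT storage_id FROM imagestorageplacement
    GROUP BY storage_id HAVING COUNT(*) > 1)
GROUP BY l.name
ORDER BY l.name
"""
counts = dict(conn.execute(QUERY).fetchall())
print(counts)
```

Note that the `having count(*)>1` filter only counts storages with more than one placement, so a blob that lives in a single bucket (storage 3 in the toy data) does not appear in these numbers.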
Expectation: We need a way to migrate the data in apacstorage to default (emea) and ensure that Quay points to default for all layers that previously existed in apacstorage.
Suggested recovery steps:
- Remove all storage_id entries whose location_id is one of 11, 44, 45, 46, 47 (europestorage, ndcstorage, cloudian, oldbuck, ecs)
- Recursively copy all data from apacstorage to the emea storage
- Remove apacstorage from the config.yaml file, remove the geo-replication settings, and restart Quay with the new config
- Delete the duplicate placement rows for apacstorage, i.e. rows whose storage_id is present in both emea and APAC:
delete from imagestorageplacement where id IN (select id from (select id from imagestorageplacement where storage_id in (select storage_id from imagestorageplacement group by storage_id having count(*)>1) and location_id=10) AS C);
- Now update location_id to default (emea) for every storage_id whose location_id is still set to apacstorage:
update imagestorageplacement SET location_id=9 where location_id=10;
- Ensure all storage_id values point to the default location_id
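As a way to rehearse the database part of the steps above before touching production, the sketch below applies the same statements to an in-memory SQLite table with made-up rows and then checks the result. The `migrate` function and the "orphaned" check are hypothetical additions for illustration, not part of Quay; the location ids (9, 10 and 11, 44, 45, 46, 47) are taken from the ticket.

```python
import sqlite3

REMOVED = (11, 44, 45, 46, 47)   # europestorage, ndcstorage, cloudian, oldbuck, ecs
DEFAULT_ID, APAC_ID = 9, 10      # default (emea), apacstorage


def migrate(conn):
    """Apply the proposed statements; return storage_ids left with no placement."""
    cur = conn.cursor()
    before = {r[0] for r in cur.execute(
        "SELECT DISTINCT storage_id FROM imagestorageplacement")}
    # Step 1: drop placements that reference removed buckets.
    marks = ",".join("?" * len(REMOVED))
    cur.execute(
        "DELETE FROM imagestorageplacement WHERE location_id IN (%s)" % marks,
        REMOVED)
    # Step 2: drop the apacstorage row where the storage has another placement.
    cur.execute(
        """DELETE FROM imagestorageplacement
           WHERE location_id = ? AND storage_id IN (
               SELECT storage_id FROM imagestorageplacement
               GROUP BY storage_id HAVING COUNT(*) > 1)""",
        (APAC_ID,))
    # Step 3: repoint the remaining apacstorage placements to default.
    cur.execute(
        "UPDATE imagestorageplacement SET location_id = ? WHERE location_id = ?",
        (DEFAULT_ID, APAC_ID))
    after = {r[0] for r in cur.execute(
        "SELECT DISTINCT storage_id FROM imagestorageplacement")}
    return before - after


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE imagestorageplacement ("
    "id INTEGER PRIMARY KEY, storage_id INTEGER, location_id INTEGER)")
# Toy rows (not customer data); storage 5 exists only in a removed bucket.
conn.executemany(
    "INSERT INTO imagestorageplacement (storage_id, location_id) VALUES (?, ?)",
    [(1, 9), (1, 10), (2, 10), (2, 45), (3, 10), (4, 9), (5, 45)])
conn.commit()

orphaned = migrate(conn)
rows = conn.execute(
    "SELECT storage_id, location_id FROM imagestorageplacement "
    "ORDER BY storage_id").fetchall()
print("orphaned storage_ids:", sorted(orphaned))
print("placements:", rows)
```

In the toy data every surviving storage ends up with a single placement in default, but storage 5, which existed only in a removed bucket, is left with no placement at all. Whether such storages exist in the real database is not shown by the counts in this ticket, so it may be worth checking before step 1 is run.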
We need engineering to verify that the above steps are safe for deprecating one of the storage locations and completing the storage migration.