-
Bug
-
Resolution: Done-Errata
Description
We use galaxy-ng in our tests for ansible-core upstream. The jobs are flaky and get stuck periodically.
We don't always see the helper container logs, as they aren't available in the CI, but I recently observed something locally that screams race condition, which leads me to believe that the problem is specifically in pulp_ansible.
Here are the corresponding log lines:
ERROR! Galaxy import process failed: duplicate key value violates unique constraint "unique_is_highest" DETAIL: Key (collection_id, is_highest)=(7cd3d7cd-be93-4f7e-8ca2-07a73f2a8444, t) already exists. (Code: UNKNOWN)
To add more context, our tests run a galaxy-ng container and provision a bunch of dummy collections for testing of the dependency resolver used in the ansible-galaxy collection install|download commands. Then, we proceed to test the related scenarios.
Here are the collections provisioned with our helper: https://github.com/ansible/ansible/blob/81c83c6/test/integration/targets/ansible-galaxy-collection/vars/main.yml#L27-L167
The helper itself is https://github.com/ansible/ansible/blob/81c83c6/test/integration/targets/ansible-galaxy-collection/library/setup_collections.py and upon closer inspection, it runs ansible-galaxy collection publish asynchronously in subprocesses but no more than 4 at a time.
Back to the test collections being provisioned. One of them, namespace1.name1, has 14 different versions. It is the one most likely to hit race conditions, given the nature of the pulp_ansible logic where the failure occurs.
Our helper records the results associated with each upload and the snippet above is extracted from its log.
That log shows successful uploads for every collection in the list except namespace1.name1:0.0.7. All the other artifacts of this collection (0.0.1-0.0.6, 0.0.8-0.0.10, 0.1.0, 1.0.0, 1.0.9, 1.1.0-beta.1) were uploaded successfully, even though they are otherwise identical and are processed in the same loop.
So that duplicate key value violates unique constraint "unique_is_highest" is a clear indication of a database constraint kicking in and failing the DB entry update. It's not hard to imagine that with the asynchronous tasks being scheduled, multiple uploads would trigger simultaneous attempts to update the database table in a way that would cause a conflict.
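The suspected interleaving can be reproduced deterministically outside of pulp_ansible. The sketch below is an illustration only: it uses SQLite instead of PostgreSQL, a hypothetical simplified `collectionversion` table, and assumes the constraint behaves like a partial unique index covering rows where `is_highest` is true. Two simulated workers each check for an existing highest version, both see none, and then both try to set the flag.

```python
import sqlite3

# Hypothetical, simplified schema; not the real pulp_ansible table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE collectionversion (
        id INTEGER PRIMARY KEY,
        collection_id TEXT NOT NULL,
        version TEXT NOT NULL,
        is_highest INTEGER NOT NULL DEFAULT 0
    );
    -- Assumption: at most one row per collection may have is_highest set.
    CREATE UNIQUE INDEX unique_is_highest
        ON collectionversion (collection_id, is_highest)
        WHERE is_highest = 1;
    """
)


def add_version(conn, collection_id, version):
    """Insert a version row (not flagged as highest) and return its id."""
    cur = conn.execute(
        "INSERT INTO collectionversion (collection_id, version) VALUES (?, ?)",
        (collection_id, version),
    )
    return cur.lastrowid


def current_highest(conn, collection_id):
    """Return the id of the row flagged as highest, or None."""
    row = conn.execute(
        "SELECT id FROM collectionversion"
        " WHERE collection_id = ? AND is_highest = 1",
        (collection_id,),
    ).fetchone()
    return row[0] if row else None


def mark_highest(conn, row_id):
    """Flag a row as highest without clearing any other row first."""
    conn.execute("UPDATE collectionversion SET is_highest = 1 WHERE id = ?", (row_id,))


id_a = add_version(conn, "collection-a", "0.0.6")
id_b = add_version(conn, "collection-a", "0.0.7")

# Both workers run their check before either of them writes.
seen_by_a = current_highest(conn, "collection-a")
seen_by_b = current_highest(conn, "collection-a")

mark_highest(conn, id_a)  # worker A wins
try:
    mark_highest(conn, id_b)  # worker B acts on a stale read
    error = None
except sqlite3.IntegrityError as exc:
    error = str(exc)

print("worker B error:", error)
```

The second write fails on the unique index, which is the same class of failure as the one in the log above. Whether the real code path follows exactly this check-then-act pattern is an assumption.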
I'm convinced that if the code were to use transactions/locks/retries, it could be more robust and process this case more gracefully.
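As a rough illustration of that idea (not the pulp_ansible implementation), the sketch below recomputes the flag inside a single transaction and retries when the constraint is still violated. The schema, the function names, and the naive numeric `X.Y.Z` version comparison are all assumptions made for the example; SQLite stands in for PostgreSQL.

```python
import sqlite3
import time


def make_db():
    """Create an in-memory database with a hypothetical, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE collectionversion (
            id INTEGER PRIMARY KEY,
            collection_id TEXT NOT NULL,
            version TEXT NOT NULL,
            is_highest INTEGER NOT NULL DEFAULT 0
        );
        CREATE UNIQUE INDEX unique_is_highest
            ON collectionversion (collection_id, is_highest)
            WHERE is_highest = 1;
        """
    )
    return conn


def version_key(version):
    # Simplification: handles plain numeric X.Y.Z only, no pre-release tags.
    return tuple(int(part) for part in version.split("."))


def recompute_highest(conn, collection_id):
    """Clear the old flag and set the new one in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        rows = conn.execute(
            "SELECT id, version FROM collectionversion WHERE collection_id = ?",
            (collection_id,),
        ).fetchall()
        if not rows:
            return None
        best_id = max(rows, key=lambda row: version_key(row[1]))[0]
        conn.execute(
            "UPDATE collectionversion SET is_highest = 0"
            " WHERE collection_id = ? AND is_highest = 1",
            (collection_id,),
        )
        conn.execute(
            "UPDATE collectionversion SET is_highest = 1 WHERE id = ?",
            (best_id,),
        )
        return best_id


def with_retries(func, attempts=5, delay=0.01):
    """Call func, retrying with exponential backoff on constraint violations."""
    for attempt in range(attempts):
        try:
            return func()
        except sqlite3.IntegrityError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay * (2 ** attempt))


conn = make_db()
for version, flag in [("0.0.6", 1), ("0.0.7", 0), ("0.0.10", 0)]:
    conn.execute(
        "INSERT INTO collectionversion (collection_id, version, is_highest)"
        " VALUES (?, ?, ?)",
        ("collection-a", version, flag),
    )

with_retries(lambda: recompute_highest(conn, "collection-a"))
highest = conn.execute(
    "SELECT version FROM collectionversion WHERE is_highest = 1"
).fetchone()[0]
print("highest:", highest)
```

On PostgreSQL, a row lock on the parent collection (for example `SELECT ... FOR UPDATE`) would be another way to serialize the updates; the retry wrapper is only a fallback for conflicts that still slip through.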
The constraint itself is introduced @ https://github.com/pulp/pulp_ansible/blob/567b0c7/pulp_ansible/app/models.py#L234 / https://github.com/pulp/pulp_ansible/blob/567b0c7/pulp_ansible/app/migrations/0005_collectionversion_is_highest.py#L32.
I've also found this visualization of the DB schema: https://alikins.github.io/schemaspy/tables/ansible_collectionversion.html.
I believe that the buggy piece of code that gets into the race condition directly is https://github.com/pulp/pulp_ansible/blob/567b0c7/pulp_ansible/app/tasks/collections.py#L424-L477.
Key Dependency Versions
galaxy-ng==4.7.1
pulp-ansible==0.17.3
pulp-container==2.14.7
pulp-glue==0.19.5
pulpcore==3.23.13
Steps to Reproduce
1. Make about a dozen copies of a collection with the same namespace, name and metadata but different versions. Hundreds might work better to make the failure more likely.
2. Run ansible-galaxy collection publish on all of them simultaneously (for example, use a normal shell loop and append "&" to the publish command to spawn many processes).
3. Inspect the logs and the database.
How reproducible: intermittent, under certain conditions
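The steps above can be sketched as a small driver script. This is a minimal sketch, not the helper from the ansible-core test suite: the function names are hypothetical, the tarballs are assumed to be pre-built, and the `ansible-galaxy` client is assumed to be already configured with the target server and credentials. The cap of 4 workers mirrors the helper's concurrency limit.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def publish_one(tarball, command_prefix):
    """Run one publish command and return (tarball, return code, output)."""
    proc = subprocess.run(
        [*command_prefix, str(tarball)],
        capture_output=True,
        text=True,
    )
    return str(tarball), proc.returncode, proc.stdout + proc.stderr


def publish_all(
    tarballs,
    command_prefix=("ansible-galaxy", "collection", "publish"),
    max_workers=4,
):
    """Publish all tarballs with at most max_workers commands in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda t: publish_one(t, command_prefix), tarballs))


def find_constraint_failures(results, needle="unique_is_highest"):
    """Return the tarballs whose command output mentions the constraint."""
    return [name for name, _returncode, output in results if needle in output]
```

A run could look like `results = publish_all(sorted(pathlib.Path("dist").glob("*.tar.gz")))` followed by `print(find_constraint_failures(results))`, where `dist` is a hypothetical directory holding the built artifacts. Since the failure is intermittent, several runs against a fresh server may be needed.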
Actual Behavior
The pointer to the latest collection version is not updated, and the logs contain an unhandled error:
ERROR! Galaxy import process failed: duplicate key value violates unique constraint "unique_is_highest" DETAIL: Key (collection_id, is_highest)=(7cd3d7cd-be93-4f7e-8ca2-07a73f2a8444, t) already exists. (Code: UNKNOWN)
Expected Behavior
No errors in the log. The latest version is always updated.
Upstream GitHub Issue
- links to
-
RHBA-2024:130453 Red Hat Ansible Automation Platform 2.4 Product Release Update