-
Bug
-
Resolution: Obsolete
-
Normal
-
None
-
cloud-2022-04-28
-
False
-
False
-
Our `curate_all_synclist_repository` parent task creates many `curate_synclist_repository_batch` tasks. In our c.rh.c environments (ephemeral envs, stage, prod), many of these fail with state=`failed`, reason=`Worker has gone missing.`
However, the tasks appear to actually complete, and the worker is not missing:
- Inside a failed task, the worker record shows missing=`False` with an up-to-date heartbeat, and the worker name still matches an active pod. Nothing suggests an OOM kill. We can’t easily add dmesg output, but if that seems the best course of action we can investigate further.
- The logs seem to show these tasks running as expected, adding CollectionVersions to synclist repositories (each task touches hundreds of synclist repos, so we have not validated all of them).
- Each time we run `curate_all_synclist_repository`, the result is consistent: of the 94 batch tasks, 36 end with state=`completed` and 58 with state=`failed`.
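
For anyone trying to reproduce the counts above, a tally like the following may help. This is only a minimal sketch: it assumes the task records have already been fetched as a list of dicts with `name`, `state` and `error` keys, and that the failure reason sits under `error["reason"]`. Both the record shape and the helper name `summarize_batch_tasks` are assumptions for illustration, not the actual tooling we used.

```python
from collections import Counter


def summarize_batch_tasks(tasks, name_suffix="curate_synclist_repository_batch"):
    """Count batch tasks by state and tally failure reasons.

    `tasks` is assumed to be an iterable of dicts with "name", "state"
    and an optional "error" dict (hypothetical record shape).
    """
    # Keep only the batch tasks spawned by the parent task.
    batch = [t for t in tasks if (t.get("name") or "").endswith(name_suffix)]

    # Tally states such as "completed" and "failed".
    states = Counter(t.get("state") for t in batch)

    # Tally the reason strings of the failed tasks only.
    reasons = Counter(
        (t.get("error") or {}).get("reason", "<no reason>")
        for t in batch
        if t.get("state") == "failed"
    )
    return states, reasons
```

Calling `summarize_batch_tasks(tasks)` after each run of the parent task would show whether the completed/failed split and the failure reason stay the same from run to run.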