Automation Hub / AAH-1415

Signal 9 killing pulp-worker for curate synclist repo tasks

    • Type: Bug
    • Resolution: Obsolete
    • Priority: Normal
    • None
    • cloud-2022-04-28
    • Backend, Infrastructure
    • False
    • False

      Our `curate_all_synclist_repository` parent task creates many `curate_synclist_repository_batch` tasks. In our c.rh.c environments (ephemeral envs, stage, prod), many of these batch tasks fail with state=`failed` and reason=`Worker has gone missing.`

      However, the tasks appear to complete, and the worker does not appear to be missing:

      • Inside a failed task, the worker record shows missing=`False` with an updated heartbeat, and the worker name still matches an active pod. There is no sign of an OOM kill. We can’t easily add dmesg, but we can investigate further if that seems the best course of action.
      • The logs seem to show these tasks running as expected, adding CollectionVersions to synclist repositories. Each task touches hundreds of synclist repos, so we have not validated all of them.
      • Each time we run `curate_all_synclist_repository`, of the 94 batch tasks we consistently get 36 with state=`completed` and 58 with state=`failed`.
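
The check described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the tooling we actually used: the record fields (`id`, `state`, `reason`, `worker_missing`) and the helper name `summarize_batch_tasks` are assumptions made for the example, and the sample records are synthetic, shaped only to mirror the counts reported above.

```python
from collections import Counter

# Failure reason reported on the failed batch tasks.
MISSING_REASON = "Worker has gone missing."


def summarize_batch_tasks(tasks):
    """Tally task states and collect failed tasks whose worker still looks alive.

    `tasks` is a list of dicts with hypothetical keys:
    id, state, reason, worker_missing.
    """
    state_counts = Counter(task["state"] for task in tasks)
    suspicious = [
        task["id"]
        for task in tasks
        if task["state"] == "failed"
        and task.get("reason") == MISSING_REASON
        and task.get("worker_missing") is False
    ]
    return state_counts, suspicious


# Synthetic records only: 36 completed and 58 failed, as in the report.
tasks = [{"id": i, "state": "completed"} for i in range(36)] + [
    {
        "id": 36 + i,
        "state": "failed",
        "reason": MISSING_REASON,
        "worker_missing": False,
    }
    for i in range(58)
]

state_counts, suspicious = summarize_batch_tasks(tasks)
print(dict(state_counts))
print(len(suspicious))
```

A failed task that lands in `suspicious` is one where the recorded failure reason contradicts the worker's own status, which is the mismatch this issue is about.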

            Assignee: Unassigned
            Reporter: Andrew Crosby (awcrosby5) (Inactive)