Satellite / SAT-21126

Capsule sync gets stuck forever after specific connection issues


    • Important

      Description of problem:
      After certain temporary Sat<->Caps connection issues, Capsule sync gets stuck, pending on artifact downloads that will never complete.

      Particular scenarios for the connectivity issues:

      • drop packets between Sat and Caps for a while (2-3 minutes is enough)
      • reject packets between them for a specific amount of time (timing is crucial, see the reproducer below)
      • optionally, apply connection rate limiting first, and then drop packets (or reject them for a specific amount of time)

      I think the reproducer should apply to Satellite itself (syncing from some upstream repo) as well, since the scenarios are quite generic - just restrict connectivity to the upstream host during a repo sync task.

      Version-Release number of selected component (if applicable):
      Sat 6.12 / python39-pulpcore-3.18.16-1.el8pc.noarch
      (I expect any Sat / pulp3 version to be affected)

      How reproducible:
      100%

      Steps to Reproduce:
      1. Have a Capsule with the Immediate download policy and no content downloaded.
      2. Have one CV with one bigger repo (say, RHEL9 BaseOS) so there is syncable content for the Capsule (optionally, do a Complete Sync of that one repo after purging all repos and content from the Capsule).
      3. Launch the Capsule (Complete) Sync.
      4. Check that the dynflow step tracks a pulp task that is downloading packages, e.g.:

      message: Downloading Artifacts
      code: sync.downloading.artifacts
      state: running
      done: 259

      5. While the package download is progressing, run one of the scenarios below on the Capsule to mimic connectivity issues:

      5a. drop packets to/from Satellite:
      SATELLITE=ip.address.of.satellite
      iptables -F
      iptables -A OUTPUT -p tcp -d $SATELLITE -j DROP
      iptables -A INPUT -p tcp -s $SATELLITE -j DROP

      5b. reject packets to/from Satellite:
      SATELLITE=ip.address.of.satellite
      iptables -F
      iptables -A OUTPUT -p tcp -d $SATELLITE -j REJECT
      iptables -A INPUT -p tcp -s $SATELLITE -j REJECT

      5c. apply rate limiting and drop packets over the limit:
      SATELLITE=ip.address.of.satellite
      LIMIT=10/s
      iptables -F
      iptables -A OUTPUT -p tcp -d $SATELLITE -m limit --limit $LIMIT -j ACCEPT
      iptables -A INPUT -p tcp -s $SATELLITE -m limit --limit $LIMIT -j ACCEPT
      iptables -A OUTPUT -p tcp -d $SATELLITE -j DROP
      iptables -A INPUT -p tcp -s $SATELLITE -j DROP

      5d. apply rate limiting and reject packets over the limit:
      SATELLITE=ip.address.of.satellite
      LIMIT=10/s
      iptables -F
      iptables -A OUTPUT -p tcp -d $SATELLITE -m limit --limit $LIMIT -j ACCEPT
      iptables -A INPUT -p tcp -s $SATELLITE -m limit --limit $LIMIT -j ACCEPT
      iptables -A OUTPUT -p tcp -d $SATELLITE -j REJECT
      iptables -A INPUT -p tcp -s $SATELLITE -j REJECT

      6. Wait some time, at least until the Capsule logs something like:
      Nov 2 13:43:36 pmoravec-caps612 pulpcore-worker-3[72785]: Backing off download_wrapper(...) for 0.3s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host pmoravec-sat612.domain.redhat.com:443 ssl:default [Connect call failed ('1.2.3.4', 443)])

      Once that is logged, run "iptables -F" to end the mimicked temporary connectivity problem.

      7. Monitor the progress of the pulp task - it will now download packages at the usual high speed.

      8. Once the progress stops moving for some time, check the pulp task status (a small polling sketch follows these steps) - either it is done (along with the whole Capsule sync), or it is stuck forever, waiting on downloads of the few packages affected by the "Backing off download_wrapper" error that will never complete.
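
      To tell a finished pulp task from a stuck one, the Pulp 3 task API can also be polled directly. A minimal sketch in Python, assuming the Capsule's Pulp API is reachable at https://<capsule-fqdn>/pulp/api/v3/ and that the certificate, key and CA paths below are valid on your system (the FQDN and the paths are assumptions - adjust for your environment):

      import requests

      CAPSULE_API = "https://capsule.example.com/pulp/api/v3"                  # hypothetical FQDN
      CERT = ("/etc/foreman/client_cert.pem", "/etc/foreman/client_key.pem")   # assumed client cert/key
      CA = "/etc/pki/katello/certs/katello-server-ca.crt"                      # assumed CA bundle

      def show_running_tasks():
          """Print every running Pulp task and its progress reports."""
          resp = requests.get(f"{CAPSULE_API}/tasks/", params={"state": "running"},
                              cert=CERT, verify=CA)
          resp.raise_for_status()
          for task in resp.json()["results"]:
              print(task["pulp_href"], task["name"], task["state"])
              for pr in task.get("progress_reports", []):
                  # A sync hit by this bug keeps "sync.downloading.artifacts"
                  # in state "running" with "done" never increasing.
                  print("   ", pr["code"], pr["state"], "done:", pr.get("done"))

      if __name__ == "__main__":
          show_running_tasks()

      If the "done" counter of sync.downloading.artifacts stops increasing while the task stays in state "running", the sync has most likely hit this bug.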

      Actual results:
      Scenario 5b. leads to a failed Capsule sync that is unable to download some packages - that is expected. /var/log/messages contains a backtrace about it (the first BT in Additional info).

      Scenarios 5a., 5c. and sometimes 5d. lead to a stuck Capsule sync. 5d. gets stuck only when iptables is flushed before the exception from Additional info appears; if you are too slow, pulp detects too many connection problems and cancels the sync as in Scenario 5b. In my tests, the BT from Additional info appears 13 seconds after the "Backing off download_wrapper" message, so those 13 seconds are the window of opportunity to hit the bug in Scenario 5d.

      Scenarios 5a. and 5c. (packet drop) seem to get stuck every time (just flush iptables any time after the "Backing off .." error).

      (I think Scenario 5b. implies that if you trigger a Capsule Sync, restart some Satellite services and are unlucky with the timing, you hit exactly this situation and the Capsule gets stuck forever - which can be difficult to troubleshoot.)

      Expected results:
      The pulp task should terminate in a reasonable time in every scenario. Even cancelling the task due to temporary connection problems is a much better outcome than a Capsule sync stuck forever.
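
      For illustration only (this is not pulpcore's actual downloader, just a sketch of the general mitigation in Python/aiohttp): bounding the per-request socket timeouts, or putting a hard deadline around the whole download coroutine, would turn the silent hang of the DROP scenarios into an exception that the task can retry or fail on:

      import asyncio
      import aiohttp

      async def bounded_fetch(url: str) -> bytes:
          # sock_connect bounds connection establishment, sock_read bounds the
          # gap between received chunks - so a silently dropped connection
          # raises asyncio.TimeoutError instead of waiting forever.
          timeout = aiohttp.ClientTimeout(total=None, sock_connect=30, sock_read=60)
          async with aiohttp.ClientSession(timeout=timeout) as session:
              async with session.get(url) as resp:
                  resp.raise_for_status()
                  return await resp.read()

      async def fetch_with_deadline(url: str, deadline: float = 600.0) -> bytes:
          # Belt-and-braces variant: cap the whole download, whatever stalls inside.
          return await asyncio.wait_for(bounded_fetch(url), timeout=deadline)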

      Additional info:
      BT from pulp task giving up sync attempts (expected behaviour):

      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: result = func(*args, **kwargs)
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulp_rpm/app/tasks/synchronizing.py", line 576, in synchronize
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: repo_version = dv.create() or repo.latest_version()
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/declarative_version.py", line 161, in create
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: loop.run_until_complete(pipeline)
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/asyncio/base_events.py", line 647, in run_until_complete
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: return future.result()
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/api.py", line 225, in create_pipeline
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: await asyncio.gather(*futures)
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/api.py", line 43, in __call__
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: await self.run()
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/artifact_stages.py", line 183, in run
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: pb.done += task.result() # download_count
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/artifact_stages.py", line 209, in _handle_content_unit
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: await asyncio.gather(*downloaders_for_content)
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/models.py", line 89, in download
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: download_result = await downloader.run(extra_data=self.extra_data)
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/download/http.py", line 273, in run
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: return await download_wrapper()
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/backoff/_async.py", line 133, in retry
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: ret = await target(*args, **kwargs)
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/download/http.py", line 258, in download_wrapper
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: return await self._run(extra_data=extra_data)
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulp_rpm/app/downloaders.py", line 114, in _run
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: async with self.session.get(
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/site-packages/aiohttp/client.py", line 1138, in __aenter__
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: self._resp = await self._coro
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/site-packages/aiohttp/client.py", line 535, in _request
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: conn = await self._connector.connect(
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/site-packages/aiohttp/connector.py", line 542, in connect
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: proto = await self._create_connection(req, traces, timeout)
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/site-packages/aiohttp/connector.py", line 907, in _create_connection
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: _, proto = await self._create_direct_connection(req, traces, timeout)
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/site-packages/aiohttp/connector.py", line 1206, in _create_direct_connection
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: raise last_exc
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/site-packages/aiohttp/connector.py", line 1175, in _create_direct_connection
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: transp, proto = await self._wrap_create_connection(
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/site-packages/aiohttp/connector.py", line 992, in _wrap_create_connection
      Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: raise client_error(req.connection_key, exc) from exc
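
      (For context: the backoff/_async.py frames above come from a retry decorator wrapped around the downloader, and the "Backing off download_wrapper(...) for 0.3s" messages from step 6 are its standard log output. Roughly, as in the illustration below - the fetch() function is hypothetical and the retry cap is an assumed value, not pulp's - a REJECT raises ClientConnectorError immediately, so the retries are exhausted within seconds and the task fails with the backtrace above, whereas a DROP makes the connect or read stall without raising, so the decorator never gets another exception to give up on.)

      import aiohttp
      import backoff

      # Retry on connection errors with exponential backoff, giving up after a
      # fixed number of attempts (max_tries is an assumed value).
      @backoff.on_exception(backoff.expo, aiohttp.ClientConnectorError, max_tries=10)
      async def fetch(url: str) -> bytes:
          async with aiohttp.ClientSession() as session:
              async with session.get(url) as resp:
                  return await resp.read()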

      BT of a worker stuck forever in download attempts:

      Thread 1 (Thread 0x7ff0931f1740
      #8 0x00007ff092bea7ca in _PyEval_EvalFrame (throwflag=0, f=Frame 0x559328068b10, for file /usr/lib64/python3.9/selectors.py, line 469, in select
      #15 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x5593280688e0, for file /usr/lib64/python3.9/asyncio/base_events.py, line 1869, in _run_once
      #22 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a4469a0, for file /usr/lib64/python3.9/asyncio/base_events.py, line 601, in run_forever
      #29 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a4465e0, for file /usr/lib64/python3.9/asyncio/base_events.py, line 634, in run_until_complete
      #36 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a461dd0, for file /usr/lib/python3.9/site-packages/pulpcore/plugin/stages/declarative_version.py, line 161, in create
      #43 0x00007ff092bea7ca in _PyEval_EvalFrame (throwflag=0, f=Frame 0x559327fb7990, for file /usr/lib/python3.9/site-packages/pulp_rpm/app/tasks/synchronizing.py, line 576, in synchronize
      #51 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x559328051300, for file /usr/lib/python3.9/site-packages/pulpcore/tasking/pulpcore_worker.py, line 410, in _perform_task
      #59 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a431040, for file /usr/lib64/python3.9/multiprocessing/process.py, line 108, in run
      #66 0x00007ff092be4e63 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x5593280509b0, for file /usr/lib64/python3.9/multiprocessing/process.py, line 315, in _bootstrap
      #75 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a492240, for file /usr/lib64/python3.9/multiprocessing/popen_fork.py, line 71, in _launch
      #82 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a4c8040, for file /usr/lib64/python3.9/multiprocessing/popen_fork.py, line 19, in __init__
      #95 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a496380, for file /usr/lib64/python3.9/multiprocessing/context.py, line 277, in _Popen
      #102 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a4961e0, for file /usr/lib64/python3.9/multiprocessing/context.py, line 224, in _Popen
      #109 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a483d40, for file /usr/lib64/python3.9/multiprocessing/process.py, line 121, in start
      #116 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x55932803e2a0, for file /usr/lib/python3.9/site-packages/pulpcore/tasking/pulpcore_worker.py, line 298, in supervise_task
      #123 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a46d040, for file /usr/lib/python3.9/site-packages/pulpcore/tasking/pulpcore_worker.py, line 366, in run_forever
      #130 0x00007ff092bea7ca in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a4c7040, for file /usr/lib/python3.9/site-packages/pulpcore/tasking/entrypoint.py, line 28, in worker
      #138 0x00007ff092be4e63 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a529840, for file /usr/lib/python3.9/site-packages/click/core.py, line 754, in invoke
      #148 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a5a9580, for file /usr/lib/python3.9/site-packages/click/core.py, line 1395, in invoke
      #155 0x00007ff092bea7ca in _PyEval_EvalFrame (throwflag=0, f=Frame 0x559327e5b4f0, for file /usr/lib/python3.9/site-packages/click/core.py, line 1053, in main
      #165 0x00007ff092be4e63 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff083448b20, for file /usr/lib/python3.9/site-packages/click/core.py, line 1128, in __call__

      (some "futures" object is incomplete (forever) due to improperly caught exception..?)
