Project Quay · PROJQUAY-10878

Quay 3.17 Org Mirror: discovery claim expires before completion for large Harbor projects, causing infinite restart loop

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version: quay-v3.17.0
    • Component: quay

      Summary

      The organization mirror discovery phase has a hard 30-minute claim expiry (MAX_DISCOVERY_DURATION). For large Harbor projects (e.g. 10,000 repositories), the discovery process requires 100+ paginated HTTP requests each subject to a 30-second network timeout. Under load or latency, the total discovery time can exceed 30 minutes, causing the claim to expire mid-discovery. The next worker run then re-claims the config and restarts discovery from scratch, potentially entering an infinite restart loop that never completes.

      Affected Files

      • data/model/org_mirror.py:862 — MAX_DISCOVERY_DURATION = 60 * 30 (30 minutes)
      • util/orgmirror/harbor_adapter.py:68 — paginated HTTP loop with per-request timeout=30s

      Bug Details

      When an org mirror config is claimed for discovery, an expiration timestamp is set:

      # data/model/org_mirror.py:862
      MAX_DISCOVERY_DURATION = 60 * 30  # 30 minutes
      
      # claim_org_mirror_config()
      expiration_date = now + timedelta(seconds=MAX_DISCOVERY_DURATION)
      

      The Harbor adapter fetches repositories page by page:

      # util/orgmirror/harbor_adapter.py
      params = {"page": page, "page_size": self.page_size}  # default page_size=100
      response = self.session.get(url, params=params, timeout=self.timeout)  # default timeout=30s
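
A minimal sketch of the sequential discovery loop described above (function names and the fake fetcher are hypothetical stand-ins, not the real adapter API; `fetch_page` represents the per-page `session.get` call with its 30-second timeout):

```python
import math

def discover_all_repos(fetch_page, total_repos, page_size=100):
    """Fetch pages sequentially until every repository has been listed.
    Each fetch_page(page) call stands in for one HTTP request that may
    take up to the full 30s timeout under load."""
    repos = []
    pages = math.ceil(total_repos / page_size)
    for page in range(1, pages + 1):
        repos.extend(fetch_page(page))  # time accumulates one page at a time
    return repos

# Simulated source with 10,000 repos: 100 sequential page fetches.
def fake_fetch(page, page_size=100):
    return [f"repo-{(page - 1) * page_size + i}" for i in range(page_size)]

repos = discover_all_repos(fake_fetch, total_repos=10_000)
```

Because the pages are fetched strictly one after another, total discovery time is the sum of all per-request latencies, with no bound other than pages × timeout.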
      

      For a Harbor project with 10,000 repositories:

      • Pages required: 10,000 / 100 = 100 paginated HTTP requests
      • Worst-case per-request time: 30 seconds (full timeout)
      • Worst-case total time: 100 × 30s = 50 minutes

      This exceeds MAX_DISCOVERY_DURATION (30 minutes) by 20 minutes.
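
The arithmetic above can be checked directly (constants taken from the report):

```python
MAX_DISCOVERY_DURATION = 60 * 30   # 30-minute claim window, in seconds
PAGE_SIZE = 100                    # default Harbor page size
PER_REQUEST_TIMEOUT = 30           # default per-page HTTP timeout, in seconds

num_repos = 10_000
pages = num_repos // PAGE_SIZE                   # 100 paginated requests
worst_case = pages * PER_REQUEST_TIMEOUT         # 3000 s = 50 minutes
overshoot = worst_case - MAX_DISCOVERY_DURATION  # 1200 s = 20 minutes
```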

      When the next worker run detects the expired claim, expire_org_mirror_config() resets the config to NEVER_RUN and discovery restarts from page 1. If Harbor remains under load, every subsequent discovery attempt also exceeds the claim window, creating an infinite restart loop in which discovery never completes.
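
The restart cycle can be modeled as a small state machine (the status names mirror the report; the function and dict are illustrative, not the real Quay model API):

```python
def run_discovery(config, discovery_time, max_duration=1800):
    """One worker attempt: claim the config, and if discovery outlasts
    the claim window, the claim expires and the config is reset."""
    config["status"] = "CLAIMED"
    if discovery_time > max_duration:
        config["status"] = "NEVER_RUN"  # expire_org_mirror_config() equivalent
        return False
    config["status"] = "SYNC_QUEUED"
    return True

config = {"status": "NEVER_RUN"}
attempts = 0
# Under sustained load every attempt takes ~50 minutes (3000 s), so the
# loop never exits on its own; it is capped here only for demonstration.
while config["status"] == "NEVER_RUN" and attempts < 5:
    run_discovery(config, discovery_time=3000)
    attempts += 1
```

Every iteration ends back in NEVER_RUN, so the real worker re-claims and restarts indefinitely.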

      Impact

      • Organization mirror discovery never completes for large Harbor projects under network load
      • No repositories are ever queued for sync
      • The failure is silent: the config is reset and retried without alerting the operator
      • The worker continuously consumes CPU and makes repeated paginated API calls to Harbor without making forward progress
      • Affects any Harbor project where: num_repos / page_size × per_request_latency > MAX_DISCOVERY_DURATION
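
The affected-project condition in the last bullet can be evaluated directly (a sketch; parameter defaults are the values cited in this report):

```python
def discovery_can_expire(num_repos, page_size=100, per_request_latency=30.0,
                         max_discovery_duration=1800):
    """True when num_repos / page_size * per_request_latency exceeds
    MAX_DISCOVERY_DURATION, i.e. the claim can expire mid-discovery."""
    return (num_repos / page_size) * per_request_latency > max_discovery_duration

affected = discovery_can_expire(10_000, per_request_latency=30)  # worst case: 3000 s
ok = discovery_can_expire(1_000, per_request_latency=30)         # 300 s, completes
```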

      Reproduction Conditions

      • Source registry type: Harbor
      • Harbor project with a large number of repositories (e.g. ≥10,000)
      • Harbor registry experiencing elevated response latency (an average of ≥18 s per request pushes a 10,000-repo discovery past the 30-minute limit; proportionally lower latency suffices for larger projects)
      • ORG_MIRROR_INTERVAL is shorter than the actual discovery time

      Expected Behavior

      Discovery of a large Harbor project should complete successfully regardless of project size, or fail with a clear operator-visible error after a reasonable number of attempts.

      Actual Behavior

      Discovery exceeds its 30-minute claim window, the claim expires, the config is reset to NEVER_RUN, and the next worker run restarts discovery from scratch. This loop repeats indefinitely under sustained Harbor load.

      Additional Context

      • The 30-minute MAX_DISCOVERY_DURATION is a hard-coded constant with no configuration override.
      • The default HTTP request timeout is 30 seconds per page, also not configurable at the per-adapter level.
      • The Harbor adapter has no resumption capability: every discovery run starts from page 1.
      • The worst case scales linearly with project size: a 100,000-repo project requires 1,000 paginated requests, so an average per-request latency of only 1.8 s is enough to exceed the 30-minute limit.
      • Related: PROJQUAY-10877 (Harbor pagination may stop at first page if Link:next header is absent)
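
To illustrate the missing resumption capability noted above, a hypothetical checkpointed loop (no such persisted field exists in Quay today; all names here are invented for illustration):

```python
def discover_with_checkpoint(fetch_page, state, total_pages):
    """Resume from the last completed page recorded in `state` instead of
    restarting at page 1 after a claim expiry."""
    for page in range(state.get("last_page", 0) + 1, total_pages + 1):
        fetch_page(page)
        state["last_page"] = page  # persist progress after each page

state = {"last_page": 60}  # a previous run got through 60 of 100 pages
calls = []
discover_with_checkpoint(calls.append, state, total_pages=100)
```

With a checkpoint, an expired claim would cost only the in-flight page rather than all prior progress.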

              Assignee: Unassigned
              Reporter: lzha1981 (luffy zhang)