PROJQUAY-10836

Quay 3.17 Metrics Missing full organization mirror sync lifecycle metrics

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • quay-v3.17.0
    • quay

      Missing Full Organization Mirror Sync Lifecycle Metrics

      Problem Statement

      The organization mirror Prometheus metrics implementation (PROJQUAY-10048) has no metrics that track the full organization sync lifecycle (discovery phase + all repository sync operations). The current metrics track only individual discovery and repository sync operations, so end-to-end org mirror performance cannot be measured.

      Current State

      Implemented (Individual Operation Metrics):

      • quay_org_mirror_repo_sync_total{status} - Tracks individual repository sync attempts
      • quay_org_mirror_repo_sync_duration_seconds - Tracks duration of individual repository syncs
      • quay_org_mirror_discovery_total{status} - Tracks discovery phase only
      • quay_org_mirror_discovery_duration_seconds - Tracks discovery duration only

      Missing (Full Organization Lifecycle Metrics):

      • No metric for full org mirror sync attempts (discovery + all repo syncs combined)
      • No metric for total org mirror sync duration (end-to-end time)

      Impact

      Cannot Answer Critical Operational Questions:

      • "How long does it take to fully sync an organization with 1000 repositories?"
      • "What is the success rate of full organization mirror operations?"
      • "What is the P95 latency for complete org mirror sync?"
      • "Are full org syncs getting slower over time?"

      Operational Gaps:

      • No SLO/SLA monitoring for full org mirror lifecycle
      • Cannot set alerting thresholds for org sync duration
      • Difficult to capacity plan for large organization migrations
      • No visibility into end-to-end mirror performance

      Root Cause Analysis

      File: workers/repomirrorworker/__init__.py

      Current Implementation:

      # Lines 102-111: Only individual repository sync metrics
      org_mirror_repo_sync_total = Counter(
          "quay_org_mirror_repo_sync_total",
          "total number of org-level mirror repository sync operations",  # ← Individual repos
          labelnames=["status"],
      )
      
      org_mirror_repo_sync_duration_seconds = Histogram(
          "quay_org_mirror_repo_sync_duration_seconds",
          "duration of org-level mirror repository sync operations in seconds",  # ← Individual repos
      )
      

      What JIRA PROJQUAY-10048 Originally Specified:

      # Full organization sync metrics (NOT implemented)
      org_mirror_sync_total = Counter(
          "quay_org_mirror_sync_total",
          "Total number of organization mirror sync attempts",  # ← Full org sync
          ["status"],
      )
      
      org_mirror_sync_duration_seconds = Histogram(
          "quay_org_mirror_sync_duration_seconds",
          "Time taken to complete full sync for an organization mirror",  # ← Full org sync
          buckets=[10, 30, 60, 120, 300, 600, 1800, 3600],  # 10s to 1hr
      )
      

      Proposed Solution

      Add Missing Metrics

      File: workers/repomirrorworker/__init__.py

      # Add after line 111
      
      # Full organization mirror sync lifecycle metrics
      org_mirror_sync_total = Counter(
          "quay_org_mirror_sync_total",
          "total number of full organization mirror sync operations (discovery + all repo syncs)",
          labelnames=["status"],  # success, fail, cancel
      )
      
      org_mirror_sync_duration_seconds = Histogram(
          "quay_org_mirror_sync_duration_seconds",
          "duration of full organization mirror sync operations in seconds (discovery + all repo syncs)",
          buckets=[10, 30, 60, 120, 300, 600, 1800, 3600, 7200],  # 10s to 2hr
      )
      

      Instrumentation Points

      Discovery Phase Start: Track when org mirror sync begins

      # In perform_org_mirror_discovery() - line ~900
      sync_start_time = time.monotonic()
      

      All Repos Synced: Track when all discovered repos finish syncing

      # After all discovered repos processed
      total_duration = time.monotonic() - sync_start_time
      org_mirror_sync_duration_seconds.observe(total_duration)
      
      if all_repos_success:
          org_mirror_sync_total.labels(status="success").inc()
      elif any_repo_failed:
          org_mirror_sync_total.labels(status="fail").inc()
      else:
          org_mirror_sync_total.labels(status="cancel").inc()
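
      The two fragments above can be combined into one helper so that the duration is observed and the counter incremented exactly once per run, even when an exception interrupts the sync. The following is a minimal sketch, not the worker's actual code: track_org_sync, OrgSyncOutcome, and the observe_duration / count_status callables are hypothetical names. It also assumes the whole lifecycle runs inside a single process; time.monotonic() values are not comparable across processes, so if repository syncs are picked up by other workers, the start time would have to be persisted as a wall-clock timestamp instead.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass
class OrgSyncOutcome:
    """Mutable result holder filled in by the code inside the with-block."""

    # Assumption: a run that never reports a result was cut short.
    status: str = "cancel"


@contextmanager
def track_org_sync(
    observe_duration: Callable[[float], None],
    count_status: Callable[[str], None],
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[OrgSyncOutcome]:
    """Time one full org mirror run and report it exactly once."""
    outcome = OrgSyncOutcome()
    started = clock()
    try:
        yield outcome
    except Exception:
        # Any unhandled error counts as a failed run; re-raise for the caller.
        outcome.status = "fail"
        raise
    finally:
        # Runs on success, failure, and interruption alike.
        observe_duration(clock() - started)
        count_status(outcome.status)
```

      Wired to the proposed metrics, the two callables would be org_mirror_sync_duration_seconds.observe and lambda status: org_mirror_sync_total.labels(status=status).inc(). The body of the with-block would run discovery and the repository syncs, then set outcome.status once the last repository finishes.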
      

      Metric Semantics

      Success Criteria for Full Org Sync:

      • Discovery phase succeeds
      • All discovered repositories sync successfully (or no repos discovered)
      • No critical errors during processing

      Failure Criteria:

      • Discovery phase fails
      • Any repository sync fails (even if others succeed)
      • Worker preempted or cancelled (recorded with status="cancel" in the proposed instrumentation)

      Duration Measurement:

      • Start: When org mirror config is claimed for discovery
      • End: When last repository sync completes (or discovery fails)
      • Includes: Discovery time + all repository sync times + queue waiting time
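
      The criteria above can be read as a small decision function. The sketch below is one possible interpretation, not the ticket's implementation: classify_org_sync and its parameters are hypothetical. It follows the precedence of the instrumentation snippet (a failed repository wins over a cancellation) and assumes that a cancelled or unfinished run is reported as "cancel" rather than "fail".

```python
from typing import Iterable, Optional


def classify_org_sync(
    discovery_ok: bool,
    repo_results: Iterable[Optional[bool]],
    cancelled: bool = False,
) -> str:
    """Return the status label for one full org mirror run.

    repo_results holds one entry per discovered repository:
    True = synced, False = failed, None = never finished.
    """
    results = list(repo_results)
    if not discovery_ok:
        return "fail"
    if any(r is False for r in results):
        # A single failed repository fails the whole run.
        return "fail"
    if cancelled or any(r is None for r in results):
        # Assumption: unfinished work is reported as cancelled, not failed.
        return "cancel"
    # Also covers the "no repos discovered" case.
    return "success"
```

      Keeping the decision in one pure function makes the semantics easy to unit test independently of the worker and of Prometheus.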

      Acceptance Criteria

      • [ ] quay_org_mirror_sync_total{status} Counter implemented
      • [ ] quay_org_mirror_sync_duration_seconds Histogram implemented
      • [ ] Metrics track full org mirror lifecycle (discovery + all repo syncs)
      • [ ] Success status requires all repos synced successfully
      • [ ] Failure status set if discovery fails OR any repo sync fails
      • [ ] Duration includes discovery + all repo sync operations
      • [ ] Metrics follow Quay naming conventions (quay_* prefix)
      • [ ] Metrics exported via /metrics endpoint
      • [ ] Unit tests validate metric instrumentation
      • [ ] Documentation updated with example Prometheus queries

      Example Prometheus Queries (After Implementation)

      # Full org sync success ratio (aggregate over status labels and instances before dividing)
      sum(rate(quay_org_mirror_sync_total{status="success"}[5m]))
        / sum(rate(quay_org_mirror_sync_total[5m]))

      # P95 full org sync duration
      histogram_quantile(0.95,
        sum by (le) (rate(quay_org_mirror_sync_duration_seconds_bucket[1h]))
      )

      # Average full org sync time over last hour
      sum(rate(quay_org_mirror_sync_duration_seconds_sum[1h]))
        / sum(rate(quay_org_mirror_sync_duration_seconds_count[1h]))

      # Alert: P95 org sync duration longer than 30 minutes
      # (a histogram exposes only _bucket, _sum and _count series)
      histogram_quantile(0.95,
        sum by (le) (rate(quay_org_mirror_sync_duration_seconds_bucket[1h]))
      ) > 1800
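
      When interpreting the P95 query, it helps to recall that histogram_quantile never sees individual observations: it interpolates linearly inside the bucket that contains the requested rank, so the result is only as precise as the bucket boundaries. The sketch below mimics that estimate for classic (non-native) histograms as a simplified illustration. estimate_quantile is a hypothetical helper and the sample counts are made up; the bucket bounds are the ones proposed in this ticket.

```python
import math
from typing import Sequence, Tuple


def estimate_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) pairs.

    buckets must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the overflow bucket: report the last finite bound.
                return prev_bound
            in_bucket = count - prev_count
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up cumulative counts for 100 org syncs, using the proposed buckets.
SAMPLE = [
    (10, 0), (30, 5), (60, 20), (120, 40), (300, 60),
    (600, 80), (1800, 90), (3600, 100), (7200, 100), (math.inf, 100),
]

p50 = estimate_quantile(0.5, SAMPLE)
p95 = estimate_quantile(0.95, SAMPLE)
```

      With these made-up counts the P95 rank falls in the 1800 to 3600 second bucket, so any true value between 30 minutes and 1 hour yields an interpolated estimate. Bucket boundaries should therefore bracket the thresholds that alerts will compare against.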
      

      Implementation Notes

      Relationship to Existing Metrics:

      • org_mirror_sync_total = 1 per organization mirror run
      • org_mirror_repo_sync_total = N per organization mirror run (where N = discovered repos)
      • Both metrics are necessary for different granularity levels

      Performance Impact:

      • Minimal overhead (2 additional metric updates per org sync)
      • No database calls for metric updates
      • Standard prometheus_client operations (.inc(), .observe())

      Related Issues

      • Parent: PROJQUAY-10048 - [Metrics] Prometheus Instrumentation (Closed)
      • Related: PROJQUAY-10040 - Organization-level repository mirroring

      Technical Details

      Affected Files:

      • workers/repomirrorworker/__init__.py - Metric definitions (after line 111)
      • workers/repomirrorworker/__init__.py - Instrumentation in perform_org_mirror_discovery() and repo sync loop

      Current Workaround:
      Operators can derive approximate org sync duration by querying:

      # Approximate (but inaccurate) time spent on org mirroring over the last hour
      sum(increase(quay_org_mirror_discovery_duration_seconds_sum[1h]))
        + sum(increase(quay_org_mirror_repo_sync_duration_seconds_sum[1h]))
      

      This workaround:

      • Does not account for queue waiting time
      • Cannot distinguish concurrent org syncs
      • Requires complex PromQL queries
      • Does not provide success/failure tracking

      Discovery

      Discovered During: Comprehensive metrics compliance review against JIRA PROJQUAY-10048
      Date: 2026-03-05
      Analysis Document: /tmp/org-mirror-metrics-implementation-status.md

      Priority Justification

      High Priority Because:

      • Blocks SLO/SLA monitoring for org mirror feature
      • Required for production readiness of large-scale migrations
      • Originally specified in PROJQUAY-10048 but not implemented
      • Critical for capacity planning and performance troubleshooting
      • Affects operational excellence for enterprise customers

      Risk if Not Fixed:

      • Cannot set performance alerts for org mirror operations
      • Difficult to diagnose slow org sync issues
      • Poor visibility into migration progress for large organizations
      • Cannot establish performance baselines or detect regressions

              Assignee: Unassigned
              Reporter: luffy zhang (lzha1981)