Bug
Resolution: Unresolved
Major
quay-v3.17.0
Missing Full Organization Mirror Sync Lifecycle Metrics
Problem Statement
The organization mirror Prometheus metrics implementation (PROJQUAY-10048) is missing metrics to track full organization sync lifecycle (discovery phase + all repository sync operations). Current metrics only track individual repository operations, making it impossible to measure end-to-end org mirror performance.
Current State
Implemented (Individual Operation Metrics):
- quay_org_mirror_repo_sync_total{status} - Tracks individual repository sync attempts
- quay_org_mirror_repo_sync_duration_seconds - Tracks duration of individual repository syncs
- quay_org_mirror_discovery_total{status} - Tracks discovery phase only
- quay_org_mirror_discovery_duration_seconds - Tracks discovery duration only
Missing (Full Organization Lifecycle Metrics):
- No metric for full org mirror sync attempts (discovery + all repo syncs combined)
- No metric for total org mirror sync duration (end-to-end time)
Impact
Cannot Answer Critical Operational Questions:
- "How long does it take to fully sync an organization with 1000 repositories?"
- "What is the success rate of full organization mirror operations?"
- "What is the P95 latency for complete org mirror sync?"
- "Are full org syncs getting slower over time?"
Operational Gaps:
- No SLO/SLA monitoring for full org mirror lifecycle
- Cannot set alerting thresholds for org sync duration
- Difficult to capacity plan for large organization migrations
- No visibility into end-to-end mirror performance
Root Cause Analysis
File: workers/repomirrorworker/__init__.py
Current Implementation:
# Lines 102-111: Only individual repository sync metrics
org_mirror_repo_sync_total = Counter(
    "quay_org_mirror_repo_sync_total",
    "total number of org-level mirror repository sync operations",  # ← Individual repos
    labelnames=["status"],
)
org_mirror_repo_sync_duration_seconds = Histogram(
    "quay_org_mirror_repo_sync_duration_seconds",
    "duration of org-level mirror repository sync operations in seconds",  # ← Individual repos
)
What JIRA PROJQUAY-10048 Originally Specified:
# Full organization sync metrics (NOT implemented)
org_mirror_sync_total = Counter(
    "quay_org_mirror_sync_total",
    "Total number of organization mirror sync attempts",  # ← Full org sync
    ["status"],
)
org_mirror_sync_duration_seconds = Histogram(
    "quay_org_mirror_sync_duration_seconds",
    "Time taken to complete full sync for an organization mirror",  # ← Full org sync
    buckets=[10, 30, 60, 120, 300, 600, 1800, 3600],  # 10s to 1hr
)
Proposed Solution
Add Missing Metrics
File: workers/repomirrorworker/__init__.py
# Add after line 111
# Full organization mirror sync lifecycle metrics
org_mirror_sync_total = Counter(
    "quay_org_mirror_sync_total",
    "total number of full organization mirror sync operations (discovery + all repo syncs)",
    labelnames=["status"],  # success, fail, cancel
)
org_mirror_sync_duration_seconds = Histogram(
    "quay_org_mirror_sync_duration_seconds",
    "duration of full organization mirror sync operations in seconds (discovery + all repo syncs)",
    buckets=[10, 30, 60, 120, 300, 600, 1800, 3600, 7200],  # 10s to 2hr
)
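To illustrate why the proposed bucket list extends to 7200 seconds, the sketch below mimics how a Prometheus histogram counts observations into cumulative `le` buckets. It uses only the standard library and does not call prometheus_client; the helper name `cumulative_bucket_counts` and the sample durations are hypothetical, not measured data.

```python
import math

# Bucket upper bounds copied from the two definitions in this issue.
ORIGINAL_BUCKETS = [10, 30, 60, 120, 300, 600, 1800, 3600]
PROPOSED_BUCKETS = [10, 30, 60, 120, 300, 600, 1800, 3600, 7200]


def cumulative_bucket_counts(durations, buckets):
    """Return {upper_bound: number of durations <= upper_bound}, including +Inf."""
    bounds = sorted(buckets) + [math.inf]
    return {bound: sum(1 for d in durations if d <= bound) for bound in bounds}


# Hypothetical full org sync durations in seconds (illustrative only).
sample_durations = [45, 900, 2400, 5400, 6600]

original = cumulative_bucket_counts(sample_durations, ORIGINAL_BUCKETS)
proposed = cumulative_bucket_counts(sample_durations, PROPOSED_BUCKETS)

# Observations that only the implicit +Inf bucket accounts for.
beyond_original = original[math.inf] - original[3600]
beyond_proposed = proposed[math.inf] - proposed[7200]
print(beyond_original, beyond_proposed)
```

With the one-hour ceiling, every sync longer than an hour lands only in the +Inf bucket, so quantile estimates cannot report values above 3600 seconds. The extra 7200 bucket keeps syncs of up to two hours inside a finite bucket.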
Instrumentation Points
Discovery Phase Start: Track when org mirror sync begins
# In perform_org_mirror_discovery() - line ~900
sync_start_time = time.monotonic()
All Repos Synced: Track when all discovered repos finish syncing
# After all discovered repos processed
total_duration = time.monotonic() - sync_start_time
org_mirror_sync_duration_seconds.observe(total_duration)
if all_repos_success:
    org_mirror_sync_total.labels(status="success").inc()
elif any_repo_failed:
    org_mirror_sync_total.labels(status="fail").inc()
else:
    org_mirror_sync_total.labels(status="cancel").inc()
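The two instrumentation points above could be wrapped in a single helper so that the duration is observed and the counter incremented exactly once, even when an exception escapes. The following is a minimal sketch, not the worker's actual code: `track_org_sync`, `OrgSyncOutcome`, and the fake metric classes are hypothetical names, and it assumes the whole org sync runs inside one call in one process. The fakes stand in for the proposed `org_mirror_sync_total` and `org_mirror_sync_duration_seconds` objects and only imitate their `labels(...).inc()` and `observe(...)` methods.

```python
import time
from contextlib import contextmanager


class OrgSyncOutcome:
    """Mutable holder the caller updates as the sync progresses."""

    def __init__(self):
        self.status = "fail"  # pessimistic default until the caller reports success


@contextmanager
def track_org_sync(counter, histogram, clock=time.monotonic):
    """Observe one duration and increment one status counter per org sync."""
    outcome = OrgSyncOutcome()
    start = clock()
    try:
        yield outcome
    except Exception:
        outcome.status = "fail"  # any escaping exception counts as a failure
        raise
    finally:
        histogram.observe(clock() - start)
        counter.labels(status=outcome.status).inc()


class _FakeChild:
    """Stand-in for the labelled child returned by Counter.labels()."""

    def __init__(self, counts, status):
        self._counts, self._status = counts, status

    def inc(self):
        self._counts[self._status] = self._counts.get(self._status, 0) + 1


class FakeCounter:
    def __init__(self):
        self.counts = {}

    def labels(self, status):
        return _FakeChild(self.counts, status)


class FakeHistogram:
    def __init__(self):
        self.observations = []

    def observe(self, value):
        self.observations.append(value)


counter, histogram = FakeCounter(), FakeHistogram()
with track_org_sync(counter, histogram) as outcome:
    # Discovery and all repository syncs would run here.
    outcome.status = "success"
print(counter.counts, len(histogram.observations))
```

If the sync is spread across several worker iterations or queue items, the start time would need to be persisted instead of held in a local variable, which this sketch does not attempt.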
Metric Semantics
Success Criteria for Full Org Sync:
- Discovery phase succeeds
- All discovered repositories sync successfully (or no repos discovered)
- No critical errors during processing
Failure Criteria:
- Discovery phase fails
- Any repository sync fails (even if others succeed)
- Worker preempted or cancelled (recorded with status="cancel" rather than status="fail")
Duration Measurement:
- Start: When org mirror config is claimed for discovery
- End: When last repository sync completes (or discovery fails)
- Includes: Discovery time + all repository sync times + queue waiting time
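The success and failure criteria above can be expressed as a small pure function, which also makes them easy to unit test. This is a sketch under stated assumptions: `classify_org_sync` and its parameters are hypothetical names, and it assumes a cancelled or preempted run is reported with the `cancel` label and takes precedence over any other outcome.

```python
def classify_org_sync(discovery_ok, repo_results, cancelled=False):
    """Map the outcome of one full org sync to a status label.

    discovery_ok: True if the discovery phase succeeded.
    repo_results: iterable of booleans, one per discovered repository.
    cancelled:    True if the worker was preempted or cancelled (assumed
                  to override every other outcome).
    """
    if cancelled:
        return "cancel"
    if not discovery_ok:
        return "fail"
    # all([]) is True, so "no repos discovered" counts as success.
    if all(repo_results):
        return "success"
    return "fail"


print(classify_org_sync(True, [True, True]))
print(classify_org_sync(True, [True, False]))
print(classify_org_sync(True, [], cancelled=True))
```

Keeping the decision in one function means the counter is incremented with exactly one label per run, which matches the "1 per organization mirror run" relationship the issue expects.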
Acceptance Criteria
- [ ] quay_org_mirror_sync_total{status} Counter implemented
- [ ] quay_org_mirror_sync_duration_seconds Histogram implemented
- [ ] Metrics track full org mirror lifecycle (discovery + all repo syncs)
- [ ] Success status requires all repos synced successfully
- [ ] Failure status set if discovery fails OR any repo sync fails
- [ ] Duration includes discovery + all repo sync operations
- [ ] Metrics follow Quay naming conventions (quay_* prefix)
- [ ] Metrics exported via /metrics endpoint
- [ ] Unit tests validate metric instrumentation
- [ ] Documentation updated with example Prometheus queries
Example Prometheus Queries (After Implementation)
# Full org sync success rate
sum(rate(quay_org_mirror_sync_total{status="success"}[5m]))
/ sum(rate(quay_org_mirror_sync_total[5m]))
# P95 full org sync duration
histogram_quantile(0.95,
rate(quay_org_mirror_sync_duration_seconds_bucket[1h])
)
# Average full org sync time over last hour
rate(quay_org_mirror_sync_duration_seconds_sum[1h])
/ rate(quay_org_mirror_sync_duration_seconds_count[1h])
# Alert: P95 org sync duration over the last hour exceeds 30 minutes
histogram_quantile(0.95, sum by (le) (rate(quay_org_mirror_sync_duration_seconds_bucket[1h]))) > 1800
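Because the P95 query is an estimate derived from bucket counts, it helps to see how that estimate is formed. The sketch below approximates the linear interpolation that `histogram_quantile()` applies within a bucket; it is a simplified illustration, not Prometheus's implementation, and `estimate_quantile` and the bucket counts are hypothetical.

```python
import math


def estimate_quantile(q, cumulative_buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The pair with the largest bound must be the +Inf bucket, whose count
    is the total number of observations.
    """
    buckets = sorted(cumulative_buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                # Beyond the largest finite bucket (or an empty bucket):
                # fall back to the previous upper bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for 20 full org syncs.
buckets = [(600, 10), (1800, 16), (3600, 19), (7200, 20), (math.inf, 20)]
median = estimate_quantile(0.5, buckets)
p95 = estimate_quantile(0.95, buckets)
print(median, p95)
```

The estimate can never be finer than the bucket boundaries, so alert thresholds such as 1800 seconds work best when they coincide with a bucket bound, as they do in the proposed list.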
Implementation Notes
Relationship to Existing Metrics:
- org_mirror_sync_total is incremented once per organization mirror run
- org_mirror_repo_sync_total is incremented N times per organization mirror run (where N = discovered repos)
- Both metrics are necessary for different granularity levels
Performance Impact:
- Minimal overhead (2 additional metric updates per org sync)
- No database calls for metric updates
- Standard prometheus_client operations (.inc(), .observe())
Related Issues
- Parent: PROJQUAY-10048 - [Metrics] Prometheus Instrumentation (Closed)
- Related: PROJQUAY-10040 - Organization-level repository mirroring
Technical Details
Affected Files:
- workers/repomirrorworker/__init__.py - Metric definitions (after line 111)
- workers/repomirrorworker/__init__.py - Instrumentation in perform_org_mirror_discovery() and repo sync loop
Current Workaround:
Operators can derive approximate org sync duration by querying:
# Approximate (but inaccurate): total seconds spent in discovery plus repo syncs over the last hour
sum(increase(quay_org_mirror_discovery_duration_seconds_sum[1h]))
+ sum(increase(quay_org_mirror_repo_sync_duration_seconds_sum[1h]))
This workaround:
- Does not account for queue waiting time
- Cannot distinguish concurrent org syncs
- Requires complex PromQL queries
- Does not provide success/failure tracking
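The first two limitations can be shown with a small timeline calculation: summing per-operation durations over-counts when repositories sync concurrently and under-counts when they wait in a queue. The sketch below uses hypothetical timestamps and a hypothetical helper, `summed_vs_wall_clock`; it is not derived from real Quay data.

```python
def summed_vs_wall_clock(discovery, repo_syncs):
    """Compare summed durations with end-to-end elapsed time.

    discovery and each entry of repo_syncs are (start, end) timestamps
    in seconds on a shared clock.
    """
    intervals = [discovery] + list(repo_syncs)
    summed = sum(end - start for start, end in intervals)
    wall_clock = max(end for _, end in intervals) - min(start for start, _ in intervals)
    return summed, wall_clock


# Two repos synced concurrently: the sum exceeds the real elapsed time.
concurrent = summed_vs_wall_clock(
    discovery=(0, 30),
    repo_syncs=[(90, 210), (90, 190)],
)

# Two repos synced one after another with queue waits: the sum falls short.
queued = summed_vs_wall_clock(
    discovery=(0, 30),
    repo_syncs=[(90, 150), (200, 260)],
)
print(concurrent, queued)
```

A dedicated end-to-end histogram avoids both errors because it measures the wall-clock span directly.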
Discovery
Discovered During: Comprehensive metrics compliance review against JIRA PROJQUAY-10048
Date: 2026-03-05
Analysis Document: /tmp/org-mirror-metrics-implementation-status.md
Priority Justification
High Priority Because:
- Blocks SLO/SLA monitoring for org mirror feature
- Required for production readiness of large-scale migrations
- Originally specified in PROJQUAY-10048 but not implemented
- Critical for capacity planning and performance troubleshooting
- Affects operational excellence for enterprise customers
Risk if Not Fixed:
- Cannot set performance alerts for org mirror operations
- Difficult to diagnose slow org sync issues
- Poor visibility into migration progress for large organizations
- Cannot establish performance baselines or detect regressions