Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: quay-v3.17.0
Component/s: quay
Labels:
- quay-3.17-qe-bugs
- triaged

Blocked:
False
Ready:
False

Missing Full Organization Mirror Sync Lifecycle Metrics

Problem Statement

The organization mirror Prometheus metrics implementation (PROJQUAY-10048) lacks metrics for the full organization sync lifecycle (the discovery phase plus all repository sync operations). The current metrics track only individual operations (discovery and per-repository syncs), so end-to-end org mirror performance cannot be measured directly.

Current State

Implemented (Individual Operation Metrics):

quay_org_mirror_repo_sync_total{status} - Tracks individual repository sync attempts
quay_org_mirror_repo_sync_duration_seconds - Tracks duration of individual repository syncs
quay_org_mirror_discovery_total{status} - Tracks discovery phase only
quay_org_mirror_discovery_duration_seconds - Tracks discovery duration only

Missing (Full Organization Lifecycle Metrics):

No metric for full org mirror sync attempts (discovery + all repo syncs combined)
No metric for total org mirror sync duration (end-to-end time)

Impact

Cannot Answer Critical Operational Questions:

"How long does it take to fully sync an organization with 1000 repositories?"
"What is the success rate of full organization mirror operations?"
"What is the P95 latency for complete org mirror sync?"
"Are full org syncs getting slower over time?"

Operational Gaps:

No SLO/SLA monitoring for full org mirror lifecycle
Cannot set alerting thresholds for org sync duration
Difficult to capacity plan for large organization migrations
No visibility into end-to-end mirror performance

Root Cause Analysis

File: workers/repomirrorworker/__init__.py

Current Implementation:

# Lines 102-111: Only individual repository sync metrics
org_mirror_repo_sync_total = Counter(
    "quay_org_mirror_repo_sync_total",
    "total number of org-level mirror repository sync operations",  # ← Individual repos
    labelnames=["status"],
)

org_mirror_repo_sync_duration_seconds = Histogram(
    "quay_org_mirror_repo_sync_duration_seconds",
    "duration of org-level mirror repository sync operations in seconds",  # ← Individual repos
)

What JIRA PROJQUAY-10048 Originally Specified:

# Full organization sync metrics (NOT implemented)
org_mirror_sync_total = Counter(
    "quay_org_mirror_sync_total",
    "Total number of organization mirror sync attempts",  # ← Full org sync
    ["status"],
)

org_mirror_sync_duration_seconds = Histogram(
    "quay_org_mirror_sync_duration_seconds",
    "Time taken to complete full sync for an organization mirror",  # ← Full org sync
    buckets=[10, 30, 60, 120, 300, 600, 1800, 3600],  # 10s to 1hr
)

Proposed Solution

Add Missing Metrics

File: workers/repomirrorworker/__init__.py

# Add after line 111

# Full organization mirror sync lifecycle metrics
org_mirror_sync_total = Counter(
    "quay_org_mirror_sync_total",
    "total number of full organization mirror sync operations (discovery + all repo syncs)",
    labelnames=["status"],  # success, fail, cancel
)

org_mirror_sync_duration_seconds = Histogram(
    "quay_org_mirror_sync_duration_seconds",
    "duration of full organization mirror sync operations in seconds (discovery + all repo syncs)",
    buckets=[10, 30, 60, 120, 300, 600, 1800, 3600, 7200],  # 10s to 2hr
)

Instrumentation Points

Discovery Phase Start: Track when org mirror sync begins

# In perform_org_mirror_discovery() - line ~900
sync_start_time = time.monotonic()

All Repos Synced: Track when all discovered repos finish syncing

# After all discovered repos processed
total_duration = time.monotonic() - sync_start_time
org_mirror_sync_duration_seconds.observe(total_duration)

if all_repos_success:
    org_mirror_sync_total.labels(status="success").inc()
elif any_repo_failed:
    org_mirror_sync_total.labels(status="fail").inc()
else:
    org_mirror_sync_total.labels(status="cancel").inc()
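
The two instrumentation points above can be tied together so that the duration and the status are recorded exactly once, even if the sync raises. The following is a minimal sketch, not the worker's actual code: `track_org_sync`, `OrgSyncOutcome`, and the `observe`/`count` callbacks are hypothetical names introduced for illustration. It also assumes the whole run (discovery plus repo syncs) is timed inside one worker process; if repository syncs are picked up by other workers, `time.monotonic()` values are not comparable across them and a persisted start timestamp would be needed instead.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


class OrgSyncOutcome:
    """Hypothetical holder so the caller can set the final status inside the block."""

    def __init__(self) -> None:
        # Pessimistic default: if the block raises before setting a status,
        # the run is counted as a failure.
        self.status = "fail"


@contextmanager
def track_org_sync(
    observe: Callable[[float], None],
    count: Callable[[str], None],
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[OrgSyncOutcome]:
    """Time one full org sync and report duration and status exactly once."""
    start = clock()
    outcome = OrgSyncOutcome()
    try:
        yield outcome
    finally:
        # Runs on normal exit and on exceptions; exceptions still propagate.
        observe(clock() - start)
        count(outcome.status)
```

With the proposed metrics, `observe` could be `org_mirror_sync_duration_seconds.observe` and `count` could be `lambda status: org_mirror_sync_total.labels(status=status).inc()`; the body of the `with` block would then set `outcome.status` once all discovered repositories have been processed.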

Metric Semantics

Success Criteria for Full Org Sync:

Discovery phase succeeds
All discovered repositories sync successfully (or no repos discovered)
No critical errors during processing

Failure Criteria:

Discovery phase fails
Any repository sync fails (even if others succeed)
Worker preempted or cancelled (counted under status="cancel" rather than status="fail", per the proposed label set)

Duration Measurement:

Start: When org mirror config is claimed for discovery
End: When last repository sync completes (or discovery fails)
Includes: Discovery time + all repository sync times + queue waiting time
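
The success and failure criteria above can be sketched as a single pure function, which keeps the status decision easy to unit test. The function name `classify_org_sync` and its parameters are hypothetical. Letting "fail" take precedence over "cancel" mirrors the `if`/`elif`/`else` ordering in the instrumentation snippet, and mapping a preempted or cancelled worker to "cancel" is an assumption based on the proposed label values.

```python
from typing import Iterable


def classify_org_sync(
    discovery_ok: bool,
    repo_results: Iterable[bool],
    cancelled: bool = False,
) -> str:
    """Return the status label for one full org sync.

    repo_results holds one boolean per discovered repository
    (True means that repository synced successfully).
    """
    results = list(repo_results)
    # Discovery failure or any failed repository marks the whole run as failed.
    if not discovery_ok or not all(results):
        return "fail"
    # No failures were seen, but the run did not finish on its own.
    if cancelled:
        return "cancel"
    # all([]) is True, so "no repos discovered" counts as success.
    return "success"
```

For example, `classify_org_sync(True, [True, False])` yields "fail" even though one repository succeeded, which matches the "any repository sync fails" rule.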

Acceptance Criteria

[ ] quay_org_mirror_sync_total{status} Counter implemented
[ ] quay_org_mirror_sync_duration_seconds Histogram implemented
[ ] Metrics track full org mirror lifecycle (discovery + all repo syncs)
[ ] Success status requires all repos synced successfully
[ ] Failure status set if discovery fails OR any repo sync fails
[ ] Duration includes discovery + all repo sync operations
[ ] Metrics follow Quay naming conventions (quay_* prefix)
[ ] Metrics exported via /metrics endpoint
[ ] Unit tests validate metric instrumentation
[ ] Documentation updated with example Prometheus queries

Example Prometheus Queries (After Implementation)

# Full org sync success ratio (both sides aggregated so their label sets match)
sum(rate(quay_org_mirror_sync_total{status="success"}[5m]))
  / sum(rate(quay_org_mirror_sync_total[5m]))

# P95 full org sync duration (aggregated across worker instances)
histogram_quantile(0.95,
  sum by (le) (rate(quay_org_mirror_sync_duration_seconds_bucket[1h]))
)

# Average full org sync time over last hour
rate(quay_org_mirror_sync_duration_seconds_sum[1h])
  / rate(quay_org_mirror_sync_duration_seconds_count[1h])

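# Alert: P95 org sync duration above 30 minutes
# (a histogram exposes only _bucket, _sum and _count series, not a bare value)
histogram_quantile(0.95,
  sum by (le) (rate(quay_org_mirror_sync_duration_seconds_bucket[1h]))
) > 1800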
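
The quantile queries above are only as precise as the histogram buckets allow. The sketch below is a simplified re-implementation of the estimate that `histogram_quantile` performs on classic histograms (linear interpolation inside the bucket that contains the target rank), written to show how the proposed bucket boundaries shape the result. The function name `estimate_quantile` and the sample counts are made up for illustration.

```python
import math
from typing import Sequence


def estimate_quantile(
    q: float,
    upper_bounds: Sequence[float],
    cumulative_counts: Sequence[float],
) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    upper_bounds: finite bucket bounds in ascending order.
    cumulative_counts: one count per finite bucket plus a final entry
    for the +Inf bucket (which equals the total number of observations).
    """
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    i = next(idx for idx, c in enumerate(cumulative_counts) if c >= rank)
    if i == len(upper_bounds):
        # Rank falls in the +Inf bucket: the highest finite bound is returned.
        return float(upper_bounds[-1])
    lower = upper_bounds[i - 1] if i > 0 else 0.0
    prev_count = cumulative_counts[i - 1] if i > 0 else 0.0
    in_bucket = cumulative_counts[i] - prev_count
    if in_bucket == 0:
        return float(lower)
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper_bounds[i] - lower) * (rank - prev_count) / in_bucket


BOUNDS = [10, 30, 60, 120, 300, 600, 1800, 3600, 7200]

# Made-up data: 100 syncs, 90 finished within 30 min, all within 1 hr.
p95 = estimate_quantile(0.95, BOUNDS, [0, 0, 5, 20, 50, 80, 90, 100, 100, 100])

# Made-up data: 10 of 100 syncs ran longer than the largest bucket (2 hr).
p95_clamped = estimate_quantile(0.95, BOUNDS, [0, 0, 0, 0, 0, 0, 0, 0, 90, 100])
```

In the first case the estimate lands halfway through the 1800 to 3600 second bucket (2700 seconds). In the second case it is capped at 7200 seconds, so syncs that regularly exceed the largest bucket would make the P95 look better than it is; that is one reason to extend the buckets to 2 hours or beyond for large organizations.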

Implementation Notes

Relationship to Existing Metrics:

org_mirror_sync_total: incremented once per organization mirror run
org_mirror_repo_sync_total: incremented N times per organization mirror run (where N = discovered repos)
Both metrics are needed because they report at different levels of granularity

Performance Impact:

Minimal overhead (2 additional metric updates per org sync)
No database calls for metric updates
Standard prometheus_client operations (.inc(), .observe())

Related Issues

Parent: PROJQUAY-10048 - [Metrics] Prometheus Instrumentation (Closed)
Related: PROJQUAY-10040 - Organization-level repository mirroring

Technical Details

Affected Files:

workers/repomirrorworker/__init__.py - Metric definitions (after line 111)
workers/repomirrorworker/__init__.py - Instrumentation in perform_org_mirror_discovery() and repo sync loop

Current Workaround:
Operators can derive approximate org sync duration by querying:

# Approximate (but inaccurate): total seconds spent in discovery and repo syncs over the last hour
sum(increase(quay_org_mirror_discovery_duration_seconds_sum[1h]))
  + sum(increase(quay_org_mirror_repo_sync_duration_seconds_sum[1h]))

This workaround:

Does not account for queue waiting time
Cannot distinguish concurrent org syncs
Requires complex PromQL queries
Does not provide success/failure tracking

Discovery

Discovered During: Comprehensive metrics compliance review against JIRA PROJQUAY-10048
Date: 2026-03-05
Analysis Document: /tmp/org-mirror-metrics-implementation-status.md

Priority Justification

High Priority Because:

Blocks SLO/SLA monitoring for org mirror feature
Required for production readiness of large-scale migrations
Originally specified in PROJQUAY-10048 but not implemented
Critical for capacity planning and performance troubleshooting
Affects operational excellence for enterprise customers

Risk if Not Fixed:

Cannot set performance alerts for org mirror operations
Difficult to diagnose slow org sync issues
Poor visibility into migration progress for large organizations
Cannot establish performance baselines or detect regressions
