Type: Epic
Resolution: Unresolved
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- quay-2026-Q1

Epic Name:
Repository Mirror Metrics and Health Monitoring
Activity Type:
Security & Compliance
Blocked:
False
Blocked Reason:

None
Ready:
False
Color Status:
Not Selected
Epic Status:
To Do
Git Pull Request:
https://github.com/quay/quay/pull/4399, https://github.com/quay/enhancements/pull/35/

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

PX Impact Score:

Intelligence Requested:
Market:

[Workers] Repository Mirror Metrics and Health Monitoring

PR already exists for this epic: https://github.com/quay/quay/pull/4399
Related enhancement: https://github.com/quay/enhancements/pull/35/

Overview

Implement comprehensive Prometheus metrics and health monitoring for Quay's repository mirroring system. This enhancement provides operators with granular visibility into mirror synchronization status, failure tracking, and operational health to enable proactive monitoring and alerting.

Context

Currently, the only metric available for repository mirroring is quay_repository_rows_unmirrored, which provides a count of repositories that have not yet been mirrored. This is insufficient for operators who need to:
- Monitor individual repository sync status
- Track synchronization failures for alerting
- Identify pending tags awaiting synchronization
- Verify complete synchronization per repository

The existing mirror worker exposes metrics via a push gateway on port 9091, but lacks the granular, per-repository metrics needed for effective monitoring and troubleshooting.

Reference: RFE-6452

Scope

In Scope

Four new Prometheus metrics for repository mirroring:
- Tags pending synchronization per repository
- Last synchronization status per repository
- Complete synchronization indicator per repository
- Synchronization failure counter per repository
Metrics exposed via existing Prometheus push gateway infrastructure
Per-repository labels (namespace, repository) for all metrics
Error reason labels for failure diagnostics
Documentation for metric usage and alerting examples
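
A minimal sketch of how the four in-scope metrics might be declared with prometheus_client. The metric names come from this epic. The help strings, the label names status and reason, the record_failed_sync helper, and the example values are assumptions for illustration, and a dedicated registry is used only to keep the sketch isolated.

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, generate_latest

# Dedicated registry so this sketch does not touch any global metrics.
registry = CollectorRegistry()

REPO_LABELS = ["namespace", "repository"]

pending_tags = Gauge(
    "quay_repository_mirror_pending_tags",
    "Tags awaiting synchronization for a mirrored repository",
    labelnames=REPO_LABELS,
    registry=registry,
)
last_sync_status = Gauge(
    "quay_repository_mirror_last_sync_status",
    "1 for the status reported by the most recent sync",
    # "status" is an assumed label name; last_error_reason is named in this epic.
    labelnames=REPO_LABELS + ["status", "last_error_reason"],
    registry=registry,
)
sync_complete = Gauge(
    "quay_repository_mirror_sync_complete",
    "1 when all tags are synchronized, 0 otherwise",
    labelnames=REPO_LABELS,
    registry=registry,
)
sync_failures = Counter(
    "quay_repository_mirror_sync_failures_total",
    "Synchronization failures per mirrored repository",
    # "reason" is an assumed label name for the error reason.
    labelnames=REPO_LABELS + ["reason"],
    registry=registry,
)


def record_failed_sync(namespace, repository, reason, tags_left):
    """Hypothetical helper: update all four metrics after a failed sync."""
    pending_tags.labels(namespace=namespace, repository=repository).set(tags_left)
    sync_complete.labels(namespace=namespace, repository=repository).set(0)
    last_sync_status.labels(
        namespace=namespace,
        repository=repository,
        status="failed",
        last_error_reason=reason,
    ).set(1)
    sync_failures.labels(
        namespace=namespace, repository=repository, reason=reason
    ).inc()


record_failed_sync("acme", "nginx", "auth_error", tags_left=3)
# Render the text exposition format that a scrape or push would carry.
print(generate_latest(registry).decode())
```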

Out of Scope

Health endpoint for mirror worker containers (determined not necessary - worker primarily runs Skopeo)
UI dashboard for mirror metrics (can use Grafana with these metrics)
Alerting rules (operators define their own based on metrics)
System-level health checks (DB, S3, Splunk) - tracked separately in RFE-7439
Per-repository push/pull metrics - tracked separately in RFE-7439
Pruning policy features - tracked separately in RFE-7439

Child Stories

Implement repository mirror sync status metrics: Add quay_repository_mirror_last_sync_status (Gauge with status/error labels) and quay_repository_mirror_sync_complete (binary Gauge) metrics to track synchronization outcomes. Includes unit tests for status transitions during sync success, failure, and in-progress states.
Implement repository mirror sync progress and failure metrics: Add quay_repository_mirror_pending_tags (Gauge for tags awaiting sync) and quay_repository_mirror_sync_failures_total (Counter for alerting) metrics. Includes unit tests for metric updates during sync operations.
Add documentation for repository mirror metrics: Document all new metrics with their labels and meanings. Provide example Prometheus queries and alerting rules for common monitoring scenarios (sync failures, stale mirrors, pending tags).

Dependencies

Technical:
- Existing Prometheus client library (prometheus_client) already in use
- Push gateway infrastructure on port 9091 already configured
- Current mirror worker and sync status tracking in data/model/repo_mirror.py
- Audit logging for sync events in data/logs_model
Cross-team:
- Documentation team for user-facing metric documentation
External:
- None - uses existing Prometheus infrastructure

Success Criteria

[ ] All four metrics are exposed via Prometheus push gateway
[ ] Metrics include per-repository granularity with namespace/repository labels
[ ] quay_repository_mirror_sync_failures_total Counter increments on each failure
[ ] quay_repository_mirror_last_sync_status correctly reflects current sync state
[ ] quay_repository_mirror_sync_complete returns 1 when all tags synced, 0 otherwise
[ ] quay_repository_mirror_pending_tags accurately reflects tags awaiting sync
[ ] Unit tests achieve >80% coverage of metric update paths
[ ] Metrics do not significantly impact worker performance (% overhead)
[ ] Documentation includes alerting rule examples

Technical Approach

The implementation will extend the existing mirror worker (workers/repomirrorworker/__init__.py) to update metrics during synchronization operations. Metrics will be collected from database records and audit logs, leveraging existing failure tracking mechanisms.
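
To illustrate where metric updates could hook into a sync attempt, here is a rough sketch that uses an in-memory stand-in instead of real Prometheus objects. InMemoryMirrorMetrics, sync_with_metrics, and the status names are hypothetical and are not taken from the Quay code base; the actual worker logic in perform_mirror() is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

RepoKey = Tuple[str, str]  # (namespace, repository)


@dataclass
class InMemoryMirrorMetrics:
    """Hypothetical stand-in for the Prometheus Gauge and Counter objects."""

    status: Dict[RepoKey, str] = field(default_factory=dict)
    failures: Dict[RepoKey, int] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)

    def set_status(self, key: RepoKey, status: str) -> None:
        self.status[key] = status
        self.history.append(status)

    def count_failure(self, key: RepoKey) -> None:
        self.failures[key] = self.failures.get(key, 0) + 1


def sync_with_metrics(
    key: RepoKey, sync_fn: Callable[[], None], metrics: InMemoryMirrorMetrics
) -> bool:
    """Run one sync attempt and record in-progress, success, or failure."""
    metrics.set_status(key, "syncing")
    try:
        sync_fn()
    except Exception:
        # A real worker would likely catch narrower exception types.
        metrics.set_status(key, "failed")
        metrics.count_failure(key)
        return False
    metrics.set_status(key, "success")
    return True


def failing_sync() -> None:
    raise RuntimeError("simulated registry error")


metrics = InMemoryMirrorMetrics()
repo = ("acme", "nginx")
ok_first = sync_with_metrics(repo, lambda: None, metrics)
ok_second = sync_with_metrics(repo, failing_sync, metrics)
```

Keeping the metric calls behind a small object like this is one way to let unit tests assert on status transitions without a Prometheus registry.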

Components Affected

workers/repomirrorworker/__init__.py: Add new Prometheus Gauge and Counter definitions, update metrics during perform_mirror() and related functions
data/model/repo_mirror.py: Potentially add helper functions to query pending tags and sync status
util/metrics/prometheus.py: Register new metrics if needed (may not be required if defined in worker)

Key Technical Decisions

Per-repository labels: All metrics will use namespace and repository labels for granular monitoring. Cardinality is bounded by the number of mirrored repositories.
Status label pattern: Following Prometheus best practices, quay_repository_mirror_last_sync_status will expose one series per status label value, set to 1 for the status that is currently active and 0 otherwise, so that sum() aggregations count repositories in each state.
Error context via labels: The last_error_reason label provides failure diagnosis without requiring log parsing, improving troubleshooting efficiency.
Leverage existing data: Metrics will be derived from existing sync status tracking in RepoMirrorConfig and audit logs rather than introducing new data collection.
Consistent with existing patterns: Follow the metric naming and structure patterns established in util/metrics/prometheus.py.
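
The status label pattern described above can be sketched as follows. The status names in STATUSES and the helpers set_active_status and count_in_status are assumptions for illustration, and the last_error_reason label is left out to keep the example short.

```python
from prometheus_client import CollectorRegistry, Gauge

# Hypothetical status names; the real set would come from Quay's mirror status values.
STATUSES = ("never_run", "syncing", "success", "failed")

registry = CollectorRegistry()
last_sync_status = Gauge(
    "quay_repository_mirror_last_sync_status",
    "1 for the active status of a mirrored repository, 0 for the others",
    labelnames=["namespace", "repository", "status"],
    registry=registry,
)


def set_active_status(namespace, repository, active):
    """Set the active status series to 1 and every other status series to 0."""
    if active not in STATUSES:
        raise ValueError(f"unknown status: {active}")
    for status in STATUSES:
        last_sync_status.labels(
            namespace=namespace, repository=repository, status=status
        ).set(1 if status == active else 0)


def count_in_status(status, repos):
    """Local analogue of a PromQL sum() over one status label value."""
    total = 0.0
    for namespace, repository in repos:
        value = registry.get_sample_value(
            "quay_repository_mirror_last_sync_status",
            {"namespace": namespace, "repository": repository, "status": status},
        )
        total += value or 0.0
    return total


repos = [("acme", "nginx"), ("acme", "redis")]
set_active_status("acme", "nginx", "syncing")
set_active_status("acme", "nginx", "failed")  # the earlier "syncing" series drops to 0
set_active_status("acme", "redis", "success")
```

Resetting the inactive series to 0 on each transition is what keeps a sum over one status value equal to the number of repositories in that state.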

Risks and Mitigations

Risk: High cardinality if deployments have thousands of mirrored repositories
Mitigation: Label cardinality is bounded by actual mirrored repositories (not all repositories). Document guidance on metric retention policies for large deployments.
Risk: Performance impact from metric updates during sync operations
Mitigation: Metric updates are lightweight operations. Benchmark shows overhead per update. Include performance testing in acceptance criteria.
Risk: Metric staleness if repositories are deleted
Mitigation: Implement cleanup of metrics for deleted repositories, or document that stale metrics will be cleared on worker restart.
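
One possible shape for the cleanup mitigation, assuming the worker keeps track of which label sets it has exported. prune_deleted and the exported set are hypothetical; remove() is the prometheus_client method for dropping a single labelled series.

```python
from prometheus_client import CollectorRegistry, Gauge

registry = CollectorRegistry()
pending_tags = Gauge(
    "quay_repository_mirror_pending_tags",
    "Tags awaiting synchronization for a mirrored repository",
    labelnames=["namespace", "repository"],
    registry=registry,
)


def prune_deleted(gauge, exported, still_mirrored):
    """Drop series for repositories that are no longer mirrored.

    Returns the set of label tuples that remain exported.
    """
    for namespace, repository in sorted(exported - still_mirrored):
        gauge.remove(namespace, repository)
    return exported & still_mirrored


# Track every label set that has been exported so stale ones can be found later.
exported = set()
for key, count in {("acme", "nginx"): 2, ("acme", "redis"): 0}.items():
    pending_tags.labels(*key).set(count)
    exported.add(key)

# Suppose acme/redis was deleted; only acme/nginx is still mirrored.
exported = prune_deleted(pending_tags, exported, {("acme", "nginx")})
```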

Testing Strategy

Unit tests: Test metric updates in perform_mirror() for success, failure, and preemption scenarios. Mock database queries to verify metric values.
Integration tests: Verify metrics are properly pushed to gateway and can be scraped by Prometheus in test environment.
Performance tests: Benchmark sync operations with metrics enabled vs disabled to ensure % overhead.
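
A rough sketch of how the enabled-versus-disabled comparison could be measured. fake_sync and fake_sync_with_metrics are placeholders for a real sync operation, and no acceptance threshold is assumed here because this epic does not state one.

```python
import time
from typing import Callable


def overhead_percent(baseline_seconds: float, instrumented_seconds: float) -> float:
    """Relative overhead of the instrumented run, as a percentage of the baseline."""
    if baseline_seconds <= 0:
        raise ValueError("baseline must be positive")
    return (instrumented_seconds - baseline_seconds) / baseline_seconds * 100.0


def time_calls(fn: Callable[[], None], repeats: int) -> float:
    """Total wall-clock time for `repeats` calls of fn."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return time.perf_counter() - start


def fake_sync() -> None:
    # Placeholder workload standing in for a real sync operation.
    sum(range(1000))


counter = {"updates": 0}


def fake_sync_with_metrics() -> None:
    fake_sync()
    counter["updates"] += 1  # stands in for a metric update


baseline = time_calls(fake_sync, 200)
instrumented = time_calls(fake_sync_with_metrics, 200)
print(f"measured overhead: {overhead_percent(baseline, instrumented):.2f}%")
```

Timings from a single short run are noisy, so a real benchmark would probably repeat the measurement several times and compare medians.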

Rollout Strategy

Metrics are additive and do not change existing behavior
No feature flag required - metrics are always available when mirror worker runs
Backward compatible - no changes to existing quay_repository_rows_unmirrored metric
Documentation should be published alongside release

Documentation Needs

User-facing documentation:
- New metrics reference with descriptions and labels
- Example Prometheus queries for common monitoring scenarios
- Example alerting rules for sync failures and stale mirrors
Developer documentation:
- Architecture decision record for metric design choices

Related Work

Original Feature: RFE-6452
Related Issues: RFE-7439 (Additional Prometheus metrics - incorporates this feature for broader metrics initiative)
Enhancement Proposal: https://github.com/quay/enhancements/pull/35

is incorporated by

RFE-6452 Add metrics and health endpoint for Repository Mirrors
