Epic
Resolution: Unresolved
Normal
Security & Compliance
To Do
[Workers] Repository Mirror Metrics and Health Monitoring
A PR already exists for this epic: https://github.com/quay/quay/pull/4399
Related enhancement: https://github.com/quay/enhancements/pull/35/
Overview
Implement comprehensive Prometheus metrics and health monitoring for Quay's repository mirroring system. This enhancement provides operators with granular visibility into mirror synchronization status, failure tracking, and operational health to enable proactive monitoring and alerting.
Context
Currently, the only metric available for repository mirroring is quay_repository_rows_unmirrored, which provides a count of repositories that have not yet been mirrored. This is insufficient for operators who need to:
- Monitor individual repository sync status
- Track synchronization failures for alerting
- Identify pending tags awaiting synchronization
- Verify complete synchronization per repository
The existing mirror worker exposes metrics via a push gateway on port 9091, but lacks the granular, per-repository metrics needed for effective monitoring and troubleshooting.
Reference: RFE-6452
Scope
In Scope
- Four new Prometheus metrics for repository mirroring:
  - Tags pending synchronization per repository
  - Last synchronization status per repository
  - Complete synchronization indicator per repository
  - Synchronization failure counter per repository
- Metrics exposed via existing Prometheus push gateway infrastructure
- Per-repository labels (namespace, repository) for all metrics
- Error reason labels for failure diagnostics
- Documentation for metric usage and alerting examples
Out of Scope
- Health endpoint for mirror worker containers (determined not necessary - worker primarily runs Skopeo)
- UI dashboard for mirror metrics (can use Grafana with these metrics)
- Alerting rules (operators define their own based on metrics)
- System-level health checks (DB, S3, Splunk) - tracked separately in RFE-7439
- Per-repository push/pull metrics - tracked separately in RFE-7439
- Pruning policy features - tracked separately in RFE-7439
Child Stories
- Implement repository mirror sync status metrics: Add quay_repository_mirror_last_sync_status (Gauge with status/error labels) and quay_repository_mirror_sync_complete (binary Gauge) metrics to track synchronization outcomes. Includes unit tests for status transitions during sync success, failure, and in-progress states.
- Implement repository mirror sync progress and failure metrics: Add quay_repository_mirror_pending_tags (Gauge for tags awaiting sync) and quay_repository_mirror_sync_failures_total (Counter for alerting) metrics. Includes unit tests for metric updates during sync operations.
- Add documentation for repository mirror metrics: Document all new metrics with their labels and meanings. Provide example Prometheus queries and alerting rules for common monitoring scenarios (sync failures, stale mirrors, pending tags).
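The four metrics named in the child stories could be declared roughly as follows. This is a minimal sketch, not the code from the linked PR: the private `CollectorRegistry`, the `reason` label name on the failure counter, and the `record_sync_result` helper are all assumptions made for illustration, and the real worker may register its metrics differently.

```python
from prometheus_client import CollectorRegistry, Counter, Gauge

# A private registry keeps this sketch isolated (assumption; Quay may use the default one).
registry = CollectorRegistry()

REPO_LABELS = ["namespace", "repository"]

pending_tags = Gauge(
    "quay_repository_mirror_pending_tags",
    "Tags awaiting synchronization for a mirrored repository",
    REPO_LABELS,
    registry=registry,
)
last_sync_status = Gauge(
    "quay_repository_mirror_last_sync_status",
    "Status of the most recent synchronization",
    REPO_LABELS + ["status", "last_error_reason"],
    registry=registry,
)
sync_complete = Gauge(
    "quay_repository_mirror_sync_complete",
    "1 when all tags are synchronized, 0 otherwise",
    REPO_LABELS,
    registry=registry,
)
sync_failures = Counter(
    "quay_repository_mirror_sync_failures_total",
    "Number of failed synchronizations",
    REPO_LABELS + ["reason"],  # "reason" is a hypothetical label name
    registry=registry,
)


def record_sync_result(namespace, repository, total_tags, synced_tags, error_reason=None):
    """Hypothetical helper: update progress and failure metrics after one sync attempt."""
    remaining = max(total_tags - synced_tags, 0)
    pending_tags.labels(namespace=namespace, repository=repository).set(remaining)
    sync_complete.labels(namespace=namespace, repository=repository).set(
        1 if remaining == 0 else 0
    )
    if error_reason is not None:
        sync_failures.labels(
            namespace=namespace, repository=repository, reason=error_reason
        ).inc()
```

Because the failure metric is a Counter, alerting would normally be based on its rate of increase rather than its absolute value.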
Dependencies
- Technical:
  - Existing Prometheus client library (prometheus_client) already in use
  - Push gateway infrastructure on port 9091 already configured
  - Current mirror worker and sync status tracking in data/model/repo_mirror.py
  - Audit logging for sync events in data/logs_model
- Cross-team:
  - Documentation team for user-facing metric documentation
- External:
  - None - uses existing Prometheus infrastructure
Success Criteria
- [ ] All four metrics are exposed via Prometheus push gateway
- [ ] Metrics include per-repository granularity with namespace/repository labels
- [ ] quay_repository_mirror_sync_failures_total Counter increments on each failure
- [ ] quay_repository_mirror_last_sync_status correctly reflects current sync state
- [ ] quay_repository_mirror_sync_complete returns 1 when all tags synced, 0 otherwise
- [ ] quay_repository_mirror_pending_tags accurately reflects tags awaiting sync
- [ ] Unit tests achieve >80% coverage of metric update paths
- [ ] Metrics do not significantly impact worker performance (% overhead)
- [ ] Documentation includes alerting rule examples
Technical Approach
The implementation will extend the existing mirror worker (workers/repomirrorworker/__init__.py) to update metrics during synchronization operations. Metrics will be collected from database records and audit logs, leveraging existing failure tracking mechanisms.
Components Affected
- workers/repomirrorworker/__init__.py: Add new Prometheus Gauge and Counter definitions, update metrics during perform_mirror() and related functions
- data/model/repo_mirror.py: Potentially add helper functions to query pending tags and sync status
- util/metrics/prometheus.py: Register new metrics if needed (may not be required if defined in worker)
Key Technical Decisions
- Per-repository labels: All metrics will use namespace and repository labels for granular monitoring. Cardinality is bounded by the number of mirrored repositories.
- Status label pattern: Following Prometheus best practices, quay_repository_mirror_last_sync_status will expose one series per status label value, set to 1 for the status that is currently active, allowing sum() aggregations.
- Error context via labels: The last_error_reason label provides failure diagnosis without requiring log parsing, improving troubleshooting efficiency.
- Leverage existing data: Metrics will be derived from existing sync status tracking in RepoMirrorConfig and audit logs rather than introducing new data collection.
- Consistent with existing patterns: Follow the metric naming and structure patterns established in util/metrics/prometheus.py.
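The status label pattern can be illustrated with a short sketch. It assumes three status names (`success`, `failed`, `in_progress`), resets inactive statuses to 0, and omits the `last_error_reason` label for brevity; the status names, the reset behaviour, and the `set_last_sync_status` helper are illustrative assumptions, not the worker's actual code.

```python
from prometheus_client import CollectorRegistry, Gauge

# Assumed status names; the real set comes from the mirror worker's sync states.
STATUSES = ("success", "failed", "in_progress")

registry = CollectorRegistry()
last_sync_status = Gauge(
    "quay_repository_mirror_last_sync_status",
    "1 for the active status of the most recent sync, 0 for the others",
    ["namespace", "repository", "status"],
    registry=registry,
)


def set_last_sync_status(namespace, repository, current):
    """Hypothetical helper: mark exactly one status series as active for a repository."""
    if current not in STATUSES:
        raise ValueError(f"unknown status: {current}")
    for status in STATUSES:
        last_sync_status.labels(
            namespace=namespace, repository=repository, status=status
        ).set(1 if status == current else 0)
```

With this shape, a query such as `sum by (status) (quay_repository_mirror_last_sync_status)` would count the repositories currently in each state.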
Risks and Mitigations
- Risk: High cardinality if deployments have thousands of mirrored repositories
Mitigation: Label cardinality is bounded by actual mirrored repositories (not all repositories). Document guidance on metric retention policies for large deployments.
- Risk: Performance impact from metric updates during sync operations
Mitigation: Metric updates are lightweight operations. Benchmark shows overhead per update. Include performance testing in acceptance criteria.
- Risk: Metric staleness if repositories are deleted
Mitigation: Implement cleanup of metrics for deleted repositories, or document that stale metrics will be cleared on worker restart.
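One way the cleanup mitigation for deleted repositories might look is sketched below, using the `remove()` method that prometheus_client provides on labelled metrics. The `forget_repository` helper is hypothetical and only handles metrics whose labels are exactly namespace and repository; metrics with extra labels (such as status) would need every label value passed to `remove()`.

```python
from prometheus_client import CollectorRegistry, Gauge

registry = CollectorRegistry()
pending_tags = Gauge(
    "quay_repository_mirror_pending_tags",
    "Tags awaiting synchronization for a mirrored repository",
    ["namespace", "repository"],
    registry=registry,
)


def forget_repository(metrics, namespace, repository):
    """Hypothetical helper: drop the per-repository series of each given metric."""
    for metric in metrics:
        try:
            metric.remove(namespace, repository)
        except KeyError:
            # Some client versions raise if the series was never created.
            pass
```

Calling `forget_repository([pending_tags], "acme", "web")` when a mirror configuration is deleted would stop the local registry from reporting that repository's series.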
Testing Strategy
- Unit tests: Test metric updates in perform_mirror() for success, failure, and preemption scenarios. Mock database queries to verify metric values.
- Integration tests: Verify metrics are properly pushed to gateway and can be scraped by Prometheus in test environment.
- Performance tests: Benchmark sync operations with metrics enabled vs disabled to ensure % overhead.
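The enabled-versus-disabled comparison in the performance tests could be computed with a small timing harness like the one below. It is a sketch using only the standard library; `measure` and `overhead_percent` are hypothetical helper names, and the callables passed in would be stand-ins for a sync operation with and without metric updates.

```python
import time


def measure(fn, iterations=1000):
    """Return the total wall-clock seconds taken to call fn() `iterations` times."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return time.perf_counter() - start


def overhead_percent(baseline_seconds, instrumented_seconds):
    """Relative slowdown of the instrumented run versus the baseline, in percent."""
    if baseline_seconds <= 0:
        raise ValueError("baseline must be positive")
    return (instrumented_seconds - baseline_seconds) / baseline_seconds * 100.0
```

Running each variant several times and comparing medians would make the result less sensitive to noise on a shared test machine.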
Rollout Strategy
- Metrics are additive and do not change existing behavior
- No feature flag required - metrics are always available when mirror worker runs
- Backward compatible - no changes to existing quay_repository_rows_unmirrored metric
- Documentation should be published alongside release
Documentation Needs
- User-facing documentation:
  - New metrics reference with descriptions and labels
  - Example Prometheus queries for common monitoring scenarios
  - Example alerting rules for sync failures and stale mirrors
- Developer documentation:
  - Architecture decision record for metric design choices
Related Work
- Original Feature: RFE-6452
- Related Issues: RFE-7439 (Additional Prometheus metrics - incorporates this feature for broader metrics initiative)
- Enhancement Proposal: https://github.com/quay/enhancements/pull/35
- Is incorporated by: RFE-6452 Add metrics and health endpoint for Repository Mirrors (Refinement)