Feature · Resolution: Unresolved · Critical · Product / Portfolio Work · 100% To Do, 0% In Progress, 0% Done
Feature Overview (aka. Goal Summary)
This feature introduces support for configuring the router's maximum connection limit and exposing that value through Prometheus monitoring.
By making router max connection limits configurable and observable, cluster administrators gain improved visibility into router capacity, saturation risk, and scaling behavior under varying traffic loads.
Goals (aka. expected user outcomes)
The primary goal is to avoid hitting a hard-configured connection "ceiling" during periods of extreme traffic. Additional goals:
- Enable administrators to configure a router max connections value that is surfaced through Prometheus metrics.
- Improve observability into router capacity utilization and connection pressure.
- Support proactive alerting on router saturation risks before service degradation occurs.
- Allow alignment of router monitoring data with custom deployment sizes, traffic profiles, and SLAs.
- Maintain backward compatibility for clusters that rely on existing default behavior.
Requirements (aka. Acceptance Criteria):
Functional Requirements
- Provide a configurable parameter to define the maximum number of router connections.
- Expose this value via Prometheus-compatible metrics.
- Ensure metrics clearly distinguish between (illustrated in the sketch after the requirements lists):
  - Configured maximum connections
  - Current/active connections
- Support dynamic updates through supported configuration mechanisms (e.g., operator-managed configuration).
Non-Functional Requirements
- No significant performance impact on router dataplane operations.
- Metrics must follow existing Prometheus naming and labeling conventions.
- Defaults must preserve existing behavior when no configuration is provided.
- Configuration changes should be observable without requiring cluster downtime.
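To picture the distinction between the configured maximum and current connections, here is a minimal metrics-exposure sketch using the Go Prometheus client. The metric names (router_max_connections, router_current_connections), the port, and the placeholder limit are illustrative assumptions, not the router's actual metric names or implementation.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metric names used for illustration only; the real router
// metrics may use different names and labels.
var (
	maxConnections = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "router_max_connections",
		Help: "Configured maximum number of router connections.",
	})
	currentConnections = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "router_current_connections",
		Help: "Number of currently active router connections.",
	})
)

func main() {
	prometheus.MustRegister(maxConnections, currentConnections)

	// The configured limit would come from operator-managed configuration;
	// 20000 is a placeholder value.
	maxConnections.Set(20000)

	// Serve metrics for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9100", nil)
}
```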
Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it has been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out of scope for a given release, ensure you provide the OCPSTRAT (for the configuration to be supported in the future) as well.
| Deployment considerations | List applicable specific needs (N/A = not applicable) |
| Self-managed, managed, or both | |
| Classic (standalone cluster) | |
| Hosted control planes | |
| Multi node, Compact (three node), or Single node (SNO), or all | |
| Connected / Restricted Network | |
| Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
| Operator compatibility | |
| Backport needed (list applicable versions) | |
| UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
| Other (please specify) | |
Use Cases:
- Capacity Planning: Operators track router connection utilization relative to configured limits to determine when to scale routers or adjust traffic distribution.
- Alerting: Platform teams configure alerts when active connections approach a configurable percentage of the maximum (see the query sketch after this list).
- Multi-Tenant Clusters: Administrators tune router connection limits to match tenant traffic expectations and avoid noisy-neighbor scenarios.
- Performance Troubleshooting: SREs correlate connection pressure with latency, error rates, or dropped connections during incident analysis.
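As one way to picture the alerting and capacity-planning use cases, the sketch below queries Prometheus for the ratio of active to configured maximum connections; in practice this expression would more likely live in a Prometheus alerting rule. The metric names and the Prometheus address are assumptions carried over from the earlier sketch, not the product's actual names.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder Prometheus address; substitute the in-cluster endpoint.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.example.com:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Ratio of active connections to the configured maximum, using the
	// hypothetical metric names from the earlier sketch. An alert could
	// fire when this ratio exceeds a chosen threshold, e.g. 0.8.
	const query = `router_current_connections / router_max_connections`

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println("connection utilization:", result)
}
```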
Out of Scope
Background
Routers play a critical role in handling ingress traffic and maintaining client connections. While active connection metrics are commonly available, the lack of a configurable and observable maximum connection reference point limits the effectiveness of monitoring and alerting. Static or implicit limits can cause headroom issues, especially during periods of extreme traffic (e.g., holiday sales) and in clusters with diverse workloads or custom router deployments. Providing explicit, configurable max connection values improves clarity and operational confidence.
Customer Considerations
- Backward Compatibility: Existing clusters should continue to function without requiring configuration changes.
- Simplicity: Configuration should be easy to understand and manage through existing tooling.
- Documentation: Clear guidance must be provided on:
  - Recommended values
  - How to interpret metrics
  - How to build alerts based on the new data
- Safety: Misconfiguration should be mitigated through validation or sensible defaults to prevent misleading metrics.
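To make the safety point concrete, here is a minimal validation sketch. The function name, bounds, and default are assumptions for illustration only, not the product's actual validation rules.

```go
package config

// validateMaxConnections returns a safe connection limit, falling back to a
// sensible default when no value is configured or the requested value is
// outside a plausible range. All numbers below are illustrative placeholders.
func validateMaxConnections(requested int) int {
	const (
		defaultMax = 20000
		lowerBound = 2000
		upperBound = 2000000
	)
	if requested == 0 {
		// No value configured: preserve existing default behavior.
		return defaultMax
	}
	if requested < lowerBound || requested > upperBound {
		// Out-of-range values fall back to the default rather than
		// producing a misleading metric.
		return defaultMax
	}
	return requested
}
```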
Documentation Considerations
Interoperability Considerations
- Fully compatible with existing Prometheus deployments and dashboards.
- Integrates with standard alerting frameworks (e.g., Alertmanager).
- Works alongside existing router metrics without breaking dashboards or queries.
- Supports interoperability with downstream observability tools that consume Prometheus metrics (e.g., Grafana).
- Aligns with operator-managed configuration models and does not require custom patches or sidecars.