Issue Type: Feature
Resolution: Unresolved
Priority: Major
Work Type: Product / Portfolio Work
Feature Overview (aka. Goal Summary)
User Defined Networks (UDNs) enable cluster administrators and application owners to create multiple logical networks for pods, allowing traffic segmentation based on workload needs (for example, separating storage, streaming, or control-plane traffic). As adoption increases, the current OVN-Kubernetes implementation exhibits significant scalability challenges when the number of UDNs grows into the hundreds.
This feature focuses on improving the scalability, performance, observability, and testability of UDNs in OVN-Kubernetes, with the immediate objective of stabilizing behavior at 450 UDNs and creating a clear path toward supporting thousands of UDNs in future releases.
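For context, each UDN is expressed as a UserDefinedNetwork custom resource. The sketch below builds such manifests programmatically, as a scale test would; the `k8s.ovn.org/v1` apiVersion and the `topology`/`layer2` spec fields reflect the UserDefinedNetwork API as commonly documented, but exact field names may vary by release and should be treated as illustrative.

```python
# Sketch: build minimal UserDefinedNetwork manifests programmatically.
# Field names mirror the k8s.ovn.org/v1 API as commonly documented;
# treat them as illustrative rather than authoritative.

def udn_manifest(index: int) -> dict:
    """Return a Layer2 primary UDN manifest for tenant namespace <index>."""
    return {
        "apiVersion": "k8s.ovn.org/v1",
        "kind": "UserDefinedNetwork",
        "metadata": {"name": f"udn-{index}", "namespace": f"tenant-{index}"},
        "spec": {
            "topology": "Layer2",
            "layer2": {
                "role": "Primary",
                # Give each network a non-overlapping subnet.
                "subnets": [f"10.{index // 256}.{index % 256}.0/24"],
            },
        },
    }

# Generating the 450-UDN test population is then a simple loop.
manifests = [udn_manifest(i) for i in range(1, 451)]
print(len(manifests), manifests[0]["metadata"]["name"])
```

At scale, it is exactly this kind of bulk creation (450 near-identical networks applied in quick succession) that exercises the creation and reconciliation paths this feature targets.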
Goals (aka. expected user outcomes)
- Improve the scalability of OVN-Kubernetes to reliably support at least 450 User Defined Networks with acceptable performance and stability.
- Identify and address performance bottlenecks related to UDN creation, reconciliation, and steady-state operation.
- Reduce excessive provisioning and convergence times observed when scaling UDN counts.
- Improve observability by defining and exposing meaningful metrics related to UDN scale.
- Integrate scalable UDN testing into automated performance and scale pipelines to prevent regressions.
- Establish a repeatable methodology that allows incremental scaling beyond 450 UDNs toward the long-term goal of 1000+ UDNs.
Requirements (aka. Acceptance Criteria):
Functional Requirements
- OVN-Kubernetes must successfully create, reconcile, and maintain at least 450 UDNs without failure.
- UDN creation and reconciliation times must remain within acceptable and predictable bounds.
- Pod readiness latency must not degrade disproportionately as UDN count increases.
- ovnkube-controller CPU utilization must remain within defined thresholds at scale.
Non-Functional Requirements
- Scale improvements must be measurable using automated performance tooling.
- Changes must be compatible with existing UDN APIs and user workflows.
- The solution must support continuous performance validation and historical trend analysis.
- Metrics related to UDN scale must be available for monitoring and alerting.
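As one illustration of the last two requirements, UDN-scale metrics could be surfaced in Prometheus text exposition format so existing monitoring and alerting can consume them. The metric names below (`ovnkube_udn_total`, `ovnkube_udn_reconcile_duration_seconds`) are hypothetical placeholders, not existing ovnkube-controller metrics; defining the real set is part of this feature's observability goal.

```python
# Sketch: render hypothetical UDN-scale metrics in Prometheus text
# exposition format. The metric names are illustrative only and are
# NOT existing ovnkube-controller metrics.

def render_metrics(udn_total: int, reconcile_p99_seconds: float) -> str:
    lines = [
        "# HELP ovnkube_udn_total Number of UserDefinedNetworks currently configured.",
        "# TYPE ovnkube_udn_total gauge",
        f"ovnkube_udn_total {udn_total}",
        "# HELP ovnkube_udn_reconcile_duration_seconds P99 UDN reconcile latency.",
        "# TYPE ovnkube_udn_reconcile_duration_seconds gauge",
        f"ovnkube_udn_reconcile_duration_seconds {reconcile_p99_seconds}",
    ]
    return "\n".join(lines) + "\n"

print(render_metrics(450, 2.5))
```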
Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For configurations that are out of scope for a given release, also provide the OCPSTRAT reference for the future supported configuration.
| Deployment considerations | List applicable specific needs (N/A = not applicable) |
| --- | --- |
| Self-managed, managed, or both | |
| Classic (standalone cluster) | |
| Hosted control planes | |
| Multi node, Compact (three node), or Single node (SNO), or all | |
| Connected / Restricted Network | |
| Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
| Operator compatibility | |
| Backport needed (list applicable versions) | |
| UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
| Other (please specify) | |
Use Cases:
- OpenShift Virtualization migrations where customers want to mirror the network segmentation they had on their original virtualization platform.
- Large multi-tenant clusters where many teams define isolated pod networks for security and traffic segmentation.
- Workload separation scenarios such as storage, streaming, analytics, and control traffic requiring dedicated networks.
- CI and scale validation environments where hundreds of UDNs are created and destroyed repeatedly.
- Future-proofing clusters to support higher UDN counts as platform adoption and complexity increase.
Questions to Answer:
- What should our targets be for podReadyLatency?
  - Typically about 13 s without UDNs; the target with UDNs should be similar.
- Should L2 and L3 have the same targets?
  - Yes; Layer2 brings up fewer components, so it should perform at least as well.
- What specific operations (UDN creation, reconciliation, OVN object updates, controller loops) dominate time and CPU usage as UDN count increases?
- Why does scaling beyond ~200 UDNs result in increased pod readiness latency and ovnkube-controller CPU utilization?
- Which OVN-Kubernetes components are the primary bottlenecks at higher UDN counts?
- What metrics are required to accurately represent UDN scale, health, and performance?
- How can scale testing be structured to provide fast feedback to developers while remaining representative of real-world clusters?
- What architectural or algorithmic changes are required to move from hundreds to thousands of UDNs?
Out of Scope
- Supporting production-scale guarantees beyond the initially targeted 450 UDNs in this feature.
- Non-OVN-Kubernetes CNI implementations.
Background
Initial scale testing shows that creating and stabilizing a large number of UDNs takes an excessive amount of time, with pod readiness latency and ovnkube-controller CPU usage both increasing as UDN count grows.
The engineering approach is to:
- Start with a smaller but still challenging scale target (450 UDNs),
- Stabilize behavior and performance at that level, and
- Incrementally scale further once bottlenecks are resolved.
Automated scale testing is performed using Orion, often combined with kube-burner, as part of Scale-CI. These tools detect regressions and anomalies by comparing test results against historical baselines and trending data, helping confirm whether changes improve or degrade scalability.
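The baseline comparison these tools perform can be sketched as follows. This is a simplified stand-in for Orion's actual statistics, not its real algorithm; the 20% tolerance and the sample numbers are made up, anchored only to the ~13 s podReadyLatency baseline mentioned earlier.

```python
# Sketch of a baseline regression check in the spirit of Orion:
# compare a new run's P99 pod-ready latency against the mean of
# historical baselines and flag anything beyond a tolerance.
# The tolerance and sample data are illustrative.
import statistics

def p99(samples: list[float]) -> float:
    """Nearest-rank 99th percentile."""
    ordered = sorted(samples)
    rank = max(0, round(0.99 * len(ordered)) - 1)
    return ordered[rank]

def is_regression(baseline_runs: list[list[float]],
                  new_run: list[float],
                  tolerance: float = 0.20) -> bool:
    baseline_p99 = statistics.mean(p99(run) for run in baseline_runs)
    return p99(new_run) > baseline_p99 * (1 + tolerance)

# Historical runs hovered around a 13 s podReadyLatency P99 (the
# no-UDN baseline noted above); a run that jumps to ~20 s is flagged.
history = [[12.0, 12.5, 13.0], [12.8, 13.1, 13.4]]
print(is_regression(history, [13.0, 13.2, 13.5]))  # comparable run
print(is_regression(history, [18.0, 19.5, 21.0]))  # clear regression
```

Tracking results against trending data in this way is what lets the pipeline distinguish genuine scalability improvements from noise between runs.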
Customer Considerations
- Customers expect predictable pod startup times and network readiness even as network complexity grows.
- Excessive UDN provisioning times can block cluster upgrades, scaling operations, and CI pipelines.
- Limited observability into UDN behavior makes troubleshooting scale-related issues difficult for cluster operators.
- Improvements should reduce the operational risk of adopting UDNs at larger scales and increase confidence in using UDNs for critical workloads.
Documentation Considerations
- Update scalability and performance documentation to include supported and tested UDN scale limits.
- Document any new metrics related to UDN scale, including their meaning and recommended alert thresholds.
- Provide guidance on best practices for deploying large numbers of UDNs.
- Document known limitations and expectations when approaching higher UDN counts.