Type: Epic
Resolution: Unresolved
Priority: Major
This epic aims to provide best practices for configuring OpenShift Service Mesh (OSSM/Istio/Envoy) for high-performance and large-scale environments. It covers documenting the scaling behavior of OSSM components such as istiod, gateways, ztunnel, waypoints, and sidecars; defining the relevant inputs (e.g. number of pods, services, namespaces, clusters, and configuration size) and key metrics (throughput, latency, CPU/memory usage); identifying common usage scenarios (REST throughput, websocket latency, resource efficiency for internal traffic, ambient vs. sidecar, etc.); and developing optimization strategies for each. The epic also includes investigating customer usage patterns to inform our guidance. While the outcome of this epic should influence our perf & scale automated tests (i.e. terminology and scenarios should match), it does not cover creation of the tests themselves.
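A minimal sketch of how the inputs, metrics, and scenarios named above could be captured in a shared schema, so the documentation and the perf & scale automated tests end up using the same terminology. All field and scenario names here are illustrative assumptions, not a defined format:

```python
from dataclasses import dataclass, field

@dataclass
class ScenarioInputs:
    """Scale inputs the guidance is parameterized by (names are illustrative)."""
    pods: int
    services: int
    namespaces: int
    clusters: int = 1
    config_size_kb: int = 0   # approximate size of the mesh configuration

@dataclass
class ScenarioMetrics:
    """Key metrics the epic calls out; values would be filled in from benchmark runs."""
    throughput_rps: float | None = None
    latency_p99_ms: float | None = None
    cpu_cores: float | None = None
    memory_mib: float | None = None

@dataclass
class Scenario:
    """A named usage scenario, e.g. 'REST throughput' or 'ambient vs. sidecar'."""
    name: str
    data_plane_mode: str          # "sidecar", "ambient", or "ambient+waypoint"
    inputs: ScenarioInputs
    metrics: ScenarioMetrics = field(default_factory=ScenarioMetrics)
```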
Original Description:
There is a broad range of topics that fall under "performance and scale" for service mesh that we are often asked to address. This issue was originally created for OpenShift AI, but it is applicable to any user of OpenShift Service Mesh (Istio) or Envoy. A common question is "How can I set up service mesh (or Ingress) for high load or scale?".
Questions this issue should look to answer:
- How do we recommend OSSM be configured for a high-performance deployment (large number of requests per second)? How should resources be adjusted as these parameters increase?
- This question can be answered for:
- A standalone Istio gateway (such as OpenShift Ingress w/ Gateway API support)
- An Istio gateway + an example application (Bookinfo) with sidecars and mTLS encryption enabled
- An Istio gateway + an example application (Bookinfo) in ambient mode (ztunnel) with mTLS encryption enabled
- An Istio gateway + an example application (Bookinfo) in ambient mode (ztunnel + waypoint proxy) with mTLS encryption enabled
- How do we recommend OSSM be configured for a high-scale environment (large number of services, namespaces, nodes, clusters)? How should resources be adjusted as these parameters increase? (See the rough sizing sketch after this list.)
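As a starting point for the "how should resources be adjusted" questions above, a minimal back-of-the-envelope sizing sketch is shown below. The per-replica throughput and per-sidecar overhead constants are placeholders to be replaced with numbers measured by the perf & scale tests, not recommendations:

```python
import math

# Placeholder capacity figures; these must come from actual benchmark runs
# (gateway RPS per replica, sidecar CPU/memory overhead per pod, etc.).
GATEWAY_RPS_PER_REPLICA = 10_000    # assumed sustainable RPS per gateway replica
SIDECAR_CPU_MILLICORES = 100        # assumed steady-state CPU per sidecar
SIDECAR_MEMORY_MIB = 128            # assumed steady-state memory per sidecar

def gateway_replicas(target_rps: float, headroom: float = 0.5) -> int:
    """Gateway replicas needed to serve target_rps with the given headroom fraction."""
    return max(2, math.ceil(target_rps * (1 + headroom) / GATEWAY_RPS_PER_REPLICA))

def sidecar_overhead(meshed_pods: int) -> dict:
    """Aggregate data-plane overhead for a sidecar-mode mesh with meshed_pods pods."""
    return {
        "cpu_cores": meshed_pods * SIDECAR_CPU_MILLICORES / 1000,
        "memory_gib": meshed_pods * SIDECAR_MEMORY_MIB / 1024,
    }

if __name__ == "__main__":
    print(gateway_replicas(target_rps=50_000))   # 8 replicas with 50% headroom
    print(sidecar_overhead(meshed_pods=500))     # {'cpu_cores': 50.0, 'memory_gib': 62.5}
```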
It should also be noted that the tests/infrastructure to carry this out should be the same for the product as for upstream, so to the greatest degree possible this work should use and contribute to upstream project material. That said, since upstream Istio uses BoringSSL while downstream Istio uses OpenSSL, there is value in upstream vs. downstream comparisons.
From the original issue:
Adding this as a story from the mail thread "Questions about Service Mesh scaling/higher load scenarios"
With regard to OpenShift AI scale and performance testing, we (OpenShift Serverless) are doing our own testing to have some numbers for OpenShift Serverless with and without Service Mesh integration. In that context, we have some questions:
In general, do you have any recommendations or best practices on how to set up Service Mesh for larger environments and/or higher-load scenarios? For example, we think it is a good idea to enable autoscaling of istio-ingressgateway and to increase requests/limits for the istio-ingressgateway and istio-proxy containers. Is there anything else we should be aware of?
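As one concrete illustration of the autoscaling suggestion above, a minimal HPA sketch for the ingress gateway is shown below. It assumes the default upstream deployment name istio-ingressgateway in the istio-system namespace, and the replica bounds and CPU target are placeholders; the actual values are exactly what this epic's sizing guidance should provide:

```python
import yaml  # PyYAML

# Assumed deployment name/namespace (upstream defaults); placeholder bounds and target.
ingress_gateway_hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "istio-ingressgateway", "namespace": "istio-system"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "istio-ingressgateway",
        },
        "minReplicas": 2,    # placeholder: keep at least two replicas for availability
        "maxReplicas": 10,   # placeholder: ceiling derived from expected peak RPS
        "metrics": [{
            "type": "Resource",
            "resource": {
                "name": "cpu",
                "target": {"type": "Utilization", "averageUtilization": 80},
            },
        }],
    },
}

print(yaml.safe_dump(ingress_gateway_hpa, sort_keys=False))
```

The printed manifest could then be applied with `oc apply -f -`.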
We should also provide sizing guidance for the Istio sidecar proxies, gateways and control plane, so that we can add a new perf & scale section to our product documentation.
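For the sidecar part of that sizing guidance, per-workload proxy resources can be tuned with Istio's sidecar resource annotations on the pod template. The sketch below only illustrates the annotation keys; the CPU/memory values are placeholders, not numbers produced by this epic:

```python
# Istio pod-template annotations that override the injected sidecar's resources.
# Values below are placeholders, pending the sizing guidance from this epic.
sidecar_resource_annotations = {
    "sidecar.istio.io/proxyCPU": "100m",         # placeholder request
    "sidecar.istio.io/proxyMemory": "128Mi",     # placeholder request
    "sidecar.istio.io/proxyCPULimit": "2000m",   # placeholder limit
    "sidecar.istio.io/proxyMemoryLimit": "1Gi",  # placeholder limit
}

# These would go under spec.template.metadata.annotations of a Deployment,
# or be set mesh-wide via the proxy resource defaults in the Istio configuration.
```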
Related information
- blocks: SRVKS-1075 Performance Benchmarking for Serving (Backlog)