OpenShift Service Mesh / OSSM-11824

Document specific parameters impacting perf & scale of Istio

    • Type: Story
    • Resolution: Unresolved
    • Priority: Major
    • OSSM 3.2.0
    • Envoy, Istio, Ztunnel

      Improving the performance and scalability of Istio is a common challenge for customers. While we can give guidance on making a mesh more scalable (such as OSSM-8271), it can be very difficult to identify the source of perf/scale issues and then correct or mitigate them once they have occurred.

      Istio's current documentation on performance and scale lacks specifics. It mentions many factors that can impact performance, but does not tell the user how to measure these factors or what to do about them. For example, it mentions "number of client connections", but does not tell the user how to measure this, nor does it give guidance on determining which factor is the bottleneck.
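To illustrate the kind of specificity that is missing, here is a minimal sketch of one possible way to measure "number of client connections" from a sidecar. It assumes the Envoy admin `/stats` text output (for example, retrieved with `pilot-agent request GET stats`) and sums the `downstream_cx_active` and `upstream_cx_active` gauges. The sample text and its values are made up for illustration, and whether a given proxy exposes these stats depends on its stats configuration, which is an assumption here.

```python
# Illustrative sample of Envoy admin `/stats` text output; all values are made up.
SAMPLE_STATS = """\
listener.0.0.0.0_15006.downstream_cx_active: 12
listener.0.0.0.0_15001.downstream_cx_active: 30
listener.0.0.0.0_15006.worker_0.downstream_cx_active: 7
listener.0.0.0.0_15006.downstream_cx_total: 480
cluster.outbound|9080||reviews.default.svc.cluster.local.upstream_cx_active: 4
cluster.outbound|9080||reviews.default.svc.cluster.local.upstream_rq_time: P0(nan,0) P25(nan,1.05)
server.total_connections: 42
"""


def active_connections(stats_text):
    """Sum active downstream (client) and upstream connection gauges.

    Hypothetical helper, not part of Istio or Envoy.
    """
    totals = {"downstream": 0, "upstream": 0}
    for line in stats_text.splitlines():
        name, sep, value = line.rpartition(": ")
        value = value.strip()
        if not sep or not value.isdigit():
            continue  # skip histograms and malformed lines
        if ".worker_" in name:
            continue  # per-worker stats would double count the listener total
        if name.startswith("listener.") and name.endswith(".downstream_cx_active"):
            totals["downstream"] += int(value)
        elif name.startswith("cluster.") and name.endswith(".upstream_cx_active"):
            totals["upstream"] += int(value)
    return totals


print(active_connections(SAMPLE_STATS))
```

A guide entry for this factor would pair a measurement like this with an explanation of its impact and what to do when it is high.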

      A better data plane performance doc would list the specific metrics and configuration parameters to check, whether they come from Istio or Kubernetes, along with an explanation of the impact of each metric/parameter. It could also give examples of expected values for each one (a potential range based on Istio's typical performance characteristics). It would also provide guidance on how to handle a situation where the metric/parameter is elevated.
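As a rough sketch of what one entry in such a doc could look like in a form that is readable by both humans and machines, the following models the fields described above: source, impact, expected range, and remediation. The class name, field names, and the threshold value are all hypothetical placeholders, not values taken from Istio documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TuningEntry:
    """One metric/parameter entry in a hypothetical tuning guide."""
    name: str
    source: str          # e.g. "Istio", "Envoy", or "Kubernetes"
    impact: str          # what an elevated value does to the data plane
    expected_max: float  # upper bound of the expected range
    remediation: str     # what to do when the value is elevated

    def assess(self, observed):
        """Return a short finding for an observed value."""
        if observed > self.expected_max:
            return (f"{self.name} elevated ({observed} > {self.expected_max}): "
                    f"{self.remediation}")
        return f"{self.name} within expected range"


# Placeholder entry: the threshold and the texts are NOT real guidance.
entry = TuningEntry(
    name="downstream_cx_active",
    source="Envoy",
    impact="placeholder: describe the impact here",
    expected_max=1000,
    remediation="placeholder: describe the remediation here",
)

print(entry.assess(1500))
print(entry.assess(10))
```

Keeping each entry in a fixed shape like this is one option for making the guide usable as context for automated tooling as well as for readers.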

      jewertow@redhat.com put together a similar guide for Istio control plane tuning, and we need to do the same for the data plane. We can consider incorporating the control plane guide too, though for this issue the data plane should be the priority.

      Expected outcome:

      The outcome of this issue should be an in-depth guide on performance (and scale) tuning of the Istio data plane (first sidecar, then ambient) that includes the specific logs, metrics, and parameters to be watched or tuned. This can be published to the openshift-service-mesh GitHub repo.

      It should also raise gaps in available telemetry to be considered for RFEs. For example, https://github.com/kiali/kiali/issues/9003.

      Keep in mind that the audience for this doc is both human (users) and machine (LLMs).

      Next steps:

      The content could be considered for upstream contribution (a better Istio performance page), for product documentation, or as a resource for Kiali:

      • To inform Kiali enhancements that help users with Istio data plane performance
      • As context for Lightspeed, used with a Kiali MCP server, so that users can get actionable advice based on their actual mesh metrics

              Assignee: Unassigned
              Reporter: Jamie Longmuir (jlongmui@redhat.com)