Type: Feature
Resolution: Unresolved
Priority: Major
Request-based autoscaling is not ideal for LLMs because there is no consistent correlation between the number of requests and resource utilization, so concurrency is not a good autoscaling signal.
Users need to autoscale AI workloads on other metrics, such as throughput and latency. Use cases reported at the K8s Serving WG:
https://docs.google.com/document/d/1IFsCwWtIGMujaZZqEMR4ZYeZBi7Hb1ptfImCa1fFf1A/edit?resourcekey=0-8lD1pc_wDVxiwyI8SIhBCw#heading=h.msa1v1j90u
Metrics doc from the same WG: https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk/edit?resourcekey=0-ob5dR-AJxLQ5SvPlA4rdsg#heading=h.qmzyorj64um1
This capability has also been requested by the RHAI field teams.
This issue covers the KServe and Serving integration (serverless mode), adding support for custom metrics via KEDA; a rough sketch of the KEDA side is included below.
On the KServe side there is already a PR to support KEDA integration in raw deployment mode.
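To make the custom-metrics idea concrete, here is a minimal sketch (not the implementation tracked by this issue) of creating a KEDA ScaledObject for a KServe predictor Deployment via the Kubernetes Python client, scaling on a latency metric from Prometheus instead of request concurrency. The deployment name, namespace, Prometheus address, query, and threshold are hypothetical placeholders; only KEDA's scaledobjects.keda.sh/v1alpha1 resource and its prometheus trigger type are assumed.

```python
# Sketch: scale a KServe predictor Deployment on a latency metric via KEDA,
# using the official Kubernetes Python client. All names/values are examples.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

scaled_object = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "llm-predictor-scaler", "namespace": "models"},
    "spec": {
        # Hypothetical KServe predictor Deployment (raw deployment mode).
        "scaleTargetRef": {"name": "llm-predictor-default"},
        "minReplicaCount": 1,
        "maxReplicaCount": 8,
        "triggers": [
            {
                "type": "prometheus",
                "metadata": {
                    "serverAddress": "http://prometheus.monitoring:9090",
                    # Illustrative query: p95 time-per-output-token latency.
                    "query": "histogram_quantile(0.95, sum(rate(time_per_output_token_seconds_bucket[2m])) by (le))",
                    "threshold": "0.2",
                },
            }
        ],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="keda.sh",
    version="v1alpha1",
    namespace="models",
    plural="scaledobjects",
    body=scaled_object,
)
```

In serverless mode the equivalent behavior would need to go through the Knative Serving autoscaling path rather than a plain Deployment target, which is the gap this issue is about.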
Depends on: SRVKS-1224 Support custom metrics with the Keda HPA based autoscaler (In Progress)