Initiative
Resolution: Duplicate
Description:
Introduce the ability to benchmark the performance of a single deployed model across multiple GenAI engine configurations, starting with vLLM.
The system should support automated benchmarking of the same model while varying a range of engine-specific parameters, including but not limited to:
- max_batch_size
- max_tokens
- gpu_memory_utilization
- tensor_parallel_size
- disable_custom_all_reduce
- kv_cache_dtype
- enable_prefix_caching
- trust_remote_code
- gpu_lazy_init
- max_model_len
- max_context_len_to_capture
- sliding_window
- num_experts (for MoE models)
- cllm_paged_attention (if supported)
- engine_version (to allow version comparison)
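A minimal sketch of how such a configuration sweep could be expressed is shown below. The model id, the `run_benchmark.py` entry point, and the flag-rendering convention are assumptions for illustration; whether each parameter maps one-to-one onto a real vLLM option depends on the engine version and is not confirmed here.

```python
# Sketch: enumerate engine-configuration variants for one model (illustrative only).
import itertools

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # example model id (assumption)

# Parameter grid: each key is varied independently; values are illustrative.
PARAM_GRID = {
    "gpu_memory_utilization": [0.80, 0.90],
    "tensor_parallel_size": [1, 2],
    "kv_cache_dtype": ["auto", "fp8"],
    "enable_prefix_caching": [False, True],
}

def build_variants(grid):
    """Yield one dict per combination of parameter values (full Cartesian product)."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def to_cli_args(model, variant):
    """Render a variant as CLI arguments for a hypothetical benchmark runner."""
    args = ["python", "run_benchmark.py", "--model", model]  # runner name is an assumption
    for key, value in variant.items():
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            if value:
                args.append(flag)  # boolean options rendered as plain switches
        else:
            args += [flag, str(value)]
    return args

if __name__ == "__main__":
    for variant in build_variants(PARAM_GRID):
        print(" ".join(to_cli_args(MODEL, variant)))
```

Each printed command corresponds to one benchmark run; the sweep could equally be driven by a job scheduler rather than a local loop.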
The user (e.g., an MLE) should be able to:
- Configure and launch multiple benchmark runs with different configurations
- Include full metadata for each run (model, config, hardware, workload, etc.)
- Easily compare results across configuration variants
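One possible shape for the per-run metadata and a simple comparison step is sketched below, assuming each run is persisted as a self-describing JSON document. The directory layout, field names, and metric keys are assumptions, not a defined interface.

```python
# Sketch: persist per-run metadata/results and compare variants (schema is an assumption).
import json
from pathlib import Path

RESULTS_DIR = Path("benchmark_results")  # directory layout is an assumption

def save_run(run_id, model, engine_config, hardware, workload, metrics):
    """Write one benchmark run as a JSON record carrying its full metadata."""
    RESULTS_DIR.mkdir(exist_ok=True)
    record = {
        "run_id": run_id,
        "model": model,
        "engine_config": engine_config,  # e.g. the variant dict from the sweep above
        "hardware": hardware,            # e.g. {"gpu": "A100-80GB", "count": 2}
        "workload": workload,            # e.g. {"prompt_len": 512, "output_len": 128}
        "metrics": metrics,              # e.g. {"throughput_tok_s": ..., "p99_latency_ms": ...}
    }
    (RESULTS_DIR / f"{run_id}.json").write_text(json.dumps(record, indent=2))

def compare(metric="throughput_tok_s"):
    """Rank all recorded runs by a single metric (higher is better in this sketch)."""
    runs = [json.loads(p.read_text()) for p in RESULTS_DIR.glob("*.json")]
    runs.sort(key=lambda r: r["metrics"].get(metric, 0), reverse=True)
    for r in runs:
        print(f'{r["run_id"]}: {r["metrics"].get(metric)}  config={r["engine_config"]}')
    return runs
```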
This will support deeper analysis of configuration tradeoffs and assist product teams in selecting optimal deployment settings.