Initiative
Resolution: Duplicate
Description:
Introduce the ability to benchmark the performance of a single deployed model across multiple GenAI engine configurations, starting with vLLM.
The system should support automated benchmarking of the same model while varying a range of engine-specific parameters, including but not limited to:
- max_batch_size
- max_tokens
- gpu_memory_utilization
- tensor_parallel_size
- disable_custom_all_reduce
- kv_cache_dtype
- enable_prefix_caching
- trust_remote_code
- gpu_lazy_init
- max_model_len
- max_context_len_to_capture
- sliding_window
- num_experts (for MoE models)
- cllm_paged_attention (if supported)
- engine_version (to allow version comparison)
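A minimal sketch of how such a configuration sweep could be expressed is shown below. The model id, the `run_benchmark.py` entry point, and the flag-rendering convention are assumptions for illustration; whether each parameter maps one-to-one onto a real vLLM option depends on the engine version and is not confirmed here.

```python
# Sketch: enumerate engine-configuration variants for one model (illustrative only).
import itertools

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # example model id (assumption)

# Parameter grid: each key is varied independently; values are illustrative.
PARAM_GRID = {
    "gpu_memory_utilization": [0.80, 0.90],
    "tensor_parallel_size": [1, 2],
    "kv_cache_dtype": ["auto", "fp8"],
    "enable_prefix_caching": [False, True],
}

def build_variants(grid):
    """Yield one dict per combination of parameter values (full Cartesian product)."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def to_cli_args(model, variant):
    """Render a variant as CLI arguments for a hypothetical benchmark runner."""
    args = ["python", "run_benchmark.py", "--model", model]  # runner name is an assumption
    for key, value in variant.items():
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            if value:
                args.append(flag)  # boolean options rendered as plain switches
        else:
            args += [flag, str(value)]
    return args

if __name__ == "__main__":
    for variant in build_variants(PARAM_GRID):
        print(" ".join(to_cli_args(MODEL, variant)))
```

Each printed command corresponds to one benchmark run; the sweep could equally be driven by a job scheduler rather than a local loop.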
The user (e.g., an MLE) should be able to:
- Configure and launch multiple benchmark runs with different configurations
- Include full metadata for each run (model, config, hardware, workload, etc.)
- Easily compare results across configuration variants
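One possible shape for the per-run metadata and a simple comparison step is sketched below, assuming each run is persisted as a self-describing JSON document. The directory layout, field names, and metric keys are assumptions, not a defined interface.

```python
# Sketch: persist per-run metadata/results and compare variants (schema is an assumption).
import json
from pathlib import Path

RESULTS_DIR = Path("benchmark_results")  # directory layout is an assumption

def save_run(run_id, model, engine_config, hardware, workload, metrics):
    """Write one benchmark run as a JSON record carrying its full metadata."""
    RESULTS_DIR.mkdir(exist_ok=True)
    record = {
        "run_id": run_id,
        "model": model,
        "engine_config": engine_config,  # e.g. the variant dict from the sweep above
        "hardware": hardware,            # e.g. {"gpu": "A100-80GB", "count": 2}
        "workload": workload,            # e.g. {"prompt_len": 512, "output_len": 128}
        "metrics": metrics,              # e.g. {"throughput_tok_s": ..., "p99_latency_ms": ...}
    }
    (RESULTS_DIR / f"{run_id}.json").write_text(json.dumps(record, indent=2))

def compare(metric="throughput_tok_s"):
    """Rank all recorded runs by a single metric (higher is better in this sketch)."""
    runs = [json.loads(p.read_text()) for p in RESULTS_DIR.glob("*.json")]
    runs.sort(key=lambda r: r["metrics"].get(metric, 0), reverse=True)
    for r in runs:
        print(f'{r["run_id"]}: {r["metrics"].get(metric)}  config={r["engine_config"]}')
    return runs
```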
This will support deeper analysis of configuration tradeoffs and assist product teams in selecting optimal deployment settings.