1. Proposed title of this feature request
Support Topology-Aware Gang Scheduling
2. What is the nature and description of the request?
Add support for topology-aware gang scheduling to OpenShift, allowing llm-d to optimize the placement of vLLM pods and reduce the cost of inference.
Distributed LLM inference involves many kinds of GPU-to-GPU communication, and its performance is therefore highly sensitive to the physical placement of pods on the underlying hardware.
Tensor parallelism relies on dense AllReduce collective communication operations provided by libraries like NCCL or RCCL. These AllReduces, typically used in single-node inference deployments, are faster when all worker GPUs reside within the same PCIe domain.
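As a concrete illustration, here is a minimal PyTorch sketch of the dense AllReduce that a tensor-parallel rank issues (vLLM's actual implementation differs; this only shows the collective pattern). It assumes a multi-GPU node and a launcher like torchrun that sets LOCAL_RANK:

    import os
    import torch
    import torch.distributed as dist

    # Each tensor-parallel rank holds a partial layer output; the AllReduce
    # sums the shards across all ranks. One of these runs per transformer
    # layer per token, so the GPU-to-GPU interconnect bounds decode latency.
    dist.init_process_group(backend="nccl")  # NCCL routes over NVLINK/PCIe/IB
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # set by torchrun
    partial = torch.randn(4096, device="cuda")  # this rank's shard
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)  # dense collective
    dist.destroy_process_group()

Launched with, e.g., torchrun --nproc-per-node=4, NCCL picks the fastest path available between the participating GPUs, which is why same-domain placement matters.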
WideEP deployments for multi-node mixture-of-experts models use sparse all-to-all dispatch and combine operations like those in DeepEP. These all-to-all operations are among the most expensive communication costs of inference at scale, and they are faster when traffic between racks is minimized; this is particularly true for multi-node NVLINK (MNNVL) systems like the GB200 NVL72. Such a deployment may also span a dozen or more nodes whose pods must all be scheduled simultaneously, so gang scheduling must compose with topology-aware scheduling.
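To make the composability requirement concrete, the following is a hypothetical Python sketch (illustrative only, not the OpenShift scheduler API). It admits a gang of pods all at once, prefers node sets that span the fewest racks, and places nothing if the full gang does not fit:

    # free_nodes: list of (node_name, rack_id) pairs with spare capacity.
    # Illustrative only: real schedulers also weigh GPU counts, affinity, etc.
    def place_gang(free_nodes, size):
        by_rack = {}
        for name, rack in free_nodes:
            by_rack.setdefault(rack, []).append(name)
        # Fill from the largest rack first: a gang that fits in one rack
        # (or one NVLINK domain) minimizes cross-rack all-to-all traffic.
        chosen = []
        for members in sorted(by_rack.values(), key=len, reverse=True):
            chosen.extend(members)
            if len(chosen) >= size:
                return chosen[:size]  # admit every pod in the gang together
        return None  # all-or-nothing: never admit a partial gang

    nodes = [("n1", "rackA"), ("n2", "rackA"), ("n3", "rackB")]
    print(place_gang(nodes, 2))  # ['n1', 'n2'] -- stays within rackA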
Prefill-decode (P/D) disaggregation is a technique that splits the phases of inference into separate deployments. This allows each phase to be parallelized in a specialized way and can also drastically reduce P99 inter-token latency. P/D disaggregation relies on libraries like NIXL to transfer KV-cache state between inference pods using RDMA over interconnects like InfiniBand, RoCE, or NVLINK. Placing a prefill and a decode instance on the same node allows transfers between them to use the faster NVLINK interconnect, whereas cross-node KV transfers may use the slower InfiniBand or RoCE.
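A back-of-envelope sketch of why co-location matters for these transfers; the cache size and link bandwidths below are assumed placeholders, not measured figures:

    # Assumed numbers for illustration only; substitute real values for a
    # given model, prompt length, and fabric.
    KV_CACHE_BYTES = 2e9  # a multi-gigabyte KV cache for a long prompt
    LINK_BANDWIDTH_GBPS = {
        "NVLINK (same node)": 400,           # GB/s, assumed
        "InfiniBand/RoCE (cross-node)": 40,  # GB/s, assumed
    }

    for link, gbps in LINK_BANDWIDTH_GBPS.items():
        ms = KV_CACHE_BYTES / (gbps * 1e9) * 1e3
        print(f"{link}: {ms:.0f} ms per KV transfer")

Under these assumptions the same-node transfer is an order of magnitude faster, which is exactly the placement decision a topology-aware scheduler can make.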
Topology-aware scheduling of vLLM pods will allow llm-d to optimize these communication operations, which is critically important for cluster-scale inference.
3. Why does the customer need this? (List the business requirements here)
This feature will support the core functionality of llm-d: optimizing the cost of cluster-scale inference.
4. List any affected packages or components.
llm-d
- depends on OCPSTRAT-1786 Gang Scheduling for OpenShift (In Progress)