Type: Feature
Resolution: Unresolved
Priority: Critical
Feature title: Build vllm components and images for CPU-only systems
Feature Overview:
Several things drive the need for this work:
- Batch inferencing jobs running on large systems with x86, Power, and Z CPUs do not need the "realtime" response time provided by hosts with hardware accelerators.
- Components of the system, such as llama-stack, benefit from having a vLLM build that can run inline in a pod on any system to perform simple inferencing with small models (see the sketch below).
- Partners outside of Red Hat who will provide vLLM or torch plugins need the CPU builds of those libraries to drive their plugins.
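For illustration, a minimal sketch of the "inline in a pod" case: running vLLM's offline API on a CPU-only host with one of the small models listed later in this feature. The model choice and sampling settings are placeholders, and a CPU-enabled vLLM build is assumed to be installed already.

```python
# Minimal CPU-only smoke test with vLLM's offline API (no GPU required).
# Assumes a CPU build of vLLM is installed; the model choice is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # small model suitable for CPU inference
params = SamplingParams(temperature=0.0, max_tokens=32)

outputs = llm.generate(["What is Red Hat AI Inference Server?"], params)
for out in outputs:
    print(out.outputs[0].text)
```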
Product(s) associated:
RHAIIS: Yes
RHEL AI: No
RHOAI: Yes
Goals:
- We need to provide CPU-only builds of PyTorch and vLLM for all CPU architectures.
- We need to provide CPU-only builds of the vLLM image in RHAIIS for all CPU architectures.
Requirements:
- CPU arch and optimizations:
  - x86_64 with AVX2 optimization (see the verification sketch below)
- Torch 2.9.1
- vLLM 0.13
- RHAIIS vLLM image
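As a hedged illustration of how a consuming team might verify the CPU-only builds: the exact wheel versions and install source are whatever AIPCC ships, and the capability check relies on PyTorch's torch.backends.cpu.get_cpu_capability(), available in recent releases.

```python
# Quick sanity check of a CPU-only PyTorch/vLLM install.
# Versions below mirror the requirements; adjust to the wheels actually shipped.
import torch
import vllm

print("torch:", torch.__version__)                    # expected to report 2.9.1
print("vllm:", vllm.__version__)                      # expected to report 0.13.x
print("CUDA available:", torch.cuda.is_available())   # False on a CPU-only build

# Reports the instruction set PyTorch is using on this host, e.g. "AVX2" or
# "AVX512"; helps confirm the AVX2-optimized build is active.
print("CPU capability:", torch.backends.cpu.get_cpu_capability())
```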
Done - Acceptance Criteria:
- Component teams can install vLLM and torch into their images using AIPCC base images without hardware accelerator support.
- Partners can build on the RHAIIS CPU image to add their own plugins, providing accelerator support for accelerator types not built inside Red Hat (a hedged plugin sketch follows this list).
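To illustrate the partner workflow, a minimal sketch of how an out-of-tree accelerator plugin could register itself through vLLM's entry-point based plugin system. The package, module, and class names here are hypothetical, and the actual integration points for a given accelerator may differ; this only shows the registration shape described in upstream vLLM's plugin documentation.

```python
# my_accel_plugin/__init__.py -- hypothetical partner package layered on top of
# the RHAIIS CPU image. vLLM discovers plugins through Python entry points; the
# function below would be wired up in the package metadata, e.g.:
#   [project.entry-points."vllm.platform_plugins"]
#   my_accel = "my_accel_plugin:register"
def register() -> str | None:
    """Return the import path of the partner's Platform class, or None
    if the accelerator is not present on this host."""
    try:
        import my_accel_runtime  # hypothetical vendor runtime library
    except ImportError:
        return None
    return "my_accel_plugin.platform.MyAccelPlatform"
```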
Use Cases - i.e. User Experience & Workflow:
Include use case diagrams, main success scenarios, alternative flow scenarios.
Out of Scope:
CPU arches and optimizations:
- aarch64 with ARM compute library (via oneDNN)
- ppc64le / Power
- s390x / Z
- x86_64 AVX512 (via oneDNN)
We plan to deliver ARM, Power, Z, and AVX512 support in 3.4EA1.
Additional AVX512 optimizations for the x86_64v4 ISA depend on new features in vLLM 0.14+. vLLM 0.13 can be compiled for either AVX2 or AVX512, and an AVX512 build does not work on older CPUs. Upcoming releases will be able to detect CPU capabilities and select the optimal implementation at runtime (a detection sketch follows below).
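As an illustration of why compile-time selection matters today, a small Linux-only sketch that reads the host's CPU flags from /proc/cpuinfo to decide whether an AVX512 build could even run. The flag names are the standard Linux cpuinfo flags; the decision logic is only an example, not how vLLM itself will do runtime dispatch.

```python
# Decide which vLLM CPU build a host can run, based on Linux /proc/cpuinfo.
# AVX512 builds require avx512f; AVX2 builds only need avx2.
def host_cpu_flags() -> set[str]:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = host_cpu_flags()
if "avx512f" in flags:
    print("Host supports AVX512 -- either build would run.")
elif "avx2" in flags:
    print("Host supports AVX2 only -- an AVX512 build would fail here; use the AVX2 build.")
else:
    print("Host has neither AVX2 nor AVX512 -- unsupported for the planned builds.")
```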
Documentation Considerations:
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation.
Original Request:
Building vLLM to run on CPU-only systems (no GPU) for smaller models.
List of models to validate for the initial support:
- TinyLlama-1.1B-Chat-v1.0
- Llama-3.2-1B-Instruct
- granite-3.2-2b-instruct
- TinyLlama-1.1B-Chat-v1.0-pruned2.4
- TinyLlama-1.1B-Chat-v1.0-marlin
- TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds
- facebook/opt-125m
- Qwen2-0.5B-Instruct-AWQ
GuideLLM benchmarks:
https://developers.redhat.com/articles/2025/06/17/how-run-vllm-cpus-openshift-gpu-free-inference
vLLM (CPU) Performance Evaluation Guide
Midstream INFERENG CPU image build:
quay.io/vllm/automation-vllm:cpu-19905651936
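As an illustrative smoke test against such a CPU deployment: this assumes the image is run with vLLM's standard OpenAI-compatible server listening on port 8000 and that one of the listed models has been loaded; the endpoint, key, and model name are examples, not the validated configuration.

```python
# Smoke-test a CPU vLLM server (e.g. started from the midstream CPU image)
# through its OpenAI-compatible API. Endpoint, key, and model are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="facebook/opt-125m",          # one of the models listed above
    prompt="Hello from a CPU-only vLLM deployment:",
    max_tokens=32,
    temperature=0.0,
)
print(resp.choices[0].text)
```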
is cloned by:
- AIPCC-8765 Build vllm components and images for CPU-only systems (aarch64, Power, Z, x86_64v4) (Refinement)
- AIPCC-8766 Build vllm components and images for CPU-only x86_64 AVX512 systems (Closed)
relates to:
- AIPCC-7460 Build Python wheels on IBM Power to publish them to RH Public index (Review)