Type: Feature
Resolution: Unresolved
Priority: Critical
Feature title: Build vllm components and images for CPU-only systems, part 2
Feature Overview:
Several things drive the need for this work:
- Batch inferencing jobs that run on large systems using x86, Power, and Z CPUs do not need the "realtime" response times provided by hosts with hardware accelerators.
- Components of the system, such as llama-stack, benefit from a vLLM build that can run inline in a pod on any system to perform simple inferencing with small models.
- Partners outside of Red Hat who will provide vLLM or PyTorch plugins need the CPU builds of those libraries to drive their plugins.
Product(s) associated:
RHAIIS: Yes
RHEL AI: No
RHOAI: Yes
Goals:
- We need to provide CPU-only builds of PyTorch and vLLM for all CPU architectures (a build sketch follows this list).
- We need to provide CPU-only builds of the vLLM image in RHAIIS for all CPU architectures.
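For reference, a minimal sketch of what a CPU-only build looks like with upstream bits (the PyTorch CPU wheel index and VLLM_TARGET_DEVICE=cpu are upstream conventions; the requirements file path varies by vLLM version, and the productized builds would come from AIPCC-built wheels rather than these indexes):

    # CPU-only PyTorch wheels, no CUDA dependencies
    pip install torch --index-url https://download.pytorch.org/whl/cpu

    # Build vLLM from source targeting CPU (no GPU toolchain required)
    git clone https://github.com/vllm-project/vllm.git
    cd vllm
    pip install -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
    VLLM_TARGET_DEVICE=cpu pip install --no-build-isolation -e .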
Requirements:
- CPU arch and optimizations:
  - aarch64 (via oneDNN)
  - ppc64le / Power
  - s390x / Z
  - x86_64v4 (AVX512 via oneDNN)
- Torch ??
- vLLM ??
- RHAIIS vLLM image
Done - Acceptance Criteria:
- Component teams can install vLLM and PyTorch into their images using AIPCC base images without hardware accelerator support.
- Partners can build on the RHAIIS CPU image to add their own plugins, supporting accelerator types not built inside Red Hat (see the Containerfile sketch below).
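To make the partner criterion concrete, a minimal Containerfile sketch; the base image reference and the plugin package name are hypothetical placeholders, not published AIPCC/RHAIIS artifact names:

    # Hypothetical image name; substitute the real RHAIIS CPU image reference.
    FROM registry.example.com/rhaiis/vllm-cpu:latest

    # Layer a partner-provided out-of-tree accelerator plugin on top of the
    # CPU image; vLLM discovers such platform plugins via Python entry points.
    # "partner-accelerator-plugin" is a placeholder package name.
    RUN pip install partner-accelerator-plugin

    # Serve as usual; the plugin supplies the accelerator backend at runtime.
    ENTRYPOINT ["vllm", "serve"]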
Use Cases - i.e. User Experience & Workflow:
Out of Scope:
Documentation Considerations:
Original Request:
Building vLLM to run on CPU-only systems (no GPU) for smaller models.
Models to validate for initial support (a CPU serving sketch follows the list):
- TinyLlama-1.1B-Chat-v1.0
- Llama-3.2-1B-Instruct
- granite-3.2-2b-instruct
- TinyLlama-1.1B-Chat-v1.0-pruned2.4
- TinyLlama-1.1B-Chat-v1.0-marlin
- TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds
- facebook/opt-125m
- Qwen2-0.5B-Instruct-AWQ
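For a quick functional check of any of these models on a CPU-only host, a minimal sketch (the model choice and flags are illustrative; vllm serve and the OpenAI-compatible /v1/completions endpoint are standard vLLM):

    # Start the server on CPU with one of the small validation models
    vllm serve facebook/opt-125m --host 0.0.0.0 --port 8000

    # From another shell, smoke-test the OpenAI-compatible API
    curl -s http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "facebook/opt-125m", "prompt": "Hello, world", "max_tokens": 16}'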
GuideLLM benchmarks (example invocation below):
- https://developers.redhat.com/articles/2025/06/17/how-run-vllm-cpus-openshift-gpu-free-inference
- vLLM (CPU) Performance Evaluation Guide
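A sweep-style run against a local CPU server might look like the following; the flag names are recalled from the upstream GuideLLM README and should be verified against the linked guide:

    guidellm benchmark \
      --target "http://localhost:8000" \
      --rate-type sweep \
      --max-seconds 60 \
      --data "prompt_tokens=256,output_tokens=128"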
Midstream INFERENG CPU image build:
quay.io/vllm/automation-vllm:cpu-19905651936
Issue links:
- clones: AIPCC-7787 Build vllm components and images for CPU-only x86_64 AVX2 systems (Review)
- is duplicated by: AIPCC-8766 Build vllm components and images for CPU-only x86_64 AVX512 systems (Closed)
- is related to: AIPCC-8766 Build vllm components and images for CPU-only x86_64 AVX512 systems (Closed)