Type: Feature
Resolution: Unresolved
Priority: Critical
Feature title: Build vllm components and images for CPU-only systems, part 2
Feature Overview:
Several things drive the need for this work:
- Batch inferencing jobs that run on large systems using x86, Power, and Z CPUs do not need the "realtime" response times provided by hosts with hardware accelerators.
- Components of the system, such as llama-stack, benefit from a vLLM build that can run inline in a pod on any system to perform simple inferencing with small models.
- Partners outside of Red Hat who will provide vLLM or PyTorch plugins need the CPU builds of those libraries to drive their plugins.
Product(s) associated:
RHAIIS: Yes
RHEL AI: No
RHOAI: Yes
Goals:
- We need to provide CPU-only builds of PyTorch and vLLM for all CPU architectures (a build sketch follows this list).
- We need to provide CPU-only builds of the vLLM image in RHAIIS for all CPU architectures.
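For reference, a minimal sketch of what a CPU-only build looks like with upstream bits (the PyTorch CPU wheel index and VLLM_TARGET_DEVICE=cpu are upstream conventions; the requirements file path varies by vLLM version, and the productized builds would come from AIPCC-built wheels rather than these indexes):

    # CPU-only PyTorch wheels, no CUDA dependencies
    pip install torch --index-url https://download.pytorch.org/whl/cpu

    # Build vLLM from source targeting CPU (no GPU toolchain required)
    git clone https://github.com/vllm-project/vllm.git
    cd vllm
    pip install -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
    VLLM_TARGET_DEVICE=cpu pip install --no-build-isolation -e .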
Requirements:
- CPU arch and optimizations:
  - aarch64 (via oneDNN)
  - ppc64le / Power
  - s390x / Z
  - x86_64v4 (AVX512 via oneDNN)
- Torch ??
- vLLM ??
- RHAIIS vLLM image
Done - Acceptance Criteria:
- Component teams can install vLLM and PyTorch into their images using AIPCC base images without hardware accelerator support.
- Partners can build on the RHAIIS CPU image to add their own plugins, supporting accelerator types not built inside Red Hat (see the Containerfile sketch below).
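To make the partner criterion concrete, a minimal Containerfile sketch; the base image reference and the plugin package name are hypothetical placeholders, not published AIPCC/RHAIIS artifact names:

    # Hypothetical image name; substitute the real RHAIIS CPU image reference.
    FROM registry.example.com/rhaiis/vllm-cpu:latest

    # Layer a partner-provided out-of-tree accelerator plugin on top of the
    # CPU image; vLLM discovers such platform plugins via Python entry points.
    # "partner-accelerator-plugin" is a placeholder package name.
    RUN pip install partner-accelerator-plugin

    # Serve as usual; the plugin supplies the accelerator backend at runtime.
    ENTRYPOINT ["vllm", "serve"]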
Use Cases - i.e. User Experience & Workflow:
Out of Scope:
Documentation Considerations:
Original Request:
Building vLLM to run on CPU-only systems (no GPU) for smaller models.
Models to validate for initial support (a CPU serving sketch follows the list):
- TinyLlama-1.1B-Chat-v1.0
- Llama-3.2-1B-Instruct
- granite-3.2-2b-instruct
- TinyLlama-1.1B-Chat-v1.0-pruned2.4
- TinyLlama-1.1B-Chat-v1.0-marlin
- TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds
- facebook/opt-125m
- Qwen2-0.5B-Instruct-AWQ
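For a quick functional check of any of these models on a CPU-only host, a minimal sketch (the model choice and flags are illustrative; vllm serve and the OpenAI-compatible /v1/completions endpoint are standard vLLM):

    # Start the server on CPU with one of the small validation models
    vllm serve facebook/opt-125m --host 0.0.0.0 --port 8000

    # From another shell, smoke-test the OpenAI-compatible API
    curl -s http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "facebook/opt-125m", "prompt": "Hello, world", "max_tokens": 16}'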
GuideLLM benchmarks (example invocation below):
- https://developers.redhat.com/articles/2025/06/17/how-run-vllm-cpus-openshift-gpu-free-inference
- vLLM (CPU) Performance Evaluation Guide
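A sweep-style run against a local CPU server might look like the following; the flag names are recalled from the upstream GuideLLM README and should be verified against the linked guide:

    guidellm benchmark \
      --target "http://localhost:8000" \
      --rate-type sweep \
      --max-seconds 60 \
      --data "prompt_tokens=256,output_tokens=128"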
Midstream INFERENG CPU image build:
quay.io/vllm/automation-vllm:cpu-19905651936
Issue links:
- clones: AIPCC-7787 Build vllm components and images for CPU-only x86_64 AVX2 systems (Review)
- is duplicated by: AIPCC-8766 Build vllm components and images for CPU-only x86_64 AVX512 systems (Closed)
- is related to: AIPCC-8766 Build vllm components and images for CPU-only x86_64 AVX512 systems (Closed)