Type: Feature
Resolution: Unresolved
The current distribution of vLLM supports NVIDIA GPUs, Intel Gaudi, and AMD ROCm. It would be great to have a version of vLLM capable of running smaller models on a CPU without a GPU.
The initial strategy is limited to x86 support only.
List of models to validate for the initial support (a validation sketch follows the list):
- TinyLlama-1.1B-Chat-v1.0
- Llama-3.2-1B-Instruct
- granite-3.2-2b-instruct
- TinyLlama-1.1B-Chat-v1.0-pruned2.4
- TinyLlama-1.1B-Chat-v1.0-marlin
- TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds
- facebook/opt-125m
- Qwen2-0.5B-Instruct-AWQ
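As a rough illustration of what per-model validation could look like, below is a minimal offline-inference sketch using vLLM's Python API on the CPU backend. The Hugging Face repo IDs, the VLLM_CPU_KVCACHE_SPACE value, and the prompt are assumptions for illustration, not the actual validation harness:

    import os
    from vllm import LLM, SamplingParams

    # Assumed CPU-backend setting: GiB reserved for the KV cache.
    os.environ.setdefault("VLLM_CPU_KVCACHE_SPACE", "4")

    # Assumed Hugging Face repo IDs for a few of the models listed above.
    MODELS = [
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "ibm-granite/granite-3.2-2b-instruct",
        "facebook/opt-125m",
    ]

    params = SamplingParams(temperature=0.0, max_tokens=32)
    for name in MODELS:
        llm = LLM(model=name)  # CPU-only build picks up the CPU device
        out = llm.generate(["Say hello in one sentence."], params)
        print(name, "->", out[0].outputs[0].text.strip())

In practice one model per process is safer, since vLLM does not fully release engine resources between LLM instances.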
Model performance evaluation resources / guides (a minimal latency probe is sketched after this list):
- https://developers.redhat.com/articles/2025/06/17/how-run-vllm-cpus-openshift-gpu-free-inference
- vLLM (CPU) Performance Evaluation Guide
- Performance Evaluation Guide For embedding models leveraging vllm bench serve
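Before running the full guides above, a quick single-request latency probe against a locally running server (started with, e.g., vllm serve on one of the models above) can catch obvious problems early. The URL, model name, and prompt below are assumptions for illustration; the endpoint itself is vLLM's standard OpenAI-compatible /v1/completions:

    import time
    import requests

    URL = "http://localhost:8000/v1/completions"  # assumed local server address
    payload = {
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # assumed served model
        "prompt": "Explain AVX-512 in one sentence.",
        "max_tokens": 64,
    }

    start = time.perf_counter()
    resp = requests.post(URL, json=payload, timeout=300)
    elapsed = time.perf_counter() - start
    resp.raise_for_status()

    # vLLM's OpenAI-compatible responses include token usage counts.
    tokens = resp.json()["usage"]["completion_tokens"]
    print(f"{elapsed:.2f}s total, {tokens} tokens, {tokens / elapsed:.1f} tok/s")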
Midstream INFERENG CPU image build:
quay.io/vllm/automation-vllm:cpu-19905651936
In addition to the first delivery in RHAIIS 3.3, which supported AVX2 only, this second delivery should support AVX2, AVX512, and AVX512 AMX in a single build.
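Since one build must cover all three ISA levels, it helps to know which ones a given host actually advertises. A minimal sketch reading the kernel's flag names (avx2, avx512f, amx_tile) from /proc/cpuinfo on Linux; this is illustrative host inspection, not the build's actual dispatch logic:

    def host_isa_flags(path="/proc/cpuinfo"):
        """Report which of the targeted ISA levels the host CPU advertises."""
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    flags = set(line.split(":", 1)[1].split())
                    break
            else:
                return {}
        return {
            "avx2": "avx2" in flags,
            "avx512": "avx512f" in flags,  # AVX-512 Foundation
            "amx": "amx_tile" in flags,    # AMX tile support
        }

    if __name__ == "__main__":
        print(host_isa_flags())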
Relates to:
- AIPCC-7787 Build vllm components and images for CPU-only x86_64 AVX2 systems (Closed)