AI Platform Core Components
AIPCC-7787

Build vllm components and images for CPU-only x86_64 AVX2 systems

    0% To Do, 0% In Progress, 100% Done

      2026-Jan-14 (green): Builder release v26.0.0 can build vLLM for x86_64 CPU (AVX2-only). The RHAIIS pipeline is building vLLM 0.13.0+rhai0 with Torch 2.9.1. The InferEng team is working on the container image for CPU.

      Feature title: Build vllm components and images for CPU-only systems

      Feature Overview:

      Several things drive the need for this work:

      1. Batch inferencing jobs that run on large systems using x86, Power, and Z CPUs do not need the "realtime" response time provided by hosts with hardware accelerators.
      2. Components of the system, such as llama-stack, benefit from having a vLLM build that can run inline in a pod on any system to perform simple inferencing with small models.
      3. Partners outside of Red Hat who will provide vLLM or Torch plugins need the CPU builds of those libraries to drive their plugins.

      Product(s) associated:

      RHAIIS: Yes
      RHEL AI: No
      RHOAI: Yes

      Goals:

      • We need to provide CPU-only builds of PyTorch and vLLM for all CPU architectures.
      • We need to provide CPU-only builds of the vLLM image in RHAIIS for all CPU architectures.

      Requirements:

      • CPU arch and optimizations:
        • x86_64 with AVX2 optimization
      • Torch 2.9.1
      • vLLM 0.13
      • RHAIIS vLLM image
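
      As a quick check of the pins above, the sketch below verifies a built environment. It relies only on public PyTorch APIs (torch.backends.cpu.get_cpu_capability() reports the capability the Torch binary is using); the expected values are assumptions taken from this ticket, not guarantees of the build.

        import torch

        # CPU-only build checks; the version pin and AVX2 target come from this ticket.
        print("torch version:", torch.__version__)           # expect a 2.9.1 build
        print("CUDA available:", torch.cuda.is_available())  # expect False on a CPU-only build
        print("CPU capability:", torch.backends.cpu.get_cpu_capability())  # expect "AVX2"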

      Done - Acceptance Criteria:

      • Component teams can install vLLM and Torch into their images using AIPCC base images without hardware-accelerator support (see the smoke-test sketch after this list).
      • Partners can build on the RHAIIS CPU image to add their own plugins, providing support for accelerator types not built inside Red Hat.
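
      A minimal smoke test for the first criterion, assuming vLLM's public offline API (vllm.LLM and SamplingParams) and the smallest model from the validation list further below:

        from vllm import LLM, SamplingParams

        # On a CPU-only build, vLLM uses its CPU backend automatically.
        llm = LLM(model="facebook/opt-125m")
        params = SamplingParams(temperature=0.0, max_tokens=32)

        outputs = llm.generate(["The capital of France is"], params)
        print(outputs[0].outputs[0].text)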

      Use Cases - i.e. User Experience & Workflow:
      Include use case diagrams, main success scenarios, alternative flow scenarios.

      Out of Scope:

      CPU architectures and optimizations:

      • aarch64 with ARM compute library (via oneDNN)
      • ppc64le / Power
      • s390x / Z
      • x86_64 AVX512 (via oneDNN)

      We plan to deliver ARM, Power, Z, and AVX512 support in 3.4EA1.
      Additional AVX512 optimizations for the x86_64-v4 ISA depend on new features in vLLM 0.14+. vLLM 0.13 can be compiled for either AVX2 or AVX512, and an AVX512 build does not work on older CPUs. Upcoming releases will be able to detect CPU capabilities and select the optimal implementation at runtime; until then, deployments can make that selection themselves, as sketched below.
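
      A sketch of that selection done on the deployment side in the meantime, assuming a Linux host: read the flags line from /proc/cpuinfo and pick a build variant. The variant names here are hypothetical placeholders, not real artifact tags.

        def cpu_isa_level() -> str:
            """Pick a build variant from the host CPU flags (Linux only)."""
            flags: set[str] = set()
            with open("/proc/cpuinfo") as f:
                for line in f:
                    if line.startswith("flags"):
                        flags = set(line.split(":", 1)[1].split())
                        break
            if "avx512f" in flags:
                return "avx512"  # hypothetical variant name
            if "avx2" in flags:
                return "avx2"    # hypothetical variant name
            return "generic"

        print("select build variant:", cpu_isa_level())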

      Documentation Considerations:
      Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation.

      Original Request:

      Building vLLM to run on CPU-only systems (no GPU) for smaller models.

      List of models to validate for the initial support (a validation sketch follows the list):

      • TinyLlama-1.1B-Chat-v1.0
      • Llama-3.2-1B-Instruct
      • granite-3.2-2b-instruct
      • TinyLlama-1.1B-Chat-v1.0-pruned2.4
      • TinyLlama-1.1B-Chat-v1.0-marlin
      • TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds
      • facebook/opt-125m
      • Qwen2-0.5B-Instruct-AWQ
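
      One hedged way to drive that validation: load each model on the CPU build and assert it generates tokens. Entries without an org prefix need their Hugging Face namespace filled in (the TinyLlama namespace below is an assumption), and in practice each model is better run in a fresh process so memory is released between loads.

        from vllm import LLM, SamplingParams

        MODELS = [
            "facebook/opt-125m",
            "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # assumed namespace
            # ... remaining entries from the list above
        ]

        params = SamplingParams(temperature=0.0, max_tokens=16)
        for name in MODELS:
            text = LLM(model=name).generate(["Hello"], params)[0].outputs[0].text
            assert text, f"{name} produced no output"
            print(f"{name}: OK")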

      GuideLLM benchmarks:
      https://developers.redhat.com/articles/2025/06/17/how-run-vllm-cpus-openshift-gpu-free-inference

      vLLM (CPU) Performance Evaluation Guide

      Midstream INFERENG CPU image build:
      quay.io/vllm/automation-vllm:cpu-19905651936
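
      Once that image is running (for example with port 8000 published via podman), a client-side check might look like the sketch below. It assumes the container serves vLLM's OpenAI-compatible API on vLLM's default port 8000 with a small model loaded; adjust for the image's actual entrypoint.

        import requests

        BASE = "http://localhost:8000/v1"

        # List the models the server has loaded.
        print(requests.get(f"{BASE}/models").json())

        # One short completion to confirm end-to-end inference on CPU.
        resp = requests.post(
            f"{BASE}/completions",
            json={"model": "facebook/opt-125m", "prompt": "Hello", "max_tokens": 8},
        )
        print(resp.json()["choices"][0]["text"])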
