    • Type: Task
    • Resolution: Done
    • Priority: Major
    • Component: Test Infrastructure
    • Story Points: 5
    • Sprint: RHDHPAI Sprint 3268

      Task Description

       

      We already have the granite3-dense:8b model (a.k.a. Granite-3.0-8B-Instruct) hosted on Ollama on our dev RHOAI cluster.

      This request is to host it with vLLM instead.

      The disadvantage of hosting models on Ollama is the model-switching cost. Every time someone makes an inference API call for a model that is not currently loaded into memory, llama.cpp/Ollama swaps out the old model and swaps in the new one. For larger models this swap can take a noticeable amount of time.
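
      That switching cost is easy to observe by timing a cold call (model not yet resident) against an immediate warm follow-up. A minimal sketch against the Ollama generate API, assuming the server is reachable at http://localhost:11434 (the host is a placeholder; only the granite3-dense:8b tag comes from this ticket):

      import time
      import requests

      OLLAMA_URL = "http://localhost:11434/api/generate"  # placeholder host; use the dev RHOAI cluster's Ollama route
      MODEL = "granite3-dense:8b"  # model tag from this ticket

      def timed_generate(prompt):
          # One non-streaming generate call; returns wall-clock latency in seconds.
          start = time.perf_counter()
          resp = requests.post(
              OLLAMA_URL,
              json={"model": MODEL, "prompt": prompt, "stream": False},
              timeout=600,
          )
          resp.raise_for_status()
          return time.perf_counter() - start

      # The first call may include the model swap-in/load time; the second should hit a warm model.
      print("cold call: %.1fs" % timed_generate("Say hi."))
      print("warm call: %.1fs" % timed_generate("Say hi again."))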

      Steps

       

      • Onboard a new node with a GPU. A smaller GPU should be sufficient for an 8B model.
      • Create a new vLLM model inference server for Granite-3.0-8B-Instruct.
      • Register/advertise the new inference server through the API gateway.
      • Test (see the smoke-test sketch after this list).
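
      For the test step, a quick smoke test through the API gateway against the OpenAI-compatible endpoint that vLLM exposes should be enough. A minimal sketch, assuming a placeholder gateway route, served model name, and token (the real values come from the registration step above):

      import requests

      # Placeholders: substitute the real gateway route, served model name, and credentials
      # once the inference server is registered.
      GATEWAY_URL = "https://api-gateway.example.com/v1/chat/completions"
      MODEL_ID = "granite-3.0-8b-instruct"
      TOKEN = "changeme"

      resp = requests.post(
          GATEWAY_URL,
          headers={"Authorization": "Bearer " + TOKEN},
          json={
              "model": MODEL_ID,
              "messages": [{"role": "user", "content": "Reply with the single word: pong"}],
              "max_tokens": 10,
          },
          timeout=120,
      )
      resp.raise_for_status()
      print(resp.json()["choices"][0]["message"]["content"])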

       

       

      Assignee: John Collier (johnmcollier)
      Reporter: John Collier (johnmcollier)
      RHIDP - AI