Type: Task
Resolution: Done
Priority: Major
Sprint: RHDHPAI Sprint 3268
Task Description
We already have the granite3-dense:8b (a.k.a. Granite-3.0-8B-Instruct) model hosted on Ollama on our dev RHOAI cluster.
This request is to host it with vLLM instead.
The disadvantage of hosting models in Ollama is the switching cost: every time someone makes an inference API call for a model that is not loaded into memory, llama.cpp/Ollama swaps the old model out and loads the new one in. For larger models this swap time can be noticeable.
Steps
- Onboard a new node with a GPU; a smaller one should be sufficient.
- Create a new vLLM model inference server for Granite-3.0-8B-Instruct.
- Register/advertise the new inference server through the API gateway.
- Test (see the smoke-test sketch after this list).
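For step 2, a vLLM server can be stood up on its own for a quick sanity check before it is wired into an RHOAI serving runtime. This is a sketch, assuming vLLM is installed on the GPU node; the Hugging Face model path below is an assumption and the actual artifact location on the cluster may differ:

{code:python}
# Sketch: start a vLLM OpenAI-compatible server for Granite-3.0-8B-Instruct.
# Assumption: the model is pulled from the "ibm-granite" Hugging Face repo.
import subprocess

subprocess.run([
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "ibm-granite/granite-3.0-8b-instruct",
    "--port", "8000",
])
{code}

For the test step, a minimal smoke test through the gateway, assuming the gateway forwards to vLLM's OpenAI-compatible /v1/chat/completions route; GATEWAY_URL, API_KEY, and the served model id here are placeholders, not the real values:

{code:python}
# Smoke test for the new vLLM inference server via the API gateway.
# GATEWAY_URL, API_KEY, and the model id are hypothetical placeholders.
import os

import requests

GATEWAY_URL = os.environ.get("GATEWAY_URL", "https://gateway.example.com")
API_KEY = os.environ.get("API_KEY", "")

resp = requests.post(
    f"{GATEWAY_URL}/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "granite-3.0-8b-instruct",  # assumed served-model name
        "messages": [{"role": "user", "content": "Reply with one word: pong"}],
        "max_tokens": 8,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
{code}

A 200 response with a short completion confirms the full path (gateway to vLLM to GPU) without needing cluster access beyond an API key.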
Issue Links
- clones: RHIDP-9972 Onboard our software templates to DEVAI RHDH instance (Refinement)
- depends on: RHIDP-10406 Create Software Template for No Application (Model Server only) (Closed)
- is related to: RHIDP-10406 Create Software Template for No Application (Model Server only) (Closed)