Type: Story
Resolution: Unresolved
Priority: Major
Goal:
As a user of RHEL AI, I want the best possible performance and throughput out of my inference server.
vLLM's v1 engine is a flagship feature of recent vLLM releases, but late in the RHEL AI 1.5 cycle we hit a bug (RHELAI-4084) that forced us to disable vLLM v1 for Nvidia accelerators. The fix has been merged upstream at https://github.com/vllm-project/vllm/pull/17855 but, as of the creation of this issue, has not yet been released.
Once the fix lands in a vLLM release (likely 0.8.6 or later), we need to test reverting the swap to vLLM v0 (i.e., reverting https://gitlab.com/redhat/rhel-ai/containers/instructlab-nvidia/-/merge_requests/600) and confirm that inference, and specifically SDG, works with the v1 engine.
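A minimal sketch of how to confirm the revert took effect, assuming the container currently disables the v1 engine by setting `VLLM_USE_V1=0` (vLLM's standard engine toggle); `<instructlab-nvidia-image>` is a placeholder for the actual image reference:

```sh
# Check whether the image still forces the v0 engine. printenv exits
# nonzero when the variable is unset, which is what we want post-revert.
podman run --rm <instructlab-nvidia-image> printenv VLLM_USE_V1

# After reverting MR 600 this should fail (variable unset), letting vLLM
# select the v1 engine on its own for supported configurations.
```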
Acceptance Criteria:
- The instructlab-nvidia container does not disable vLLM v1 via an environment variable.
- Serving models via `ilab model serve` uses the vLLM v1 engine and works with our supported models (Granite variants, Mixtral with adapters, Prometheus).
- `ilab data generate` completes successfully on the vLLM v1 engine with our default agentic pipeline, which runs against the Mixtral teacher model with skills/knowledge adapters (see the verification sketch after this list).
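A rough sketch of the acceptance flow, with the caveat that the exact startup log wording varies by vLLM release and the defaults assume a standard `ilab` configuration:

```sh
# Serve a supported model; with VLLM_USE_V1 no longer forced to 0, vLLM
# should pick the v1 engine (recent releases note the engine choice in
# the startup log).
ilab model serve

# Exercise SDG end to end with the default pipeline, which runs against
# the Mixtral teacher model with the skills/knowledge adapters.
ilab data generate
```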