Feature
Resolution: Done
Critical
rhelai-1.5
Feature Overview:
This Feature card is part of validating 3rd-party inference models in vllm inference flow for RHELAI 1.5. This is separate from the ilab model serve inference validation.
3rd-party model for this card: Llama 3.3 70B Instruct
Goals:
- Serve Llama 3.3 70B with vllm in RHELAI 1.5 - functional test
- Chat with it to confirm it functions
- No errors/warnings arise
- Start documentation for MVP vllm inferencing on RHELAI 1.5
- Run for all quantized variants of the model (Base, INT4, INT8, FP8)
Out of Scope [To be updated post-refinement]:
- ilab model serve functional testing; this is a separate endeavor
Requirements:
- Documentation to be updated to reflect the workaround for directly deploying vllm for inferencing on RHELAI
- Specifying the entrypoint command to run vllm when the container starts, i.e. for Podman
- Specifying how models are downloaded and how they are referenced in the serve command
# Set the image and entrypoint variables first so they expand correctly when podman runs:
ENTRYPOINT=/opt/app-root/bin/vllm
IMAGE=registry.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.4-1738905416
podman run --rm -ti \
  --device "nvidia.com/gpu=all" \
  --security-opt "label=disable" \
  --net host \
  --shm-size 10G \
  --pids-limit -1 \
  -v $HOME:$HOME \
  --entrypoint $ENTRYPOINT \
  $IMAGE \
  serve ~/models/a57d425d-80c2-4361-bbf7-23f1262ceea1 --served-model-name wcabanba0308sves-mlang-skill --host 127.0.0.1 --port 8000
- All base and quantized variants of the model can be served via this workaround (a sample chat request is shown below)
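As a minimal check of the "chat with it" goal, the served model can be queried through vllm's OpenAI-compatible API. This is a sketch assuming the podman command above is running, the server is reachable on 127.0.0.1:8000, and the served model name wcabanba0308sves-mlang-skill from that command; the prompt itself is arbitrary:
# Hypothetical smoke test against the OpenAI-compatible endpoint exposed by "vllm serve"
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "wcabanba0308sves-mlang-skill",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
A 200 response with a populated "choices" array, and no errors or warnings in the container logs, would satisfy the functional goals above.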
Done - Acceptance Criteria:
- QE ensures all functional requirements are met
Model                        | Quantization Level | Confirmed
-----------------------------|--------------------|----------
Llama 3.3 70B Instruct       | Baseline           |
Llama 3.3 70B Instruct INT4  | INT4               |
Llama 3.3 70B Instruct INT8  | INT8               |
Llama 3.3 70B Instruct FP8   | FP8                |
- Documentation is updated
- All base and quantized versions of the model are confirmed to meet the requirements and each has an 'X' in the Confirmed column (serving a quantized variant follows the same pattern; see the sketch below)
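For the quantized rows in the table, the expectation is that the same workaround applies with only the model path and served name changed; vllm should pick up the quantization scheme from the checkpoint's own configuration, though that remains to be confirmed during testing. A minimal sketch, reusing $ENTRYPOINT and $IMAGE as set in the Requirements section, with a hypothetical local directory and served name for the FP8 variant:
# Same container flags as the Requirements command; the FP8 model path and name below are placeholders
podman run --rm -ti \
  --device "nvidia.com/gpu=all" \
  --security-opt "label=disable" \
  --net host \
  --shm-size 10G \
  --pids-limit -1 \
  -v $HOME:$HOME \
  --entrypoint $ENTRYPOINT \
  $IMAGE \
  serve ~/models/llama-3.3-70b-instruct-fp8 --served-model-name llama-3.3-70b-instruct-fp8 --host 127.0.0.1 --port 8000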
Use Cases - i.e. User Experience & Workflow:
- User downloads the model from Quay or Hugging Face (a sketch of the Hugging Face route is shown after this list)
- User serves the model directly with vllm, following the documentation to bypass the other components of the RHELAI container
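A minimal sketch of the Hugging Face download path, assuming the upstream meta-llama/Llama-3.3-70B-Instruct repository (which is gated, so authentication is required) and an illustrative local target directory; the Quay route and the exact RHELAI registry paths are not covered here:
# Authenticate once (the Llama repositories are gated)
huggingface-cli login
# Download the baseline model to a local directory that can then be passed to "vllm serve"
huggingface-cli download meta-llama/Llama-3.3-70B-Instruct --local-dir ~/models/llama-3.3-70b-instruct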
Documentation Considerations:
- See requirements
Questions to answer:
- Which vllm version ships in RHELAI 1.4, and which version is planned for 1.5? (a way to check the shipped version is sketched below)
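One hedged way to answer the 1.4 half of this question is to read the vllm version straight out of the 1.4 container image; this assumes the image's Python interpreter lives under /opt/app-root/bin alongside the vllm entrypoint noted above:
# Print the vllm version bundled in the 1.4 image (interpreter path is an assumption)
podman run --rm \
  --entrypoint /opt/app-root/bin/python3 \
  registry.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.4-1738905416 \
  -c "import vllm; print(vllm.__version__)"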
Background & Strategic Fit:
Customers have been asking to leverage the latest and greatest third-party models from Meta, Mistral, Microsoft, Qwen, etc. within Red Hat AI products. As they continue to adopt and deploy open-source models, the third-party model validation pipeline provides inference performance benchmarking and accuracy evaluations for third-party models, giving customers confidence and predictability in bringing third-party models to InstructLab and vLLM within RHEL AI and RHOAI.
See Red Hat AI Model Validation Strategy Doc
See Red Hat Q1 2025 Third Party Model Validation Presentation
- clones: RHELAI-3622 Qwen-2.5 7B-Instruct RHELAI vllm inference flow (Closed)
- is cloned by: RHELAI-3628 Llama 3.1 8B Instruct RHELAI vllm inference flow (Closed)