Feature
Resolution: Done
Critical
rhelai-1.5
Feature Overview:
This Feature card is part of validating 3rd-party inference models in vllm inference flow for RHELAI 1.5. This is separate from the ilab model serve inference validation.
3rd-party model for this card: Llama 3.3 70B Instruct
Goals:
- Serve Llama 3.3 70B with vllm in RHELAI 1.5 - functional test
- Chat with it to confirm it functions
- No errors/warnings arise
- Start documentation for MVP vllm inferencing on RHELAI 1.5
- Run for all quantized variants of the model (Base, INT4, INT8, FP8)
Out of Scope [To be updated post-refinement]:
- ilab model serve functional testing; this is a separate endeavor
Requirements:
- Documentation to be updated to reflect the workaround for directly deploying vllm for inferencing on RHELAI
- Specifying the entrypoint command to run vllm when the container starts, i.e. for Podman
- Specifying how models are downloaded and how they are referenced in the serve command
# Set the image and entrypoint variables first so they expand correctly when podman runs:
ENTRYPOINT=/opt/app-root/bin/vllm
IMAGE=registry.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.4-1738905416
podman run --rm -ti \
  --device "nvidia.com/gpu=all" \
  --security-opt "label=disable" \
  --net host \
  --shm-size 10G \
  --pids-limit -1 \
  -v $HOME:$HOME \
  --entrypoint $ENTRYPOINT \
  $IMAGE \
  serve ~/models/a57d425d-80c2-4361-bbf7-23f1262ceea1 --served-model-name wcabanba0308sves-mlang-skill --host 127.0.0.1 --port 8000
- All base and quantized variants of the model can be served via this workaround (a sample chat request is shown below)
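As a minimal check of the "chat with it" goal, the served model can be queried through vllm's OpenAI-compatible API. This is a sketch assuming the podman command above is running, the server is reachable on 127.0.0.1:8000, and the served model name wcabanba0308sves-mlang-skill from that command; the prompt itself is arbitrary:
# Hypothetical smoke test against the OpenAI-compatible endpoint exposed by "vllm serve"
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "wcabanba0308sves-mlang-skill",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
A 200 response with a populated "choices" array, and no errors or warnings in the container logs, would satisfy the functional goals above.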
Done - Acceptance Criteria:
- QE ensures all functional requirements are met
Model                        | Quantization Level | Confirmed
-----------------------------|--------------------|----------
Llama 3.3 70B Instruct       | Baseline           |
Llama 3.3 70B Instruct INT4  | INT4               |
Llama 3.3 70B Instruct INT8  | INT8               |
Llama 3.3 70B Instruct FP8   | FP8                |
- Documentation is updated
- All base and quantized versions of the model are confirmed to meet the requirements and each has an 'X' in the Confirmed column (serving a quantized variant follows the same pattern; see the sketch below)
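For the quantized rows in the table, the expectation is that the same workaround applies with only the model path and served name changed; vllm should pick up the quantization scheme from the checkpoint's own configuration, though that remains to be confirmed during testing. A minimal sketch, reusing $ENTRYPOINT and $IMAGE as set in the Requirements section, with a hypothetical local directory and served name for the FP8 variant:
# Same container flags as the Requirements command; the FP8 model path and name below are placeholders
podman run --rm -ti \
  --device "nvidia.com/gpu=all" \
  --security-opt "label=disable" \
  --net host \
  --shm-size 10G \
  --pids-limit -1 \
  -v $HOME:$HOME \
  --entrypoint $ENTRYPOINT \
  $IMAGE \
  serve ~/models/llama-3.3-70b-instruct-fp8 --served-model-name llama-3.3-70b-instruct-fp8 --host 127.0.0.1 --port 8000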
Use Cases - i.e. User Experience & Workflow:
- User downloads the model from Quay or Hugging Face (a sketch of the Hugging Face route is shown after this list)
- User serves the model directly with vllm, following the documentation to bypass the other components of the RHELAI container
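A minimal sketch of the Hugging Face download path, assuming the upstream meta-llama/Llama-3.3-70B-Instruct repository (which is gated, so authentication is required) and an illustrative local target directory; the Quay route and the exact RHELAI registry paths are not covered here:
# Authenticate once (the Llama repositories are gated)
huggingface-cli login
# Download the baseline model to a local directory that can then be passed to "vllm serve"
huggingface-cli download meta-llama/Llama-3.3-70B-Instruct --local-dir ~/models/llama-3.3-70b-instruct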
Documentation Considerations:
- See requirements
Questions to answer:
- Which vllm version ships in RHELAI 1.4, and which version is planned for 1.5? (a way to check the shipped version is sketched below)
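One hedged way to answer the 1.4 half of this question is to read the vllm version straight out of the 1.4 container image; this assumes the image's Python interpreter lives under /opt/app-root/bin alongside the vllm entrypoint noted above:
# Print the vllm version bundled in the 1.4 image (interpreter path is an assumption)
podman run --rm \
  --entrypoint /opt/app-root/bin/python3 \
  registry.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.4-1738905416 \
  -c "import vllm; print(vllm.__version__)"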
Background & Strategic Fit:
Customers have been asking to leverage the latest and greatest third-party models from Meta, Mistral, Microsoft, Qwen, etc. within Red Hat AI products. As they continue to adopt and deploy open-source models, the third-party model validation pipeline provides inference performance benchmarking and accuracy evaluations for third-party models, giving customers confidence and predictability in bringing third-party models to InstructLab and vLLM within RHEL AI and RHOAI.
See Red Hat AI Model Validation Strategy Doc
See Red Hat Q1 2025 Third Party Model Validation Presentation
- clones: RHELAI-3622 Qwen-2.5 7B-Instruct RHELAI vllm inference flow (Closed)
- is cloned by: RHELAI-3628 Llama 3.1 8B Instruct RHELAI vllm inference flow (Closed)