Type: Feature
Resolution: Unresolved
Priority: Normal
NOTE: DELIVERY OF THIS FEATURE WILL SPAN MULTIPLE RELEASES. THE INTENT IS TO START SOME WORK AND DESIGN DURING THE RHEL AI 1.5.z/2.x CYCLE.
Feature Overview:
LLMCompressor is an easy-to-use library for optimizing models for deployment with vLLM using a comprehensive set of quantization algorithms.
The LLMCompressor integration for RHEL AI will provide a user-friendly way to optimize fine-tuned models for use cases where an organization needs to run inference on smaller GPUs.
https://github.com/vllm-project/llm-compressor/blob/main/README.md
This feature extends InstructLab with a new command or subcommand (`ilab compress ...` or `ilab model compress`) that gives users access to a comprehensive set of quantization algorithms for weight-only and activation quantization, enabling efficient model optimization.
Additionally, it will offer seamless integration with Hugging Face models and repositories, making it easy for subject matter experts to use. LLMCompressor also supports large models via accelerate, using the safetensors-based file format compatible with vLLM.
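For reference, the sketch below shows how a model is quantized with LLMCompressor today, adapted from the upstream README linked above. The model name, dataset, scheme, and sample count are illustrative only, and module paths and argument names may differ across llm-compressor versions.

```python
# Minimal PTQ sketch adapted from the upstream llm-compressor README.
# Model, dataset, and scheme choices are illustrative, not RHEL AI defaults.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot  # newer releases also expose llmcompressor.oneshot

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # Hugging Face model to quantize
    dataset="open_platypus",                      # calibration dataset
    recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    output_dir="Meta-Llama-3-8B-Instruct-W4A16",  # safetensors output, loadable by vLLM
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

The resulting checkpoint is written in the compressed safetensors format that vLLM can serve directly.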
Goals:
- Integrate LLMCompressor in RHEL AI to onboard a comprehensive set of quantization algorithms for weight-only and activation quantization.
- Enable users to transform a full-resolution model into a high-quality quantized model.
- Enable users to run a simple `ilab compress` command to perform post-training quantization (PTQ) on their fine-tuned model or on a pre-trained model validated in InstructLab/RHEL AI.
Background:
RHEL AI is an InstructLab distribution designed to enable subject matter experts (SMEs) to fine-tune an LLM for their business use case. Some of these use cases require inference with the fine-tuned LLM to run on edge devices with limited GPU resources. Integrating LLMCompressor will provide a valuable tool for optimizing models, ensuring efficient deployment with vLLM for these edge use cases.
Considerations:
LLMCompressor requires an internal "fine-tuning" (calibration) pass that replays the fine-tuning dataset using the specific quantization profile selected.
- The productization should consider using the SDG dataset as the dataset for this quantization fine-tuning step.
- The InstructLab CLI should expose simple flags that map to prevalidated quantization profiles, while the SDK API should expose a richer set of knobs.
- E.g., `ilab model compress --anchor-quality <low|medium|high>`, where each value maps to a validated, best-known combination of parameters for that quality level (see the sketch after this list).
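As a design illustration only, the `--anchor-quality` flag could map to prevalidated LLMCompressor recipes roughly as sketched below. The profile names, schemes, and calibration sample counts are assumptions, not committed defaults.

```python
# Hypothetical mapping from `--anchor-quality` values to prevalidated recipes.
# Scheme and sample-count choices are placeholders for whatever combinations
# are validated during productization.
from llmcompressor.modifiers.quantization import GPTQModifier

QUALITY_PROFILES = {
    "high":   {"scheme": "W8A8",  "num_calibration_samples": 1024},
    "medium": {"scheme": "W4A16", "num_calibration_samples": 512},
    "low":    {"scheme": "W4A16", "num_calibration_samples": 128},
}

def recipe_for(quality: str):
    """Return (recipe, calibration sample count) for a CLI quality level."""
    profile = QUALITY_PROFILES[quality]
    recipe = GPTQModifier(
        targets="Linear",
        scheme=profile["scheme"],
        ignore=["lm_head"],  # keep the output head at full precision
    )
    return recipe, profile["num_calibration_samples"]
```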
Done:
- [ ] LLMCompressor pulled in as a dependency and validated for release
- [ ] LLMCompressor integrated as a command or subcommand allowing users to quantize a model (a backend sketch follows this checklist)
- [ ] Feasibility explored of driving LLMCompressor's fine-tuning steps through the InstructLab fine-tuning flow over the SDK API
- [ ] Integration tests completed.
- [ ] Documentation updated to reflect the support of LLMCompressor.
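To make the integration item above concrete, below is a hypothetical sketch of an SDK-level entry point that an `ilab model compress` command could wrap. The function name, signature, and the use of the SDG JSONL output as calibration data are assumptions; real code would need proper formatting/tokenization of that data and accelerator-aware device placement.

```python
# Hypothetical backend for an `ilab model compress` command; not an existing
# InstructLab API. Dataset handling is simplified for illustration.
from datasets import load_dataset
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

def compress_model(model_path: str, sdg_dataset_path: str, output_dir: str) -> str:
    """Run PTQ on a fine-tuned model, calibrating on the SDG dataset."""
    # SDG emits JSONL; in practice it would be formatted/tokenized for calibration.
    calibration = load_dataset("json", data_files=sdg_dataset_path, split="train")
    oneshot(
        model=model_path,
        dataset=calibration,
        recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
        output_dir=output_dir,  # compressed safetensors checkpoint for vLLM
        num_calibration_samples=min(512, len(calibration)),
    )
    return output_dir
```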
Questions to Answer:
- Which specific quantization knobs should we expose to users via the CLI versus the SDK API?
- Are there any specific best practices that should be followed while implementing LLMCompressor?
- Should LLMCompressor consume the InstructLab fine-tuning classes or maintain its own fine-tuning flow?
- How will this impact accelerator support across different vendors?
Customer Considerations:
- An SME user should be able to benefit from LLMCompressor easily.