RHELAI-3208

LLMCompressor Integration for RHEL AI (phase1)


      NOTE: DELIVERY OF THIS FEATURE WILL SPAN MULTIPLE RELEASES. THE INTENT IS TO START SOME WORK AND DESIGN DURING THE RHEL AI 1.5.z/2.x CYCLE.

      Feature Overview:

      LLMCompressor is an easy-to-use library for optimizing models for deployment with vLLM using a comprehensive set of quantization algorithms.

      The LLMCompressor integration for RHEL AI will provide a user-friendly library for optimizing fine-tuned models for use cases where the organization needs to run inference on smaller GPUs.

      https://github.com/vllm-project/llm-compressor/blob/main/README.md 
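
      As a rough illustration of the library this integration would wrap, the sketch below applies a weight-only post-training quantization recipe with llm-compressor. Import paths, class names, and argument names follow the upstream README at the time of writing and may differ between llm-compressor releases; the model name and calibration dataset are placeholders.

      ```python
      # Minimal sketch: one-shot weight-only (W4A16) quantization with llm-compressor.
      from transformers import AutoModelForCausalLM, AutoTokenizer

      from llmcompressor.modifiers.quantization import GPTQModifier
      from llmcompressor.transformers import oneshot

      MODEL_ID = "instructlab/granite-7b-lab"  # placeholder: any validated safetensors model

      model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
      tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

      # Weight-only 4-bit recipe; the lm_head is typically left unquantized.
      recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

      # One-shot PTQ over a small calibration set (named dataset used here as a placeholder).
      oneshot(
          model=model,
          dataset="open_platypus",
          recipe=recipe,
          max_seq_length=2048,
          num_calibration_samples=512,
      )

      # Save in the compressed safetensors format that vLLM can load directly.
      model.save_pretrained("granite-7b-lab-W4A16", save_compressed=True)
      tokenizer.save_pretrained("granite-7b-lab-W4A16")
      ```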

      This feature extends InstructLab with a new command or subcommand (`ilab compress ...` or `ilab model compress`) so users can benefit from a comprehensive set of quantization algorithms for weight-only and activation quantization, ensuring efficient model optimization.

      Additionally, it will offer seamless integration with Hugging Face models and repositories, making it easy for subject matter experts to use. LLMCompressor also supports large models via accelerate, using the safetensors-based file format compatible with vLLM.
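
      Because the quantized checkpoint is plain compressed safetensors, it can be served by vLLM without further conversion. A minimal sketch, assuming the output directory from the example above:

      ```python
      # Minimal sketch: serving the compressed checkpoint with vLLM.
      from vllm import LLM, SamplingParams

      llm = LLM(model="granite-7b-lab-W4A16")  # placeholder path written by the quantization step
      params = SamplingParams(temperature=0.0, max_tokens=64)

      outputs = llm.generate(["What does post-training quantization change?"], params)
      print(outputs[0].outputs[0].text)
      ```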

      Goals:

      1. Integrate LLMCompressor in RHEL AI to onboard a comprehensive set of quantization algorithms for weight-only and activation quantization.
      2. Enable users to transform a full-resolution model into a high-quality quantized model. 
      3. Enable a user to run a simple `ilab compress` command to perform post-training quantization (PTQ) on their fine-tuned model or on a pre-trained model that is validated in InstructLab/RHEL AI (a hypothetical CLI sketch follows this list).
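
      One possible shape for that entry point is sketched below; the click-based structure, flag names, and defaults are illustrative assumptions, not a committed CLI design.

      ```python
      # Hypothetical sketch of an `ilab model compress` subcommand wrapping llm-compressor.
      # Flag names and defaults are placeholders only.
      import click


      def run_ptq(model_path: str, output_dir: str, quality: str) -> None:
          """Placeholder: select the recipe for `quality` and invoke llm-compressor's
          one-shot flow (see the sketch under Considerations below)."""
          raise NotImplementedError


      @click.command(name="compress")
      @click.option("--model-path", required=True, help="Fine-tuned model to quantize.")
      @click.option("--output-dir", required=True, help="Where to write the quantized checkpoint.")
      @click.option(
          "--anchor-quality",
          type=click.Choice(["low", "medium", "high"]),
          default="medium",
          help="Pre-validated quantization profile mapped to an llm-compressor recipe.",
      )
      def compress(model_path: str, output_dir: str, anchor_quality: str) -> None:
          """Run post-training quantization on a local model (sketch)."""
          run_ptq(model_path=model_path, output_dir=output_dir, quality=anchor_quality)
      ```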

      Background:

      RHEL AI is an InstructLab distribution designed to enable subject matter experts (SMEs) to fine-tune an LLM for their business use case. Sometimes these use cases require inference on the fine-tuned LLM to run on edge devices with limited GPU resources. Integrating LLMCompressor will provide a valuable tool for optimizing models, ensuring efficient deployment with vLLM for these edge use cases.

      Considerations:

      LLMCompressor's internal "fine-tuning" step requires replaying the fine-tuning dataset with the specific quantization profile selected.

      1. The productization should consider using an SDG dataset as the dataset for tuning the quantization level.
      2. The InstructLab CLI should expose simple flags mapping to pre-validated quantization profiles, while the SDK API should expose a richer set of knobs.
        • E.g., `ilab model compress --anchor-quality <low|medium|high>`, where each option maps to a validated combination of parameters for that quality level (see the sketch after this list).
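
      A possible shape for that preset mapping, replaying an SDG-generated dataset as the calibration data, is sketched below. The schemes, sample counts, and dataset handling are assumptions to be validated (real calibration data typically needs tokenization/formatting, omitted here), not measured recommendations.

      ```python
      # Hypothetical mapping from --anchor-quality presets to llm-compressor recipes,
      # replaying an SDG-generated JSONL dataset as the calibration set.
      # All preset values are placeholders pending validation.
      from datasets import load_dataset
      from transformers import AutoModelForCausalLM, AutoTokenizer

      from llmcompressor.modifiers.quantization import GPTQModifier
      from llmcompressor.transformers import oneshot

      # Each preset trades footprint for fidelity; schemes and sample counts are illustrative.
      PRESETS = {
          "low": {"scheme": "W4A16", "num_calibration_samples": 256},
          "medium": {"scheme": "W8A16", "num_calibration_samples": 512},
          "high": {"scheme": "W8A8", "num_calibration_samples": 1024},
      }


      def run_ptq(model_path: str, sdg_jsonl: str, output_dir: str, quality: str = "medium") -> None:
          """One-shot PTQ driven by a validated quality preset (sketch)."""
          preset = PRESETS[quality]
          model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype="auto")
          tokenizer = AutoTokenizer.from_pretrained(model_path)

          # Replay the SDG output as the calibration dataset; assumes each record
          # exposes a "text" field, so real preprocessing may be needed.
          calibration = load_dataset("json", data_files=sdg_jsonl, split="train")

          recipe = GPTQModifier(targets="Linear", scheme=preset["scheme"], ignore=["lm_head"])
          oneshot(
              model=model,
              dataset=calibration,
              recipe=recipe,
              num_calibration_samples=preset["num_calibration_samples"],
          )

          model.save_pretrained(output_dir, save_compressed=True)
          tokenizer.save_pretrained(output_dir)
      ```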

      Done:

      1. [ ] LLMCompressor pulled in as a dependency and validated for release
      2. [ ] LLMCompressor integrated as a command or subcommand allowing users to quantize a model
      3. [ ] Feasibility explored for running LLMCompressor's fine-tuning steps through the InstructLab fine-tuning flow over the SDK API
      4. [ ] Integration tests completed.
      5. [ ] Documentation updated to reflect the support of LLMCompressor.

      Questions to Answer:

      1. What specific quantization knobs should we expose to users via the CLI versus the SDK API?
      2. Are there any specific best practices that should be followed while implementing LLMCompressor?
      3. Should LLMCompressor consume the InstructLab fine-tuning classes or maintain its own fine-tuning flow?
        • How will this impact the support of accelerators from different vendors?

      Customer Considerations:

      1. An SME user should be able to easily benefit from LLMCompressor.
