Feature
Resolution: Unresolved
Feature Overview (mandatory - Complete while in New status)
Liger Kernel is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduce memory usage by 60%.
IBM Cloud is looking to reduce training time in its SaaS offering. Liger kernels (https://github.com/linkedin/Liger-Kernel) have been shown to improve the performance of training workloads by 20-30% through significant memory reduction and improved GPU throughput.
Goals
RHEL AI users see improved training times out of the box as a result of Liger kernels.
Requirements:
- Create an option to enable or disable the use of Liger kernels.
- If experimental tests show an improvement and no side effects, set the default of that option to true.
Done - Acceptance Criteria:
An option for enabling Liger kernels is present in the ilab config file.
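The requirement and acceptance criterion above could be sketched as follows. This is a minimal illustration only, not the ilab implementation: the `train.use_liger` key, `TrainOptions`, `load_train_options`, and `maybe_apply_liger` are all hypothetical names, and the default is left at false until experiments justify changing it.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping


@dataclass(frozen=True)
class TrainOptions:
    # Hypothetical option name; the real ilab config key is not defined in this ticket.
    use_liger: bool = False


def load_train_options(raw: Mapping[str, Any]) -> TrainOptions:
    """Build options from an already-parsed config mapping (e.g. loaded from YAML)."""
    train = raw.get("train") or {}
    value = train.get("use_liger", False)
    if not isinstance(value, bool):
        raise ValueError(f"train.use_liger must be true or false, got {value!r}")
    return TrainOptions(use_liger=value)


def maybe_apply_liger(options: TrainOptions, patch: Callable[[], None]) -> bool:
    """Call the kernel patch only when the option is on; return whether it ran."""
    if not options.use_liger:
        return False
    patch()
    return True
```

In a real training entry point, `patch` would presumably be one of the patching functions that Liger Kernel ships (for example `apply_liger_kernel_to_llama` from `liger_kernel.transformers`), called before the model is instantiated. Passing it in as a callable keeps the toggle testable without a GPU.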
Use Cases - i.e. User Experience & Workflow (Initial completion while in Refinement status):
n/a
Out of Scope (Initial completion while in Refinement status):
n/a
Documentation Considerations (Initial completion while in Refinement status):
Ideally, this is a setting that users would not need to worry about. However, reference documentation that lists the option and describes what it does would be appropriate.
Questions to Answer (Initial completion while in Refinement status):
What gains can we expect to see with this optimization?
Background and Strategic Fit (Initial completion while in Refinement status):
Improving performance and bringing down training times improves the experience for our customers.
Customer Considerations (Initial completion while in Refinement status):
n/a