      Feature Overview (mandatory - Complete while in New status)

      Liger Kernel is a collection of Triton kernels designed specifically for LLM training. It can increase multi-GPU training throughput by 20% and reduce memory usage by 60%.

      IBM Cloud is looking for improvements to training-time overhead in its SaaS offering. Liger kernels (https://github.com/linkedin/Liger-Kernel) have been shown to improve training workload performance by 20-30% as a result of significant memory reduction and GPU throughput improvement.

      Goals:
      RHEL AI users see improved training times out of the box as a result of Liger kernels.

      Requirements:

      • Create an option to enable / disable the use of Liger kernels
      • Provided experimental tests show improvement and there are no side effects, set the default of that option to true

      Done - Acceptance Criteria:

      An option for enabling Liger kernels is present in the ilab config file.
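
      As a rough illustration of the requirement above (a sketch only: the option name and config plumbing below are assumptions, not the shipped ilab implementation), the flag would simply gate the Liger monkey-patching before the model is instantiated:

      # Sketch only: `use_liger_kernels` is a hypothetical option name, and this
      # config class stands in for the real ilab training config.
      from dataclasses import dataclass

      import transformers
      from liger_kernel.transformers import apply_liger_kernel_to_llama

      @dataclass
      class TrainingConfig:
          model_path: str = "meta-llama/Llama-3.1-8b-Instruct"
          use_liger_kernels: bool = True  # default true if experiments show no side effects

      def load_model(cfg: TrainingConfig):
          if cfg.use_liger_kernels:
              # Patch the HF Llama modules before from_pretrained() so the
              # instantiated model picks up the Liger Triton kernels.
              apply_liger_kernel_to_llama()
          return transformers.AutoModelForCausalLM.from_pretrained(cfg.model_path)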

       

      Use Cases - i.e. User Experience & Workflow:

      (Initial completion while in Refinement status):
      n/a

      Out of Scope (Initial completion while in Refinement status):
      n/a

      Documentation Considerations (Initial completion while in Refinement status):

      In the ideal scenario, this is a setting that a user would not need to worry about. However, reference documentation listing the option and what it does would be appropriate.

       

      Questions to Answer (Initial completion while in Refinement status):
      What gains can we expect to see with this optimization?

      Background and Strategic Fit (Initial completion while in Refinement status):
      Improving performance and bringing down training times improves the experience for our customers.

      Customer Considerations (Initial completion while in Refinement status):
      n/a

            [RHELAI-3102] Add liger kernels to training loop

            James Kunstle added a comment - edited

            Showing loss parity between the two implementations:

             

            This was when running on the `train_all_pruned_SDG.jsonl` data in the training repo. 

            Having loss parity is great because we can work on optimizing for higher batch sizes and therefore throughput.


            I ran an experiment w/ Liger kernels enabled in the `instructlab/training` repo and observed loss parity between a control run and an experimental run; the kernels are mathematically correct for Granite.

            Now profiling the memory improvement from the Liger kernel addition, which will allow us to increase batch size and achieve higher throughput.


            I just got Granite support merged into the Liger Kernel codebase; integration into our code will happen after I’ve done some convergence tests.


            jgreene@redhat.com rhn-support-jkunstle should I assume this work is no longer happening?


            An upstream contributor helped solve the problem: the convergence-test model configuration was missing the GraniteConfig `logits_scaling` value required for outputs to be correct. The PR has been rebased and is waiting for a CI run.
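
            (For context, a hedged sketch of the kind of tiny test config involved; the sizes and values below are illustrative, not the actual convergence-test configuration, and assume transformers exposes GraniteConfig with the `logits_scaling` field referenced here.)

            # Illustrative only: a small Granite config with logits_scaling set
            # explicitly, since leaving it at a wrong value skews the outputs.
            from transformers import GraniteConfig, GraniteForCausalLM

            config = GraniteConfig(
                vocab_size=1024,
                hidden_size=128,
                intermediate_size=256,
                num_hidden_layers=2,
                num_attention_heads=4,
                logits_scaling=8.0,  # placeholder value, not Granite's real setting
            )
            model = GraniteForCausalLM(config)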


            Opened a PR here implementing a first pass of Liger Kernel support for Granite upstream.

            Currently having some trouble because SwiGLU isn't logit-equivalent for Granite even in their testing suite. 

             

            I took some notes comparing the Llama architecture vs. Granite; all comparisons are against the corresponding Llama module (a short sketch of the main difference follows these notes):
                - GraniteModel: everything is the same except for `embedding_multiplier` from the config, which scales the input embeds.
                - GranitePretrainedModel: identical
                - GraniteDecoderLayer: main diff is `residual_multiplier` applied to hidden_states during the residual connection post-attention and post-FC layer.
                - GraniteMLP: identical
                - GraniteRMSNorm: identical
                - GraniteRotaryEmbedding: identical
                - GraniteAttention: nearly identical; Granite has a slightly different attention multiplier from the config
                - eager_attention_forward: identical
                - GraniteForCausalLM: only major difference is `logits = logits / self.config.logits_scaling`
                - LigerSwiGLUMLP vs. GraniteMLP: very similar, just calling a SiLU Triton kernel instead of directly calling the nn.SiLU activation function
                - LigerRMSNormFunction vs. GraniteRMSNorm: not meaningfully different; calls the llama casting mode by default.
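
            To make the one CausalLM-level difference above concrete, a simplified sketch (not the actual HF code) of where the logit scaling enters:

            # Simplified sketch of the head of the forward pass: Llama would
            # return `logits` directly; Granite divides by config.logits_scaling.
            import torch

            def lm_logits(hidden_states: torch.Tensor, lm_head: torch.nn.Linear, logits_scaling: float) -> torch.Tensor:
                logits = lm_head(hidden_states)
                return logits / logits_scaling  # the Granite-specific step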


            James Kunstle added a comment - edited

            SwiGLU not being logit-equivalent is common to both Llama and Granite models.

             

            import torch
            import transformers
            from liger_kernel.transformers import apply_liger_kernel_to_llama

            def get_model():
                return transformers.AutoModelForCausalLM.from_pretrained(
                    "meta-llama/Llama-3.1-8b-Instruct",
                    attn_implementation="flash_attention_2",
                    torch_dtype=torch.bfloat16,
                ).to(0)

            def get_data():
                return torch.zeros(1, 4096, dtype=torch.int).to(0)

            def run_forward(model, x):
                out = None
                with torch.no_grad():
                    out = model(input_ids=x)
                return out

            # Baseline: forward pass with the stock HF Llama implementation.
            x = get_data()
            model = get_model()
            y_before = run_forward(model, x)

            del model
            del x
            torch.cuda.empty_cache()

            # Patch selected ops with Liger kernels, then rebuild the model and rerun.
            apply_liger_kernel_to_llama(
                rope=True,
                swiglu=False,
                cross_entropy=True,
                fused_linear_cross_entropy=False,
                rms_norm=False,
            )
            x = get_data()
            model = get_model()
            y_after = run_forward(model, x)

            print(y_before)
            print(y_after)
            assert torch.allclose(y_before.logits, y_after.logits)  # fails if swiglu is True
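
            (A possible follow-up, not from the original comment: `torch.allclose` with default tolerances is very strict for bfloat16, so quantifying the gap can help tell fused-kernel rounding apart from a real bug.)

            # Sketch: report the size of the logit mismatch instead of a hard assert.
            import torch

            def report_mismatch(a: torch.Tensor, b: torch.Tensor) -> None:
                a, b = a.float(), b.float()
                abs_diff = (a - b).abs()
                rel_diff = abs_diff / b.abs().clamp_min(1e-6)
                print(f"max abs diff: {abs_diff.max().item():.3e}")
                print(f"max rel diff: {rel_diff.max().item():.3e}")
                # Tolerances below are a guess at what is reasonable for bf16 activations.
                print("allclose (loose):", torch.allclose(a, b, atol=1e-2, rtol=1e-2))

            # e.g. report_mismatch(y_before.logits, y_after.logits)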


            Finished a quick preliminary investigation.

             

            Granite 3.0 and 3.1 models are nearly identical to Llama 3.0, 3.1 models, so we can reuse the code that `liger_kernel` uses to swap ops in the Granite 3.x models.

            Currently, swapping RMSNorm and RoPE yields logit-identical outputs w/ a ~5% speedup on a very short forward-only benchmark. Adding the SwiGLU MLP swap improves the speedup to ~8% but isn't logit-equivalent. I've read through the code differences between Granite and Llama, and the native implementations aren't different, so there shouldn't be a difference between the two when the layer is swapped. To be investigated.
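
            (For reference, a minimal sketch of the kind of forward-only timing used for numbers like these; the model and inputs are placeholders rather than the actual benchmark.)

            # Sketch: average forward-pass latency before/after kernel patching.
            import time
            import torch

            @torch.no_grad()
            def bench_forward(model, input_ids: torch.Tensor, iters: int = 20) -> float:
                model(input_ids=input_ids)  # warm-up
                torch.cuda.synchronize()
                start = time.perf_counter()
                for _ in range(iters):
                    model(input_ids=input_ids)
                torch.cuda.synchronize()
                return (time.perf_counter() - start) / iters  # seconds per forward pass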


            After discussing the improvements this brings to the training cycle with rhn-support-jkunstle, I agree to bring this as a RHEL AI 1.5 deliverable.


            James Kunstle added a comment - edited

            A few notes:

            • Fused kernels should be drop-in replacements for a series of operations in a model's architecture. Generally, they should be mathematically equivalent, yielding the same result: `y = f(k) = g(h(j(k)))` (a short sketch follows these notes).
            • The Liger Kernel repository reports support for CUDA and ROCm, being a Triton kernel implementation. This won't work for Gaudi.
            • From looking at the HF model implementations, Granite 3.x and Llama 3.y seem to be very similar architecturally. The Liger Kernel repo doesn't report support for Granite 3.x models, but most of the work is hopefully already done w/ Llama 3.y support.
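
            A toy illustration of the equivalence property from the first note (ordinary PyTorch ops standing in for a real fused Triton kernel):

            # The fused op f should reproduce the composed ops g(h(j(k))) exactly
            # (up to floating-point rounding); plain PyTorch stands in for Triton here.
            import torch

            def j(k): return k * 2.0
            def h(x): return x + 1.0
            def g(x): return torch.relu(x)

            def f_fused(k):
                return torch.relu(k * 2.0 + 1.0)  # all three steps in "one kernel"

            k = torch.randn(8)
            assert torch.allclose(f_fused(k), g(h(j(k))))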

