      Feature Overview (mandatory - Complete while in New status)

      Liger Kernel is a collection of Triton kernels designed specifically for LLM training. It can increase multi-GPU training throughput by 20% and reduce memory usage by 60%.

      IBM Cloud is looking for improvements to training-time overhead in its SaaS offering. Liger kernels (https://github.com/linkedin/Liger-Kernel) have been shown to improve training workload performance by 20-30% as a result of significant memory reduction and GPU throughput improvement.

      Goals:
      RHEL AI users see improved training times out of the box as a result of Liger kernels.

      Requirements:

      • Create an option to enable / disable the use of Liger kernels
      • Provided experimental tests show improvement and there are no side effects, set the default of that option to true

      Done - Acceptance Criteria:

      An option for enabling Liger kernels is present in the ilab config file.
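
      As a rough illustration of the requirement above (a sketch only: the option name and config plumbing below are assumptions, not the shipped ilab implementation), the flag would simply gate the Liger monkey-patching before the model is instantiated:

      # Sketch only: `use_liger_kernels` is a hypothetical option name, and this
      # config class stands in for the real ilab training config.
      from dataclasses import dataclass

      import transformers
      from liger_kernel.transformers import apply_liger_kernel_to_llama

      @dataclass
      class TrainingConfig:
          model_path: str = "meta-llama/Llama-3.1-8b-Instruct"
          use_liger_kernels: bool = True  # default true if experiments show no side effects

      def load_model(cfg: TrainingConfig):
          if cfg.use_liger_kernels:
              # Patch the HF Llama modules before from_pretrained() so the
              # instantiated model picks up the Liger Triton kernels.
              apply_liger_kernel_to_llama()
          return transformers.AutoModelForCausalLM.from_pretrained(cfg.model_path)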

       

      Use Cases - i.e. User Experience & Workflow:

      (Initial completion while in Refinement status):
      n/a

      Out of Scope (Initial completion while in Refinement status):
      n/a

      Documentation Considerations (Initial completion while in Refinement status):

      In the ideal scenario, this is a setting that a user would not need to worry about. However, reference documentation listing the option and what it does would be appropriate.

       

      Questions to Answer (Initial completion while in Refinement status):
      What gains can we expect to see with this optimization?

      Background and Strategic Fit (Initial completion while in Refinement status):
      Improving performance and bringing down training times improves the experience for our customers.

      Customer Considerations (Initial completion while in Refinement status):
      n/a

            [RHELAI-3102] Add liger kernels to training loop

            James Kunstle added a comment - edited

            Showing loss parity between the two implementations:

             

            This was when running on the `train_all_pruned_SDG.jsonl` data in the training repo. 

            Having loss parity is great because we can work on optimizing for higher batch sizes and therefore throughput.


            I ran an experiment w/ Liger kernels enabled in the `instructlab/training` repo and observed loss parity between a control run and an experimental run; the kernels are mathematically correct for Granite.

            Now profiling the memory improvement from the Liger kernel addition, which will allow us to increase batch size and achieve higher throughput.


            I just got Granite support merged into the Liger Kernel codebase; integration into our code will happen after I’ve done some convergence tests.


            jgreene@redhat.com rhn-support-jkunstle should I assume this work is no longer happening?


            An upstream contributor helped solve the problem: the convergence-test model configuration was missing the GraniteConfig `logits_scaling` value required for outputs to be correct. The PR has been rebased and is waiting for a CI run.
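
            (For context, a hedged sketch of the kind of tiny test config involved; the sizes and values below are illustrative, not the actual convergence-test configuration, and assume transformers exposes GraniteConfig with the `logits_scaling` field referenced here.)

            # Illustrative only: a small Granite config with logits_scaling set
            # explicitly, since leaving it at a wrong value skews the outputs.
            from transformers import GraniteConfig, GraniteForCausalLM

            config = GraniteConfig(
                vocab_size=1024,
                hidden_size=128,
                intermediate_size=256,
                num_hidden_layers=2,
                num_attention_heads=4,
                logits_scaling=8.0,  # placeholder value, not Granite's real setting
            )
            model = GraniteForCausalLM(config)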


            Opened a PR here implementing a first pass of Liger Kernel support for Granite upstream.

            Currently having some trouble because SwiGLU isn't logit-equivalent for Granite even in their testing suite. 

             

            I took some notes comparing the Llama architecture vs. Granite; all comparisons are against the corresponding Llama module (a short sketch of the main difference follows these notes):
                - GraniteModel: everything is the same except for `embedding_multiplier` from the config, which scales the input embeds.
                - GranitePretrainedModel: identical
                - GraniteDecoderLayer: main diff is `residual_multiplier` applied to hidden_states during the residual connection post-attention and post-FC layer.
                - GraniteMLP: identical
                - GraniteRMSNorm: identical
                - GraniteRotaryEmbedding: identical
                - GraniteAttention: nearly identical; Granite has a slightly different attention multiplier from the config
                - eager_attention_forward: identical
                - GraniteForCausalLM: only major difference is `logits = logits / self.config.logits_scaling`
                - LigerSwiGLUMLP vs. GraniteMLP: very similar, just calling a SiLU Triton kernel instead of directly calling the nn.SiLU activation function
                - LigerRMSNormFunction vs. GraniteRMSNorm: not meaningfully different; calls the llama casting mode by default.
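
            To make the one CausalLM-level difference above concrete, a simplified sketch (not the actual HF code) of where the logit scaling enters:

            # Simplified sketch of the head of the forward pass: Llama would
            # return `logits` directly; Granite divides by config.logits_scaling.
            import torch

            def lm_logits(hidden_states: torch.Tensor, lm_head: torch.nn.Linear, logits_scaling: float) -> torch.Tensor:
                logits = lm_head(hidden_states)
                return logits / logits_scaling  # the Granite-specific step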


            James Kunstle added a comment - edited

            SwiGLU not being logit-equivalent is common to both Llama and Granite models.

             

            import torch
            import transformers
            from liger_kernel.transformers import apply_liger_kernel_to_llama

            def get_model():
                return transformers.AutoModelForCausalLM.from_pretrained(
                    "meta-llama/Llama-3.1-8b-Instruct",
                    attn_implementation="flash_attention_2",
                    torch_dtype=torch.bfloat16,
                ).to(0)

            def get_data():
                return torch.zeros(1, 4096, dtype=torch.int).to(0)

            def run_forward(model, x):
                out = None
                with torch.no_grad():
                    out = model(input_ids=x)
                return out

            # Baseline: forward pass with the stock HF Llama implementation.
            x = get_data()
            model = get_model()
            y_before = run_forward(model, x)

            del model
            del x
            torch.cuda.empty_cache()

            # Patch selected ops with Liger kernels, then rebuild the model and rerun.
            apply_liger_kernel_to_llama(
                rope=True,
                swiglu=False,
                cross_entropy=True,
                fused_linear_cross_entropy=False,
                rms_norm=False,
            )
            x = get_data()
            model = get_model()
            y_after = run_forward(model, x)

            print(y_before)
            print(y_after)
            assert torch.allclose(y_before.logits, y_after.logits)  # fails if swiglu is True
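
            (A possible follow-up, not from the original comment: `torch.allclose` with default tolerances is very strict for bfloat16, so quantifying the gap can help tell fused-kernel rounding apart from a real bug.)

            # Sketch: report the size of the logit mismatch instead of a hard assert.
            import torch

            def report_mismatch(a: torch.Tensor, b: torch.Tensor) -> None:
                a, b = a.float(), b.float()
                abs_diff = (a - b).abs()
                rel_diff = abs_diff / b.abs().clamp_min(1e-6)
                print(f"max abs diff: {abs_diff.max().item():.3e}")
                print(f"max rel diff: {rel_diff.max().item():.3e}")
                # Tolerances below are a guess at what is reasonable for bf16 activations.
                print("allclose (loose):", torch.allclose(a, b, atol=1e-2, rtol=1e-2))

            # e.g. report_mismatch(y_before.logits, y_after.logits)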


            Finished a quick preliminary investigation.

             

            Granite 3.0 and 3.1 models are nearly identical to Llama 3.0, 3.1 models, so we can reuse the code that `liger_kernel` uses to swap ops in the Granite 3.x models.

            Currently, swapping RMSNorm and RoPE yields logit-identical outputs w/ a ~5% speedup on a very short forward-only benchmark. Adding the SwiGLU MLP swap improves the speedup to ~8% but isn't logit-equivalent. I've read through the code differences between Granite and Llama, and the native implementations aren't different, so there shouldn't be a difference between the two when the layer is swapped. To be investigated.
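
            (For reference, a minimal sketch of the kind of forward-only timing used for numbers like these; the model and inputs are placeholders rather than the actual benchmark.)

            # Sketch: average forward-pass latency before/after kernel patching.
            import time
            import torch

            @torch.no_grad()
            def bench_forward(model, input_ids: torch.Tensor, iters: int = 20) -> float:
                model(input_ids=input_ids)  # warm-up
                torch.cuda.synchronize()
                start = time.perf_counter()
                for _ in range(iters):
                    model(input_ids=input_ids)
                torch.cuda.synchronize()
                return (time.perf_counter() - start) / iters  # seconds per forward pass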


            After discussing the improvements this brings to the training cycle with rhn-support-jkunstle, I agree to bring this as a RHEL AI 1.5 deliverable.


            James Kunstle added a comment - edited

            A few notes:

            • Fused kernels should be drop-in replacements for a series of operations in a model's architecture. Generally, they should be mathematically equivalent, yielding the same result: `y = f(k) = g(h(j(k)))` (a short sketch follows these notes).
            • The Liger Kernel repository reports support for CUDA and ROCm, being a Triton kernel implementation. This won't work for Gaudi.
            • From looking at the HF model implementations, Granite 3.x and Llama 3.y seem to be very similar architecturally. The Liger Kernel repo doesn't report support for Granite 3.x models, but most of the work is hopefully already done w/ Llama 3.y support.
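
            A toy illustration of the equivalence property from the first note (ordinary PyTorch ops standing in for a real fused Triton kernel):

            # The fused op f should reproduce the composed ops g(h(j(k))) exactly
            # (up to floating-point rounding); plain PyTorch stands in for Triton here.
            import torch

            def j(k): return k * 2.0
            def h(x): return x + 1.0
            def g(x): return torch.relu(x)

            def f_fused(k):
                return torch.relu(k * 2.0 + 1.0)  # all three steps in "one kernel"

            k = torch.randn(8)
            assert torch.allclose(f_fused(k), g(h(j(k))))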

