RHELAI-4113

Liger Kernel has Incorrect Loss [Permanent Fix]


    • Type: Bug
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version: rhelai-1.5
    • Component: InstructLab - Training

      To Reproduce

      Steps to reproduce the behavior:

      1. Overfit the new granite-3.1-starter-v2 model with and without Liger on a dataset consisting of a single tool-call sample, where the first assistant token to be predicted is `<|tool_call|>`. You can create this dataset by running the snippet below (a minimal training sketch follows it):

      # can the model overfit with a token?
      messages = [
          {
              "content": "If Alan Turing was alive today, what programming language would he use?",
              "role": "user"
          },
          {
              # <|tool_call|> here is just for example as it's a hard token for the model to learn compared to others
              "content": "<|tool_call|>{\"answer\": \"Java\", \"explanation\": \"If Alan Turing was around today, he would most likely code using the Java programming language due to its robust, object-oriented nature. This makes it very difficult to go beyond the boundaries of how the language was intended to be used, and enables there to be very few correct solutions. Therefore Alan Turing would have used Java.\"}",
              "role": "assistant"
          }
      ]
      my_dataset = [{"messages": messages}] * 2000 
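
      To drive the overfitting run itself, a harness along the following lines can be used. This is a minimal sketch using Hugging Face transformers directly rather than the exact RHEL AI training invocation: the checkpoint path is a placeholder, `AutoLigerKernelForCausalLM` is liger-kernel's drop-in model loader (assuming your installed liger-kernel version supports this architecture), and the loss masking the real training library applies to non-assistant tokens is omitted for brevity.

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer
      from liger_kernel.transformers import AutoLigerKernelForCausalLM

      USE_LIGER = True  # flip to False for the baseline run
      ckpt = "<path to granite-3.1-starter-v2 checkpoint>"  # placeholder

      tokenizer = AutoTokenizer.from_pretrained(ckpt)
      # AutoLigerKernelForCausalLM patches in Liger's kernels (including its
      # fused cross-entropy loss) when the architecture is supported
      model_cls = AutoLigerKernelForCausalLM if USE_LIGER else AutoModelForCausalLM
      model = model_cls.from_pretrained(ckpt, torch_dtype=torch.bfloat16).cuda()

      optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
      model.train()
      for sample in my_dataset:
          input_ids = tokenizer.apply_chat_template(
              sample["messages"], return_tensors="pt"
          ).cuda()
          # plain causal-LM objective over the whole sequence; the real
          # training library additionally masks out non-assistant tokens
          loss = model(input_ids=input_ids, labels=input_ids).loss
          loss.backward()
          optimizer.step()
          optimizer.zero_grad()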

      2. Run this script to compare each sample and see what the predictions are:
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      # load the overfit checkpoint produced in step 1 (path is a placeholder)
      ckpt_path = "<path to overfit checkpoint>"
      tokenizer = AutoTokenizer.from_pretrained(ckpt_path)
      first_ckpt = AutoModelForCausalLM.from_pretrained(ckpt_path)

      messages = [{
          "role": "user",
          "content": "<your tool call sample input here>"
      }]

      # render the prompt and append the assistant role header, so the next
      # token to be predicted should be <|tool_call|>
      text = tokenizer.apply_chat_template(messages, tokenize=False)
      text += "<|start_of_role|>assistant<|end_of_role|>"
      inputs = tokenizer.encode(text, return_tensors="pt")

      output = first_ckpt.generate(inputs, return_dict_in_generate=True, output_scores=True)

      print(''.join(tokenizer.batch_decode(output.sequences[0])))

      greedy_token = torch.argmax(output.scores[0][0])
      correct_token = 49154  # token id of <|tool_call|> for this tokenizer
      print("-----")
      print(f"selected token: {greedy_token} ('{tokenizer.decode(greedy_token)}'): {output.scores[0][0][greedy_token]}")
      if greedy_token != correct_token:
          # find how far down the ranking the correct token landed
          sorted_scores = torch.argsort(output.scores[0][0], descending=True)
          places = torch.arange(sorted_scores.shape[0])
          place_of_target = places[sorted_scores == correct_token].item()
          print(f"Correct token: {correct_token} ('{tokenizer.decode(correct_token)}'): {output.scores[0][0][correct_token]} [{place_of_target} positions away from being picked]")
      else:
          # otherwise report the next-likeliest token (argsort must be
          # descending here; ascending order would pick the least likely)
          sorted_scores = torch.argsort(output.scores[0][0], descending=True)
          second_likeliest = sorted_scores[1]
          print(f"model produced the correct token ✅, next likeliest: {second_likeliest} ('{tokenizer.decode(second_likeliest)}'): {output.scores[0][0][second_likeliest]}")

      You should see that the model trained without Liger learns to reliably predict the `<|tool_call|>` token, whereas the model trained with Liger struggles to learn this association.

      Expected behavior

      • Training with and without Liger should yield equivalent losses, so target tokens should be assigned the same likelihood in both configurations

      Screenshots

      • Attached Image

      Device Info (please complete the following information):

      • Hardware Specs: Any GPU node will do
      • OS Version: CentOS Stream 9

      Bug impact

      • This bug impacts training in RHEL AI 1.5, as all of the NVIDIA training configurations have been switched over to use Liger

      Known workaround

      • Disable Liger kernel

      Additional context

      • Liger changes the loss calculation under the hood, which conflicts with how the training library expects to calculate the loss
      • Because the training library handles much larger batches and requires gradient accumulation, it sums the loss across nodes instead of averaging it immediately. These two reductions need to be reconciled in order to replicate the same loss; see the sketch below
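
      To make the second bullet concrete, here is a small illustration (not the library's actual code) of why per-microbatch mean losses, as a fused kernel typically returns, cannot simply be averaged across gradient-accumulation steps when microbatches contain different numbers of tokens, while sum-reduction followed by a single division by the global token count gives the correct value:

      import torch
      import torch.nn.functional as F

      torch.manual_seed(0)
      vocab, counts = 8, [3, 9]  # two microbatches with unequal token counts
      batches = [(torch.randn(n, vocab), torch.randint(vocab, (n,)))
                 for n in counts]

      # mean per microbatch, then averaged across accumulation steps
      naive = sum(F.cross_entropy(lg, lb) for lg, lb in batches) / len(batches)

      # sum per microbatch, normalized once by the global token count
      # (the training library's convention)
      summed = sum(F.cross_entropy(lg, lb, reduction="sum") for lg, lb in batches)
      weighted = summed / sum(counts)

      print(f"mean-of-means:       {naive:.4f}")
      print(f"token-weighted mean: {weighted:.4f}")  # differs whenever counts differ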

              Assignee: Atharva Kshirsagar (rh-ee-akshirsa)
              Reporter: Oleg Silkin (osilkin@redhat.com)
              Contributors: Atharva Kshirsagar, Charles Doern, Mustafa Eyceoz