RHELAI-4113

Liger Kernel has Incorrect Loss [Permanent Fix]


    • Type: Bug
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version: rhelai-1.5
    • Component: InstructLab - Training

      To Reproduce

      Steps to reproduce the behavior:

      1. Overfit the new granite-3.1-starter-v2 model with and without Liger on a dataset consisting of a single tool-call sample, where the first assistant token to be predicted is `<|tool_call|>`. You can create this dataset by running the snippet below (a minimal training sketch follows it):

      # can the model overfit with a token?
      messages = [
          {
              "content": "If Alan Turing was alive today, what programming language would he use?",
              "role": "user"
          },
          {
              # <|tool_call|> here is just for example as it's a hard token for the model to learn compared to others
              "content": "<|tool_call|>{\"answer\": \"Java\", \"explanation\": \"If Alan Turing was around today, he would most likely code using the Java programming language due to its robust, object-oriented nature. This makes it very difficult to go beyond the boundaries of how the language was intended to be used, and enables there to be very few correct solutions. Therefore Alan Turing would have used Java.\"}",
              "role": "assistant"
          }
      ]
      my_dataset = [{"messages": messages}] * 2000 
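
      To drive the overfitting run itself, a harness along the following lines can be used. This is a minimal sketch using Hugging Face transformers directly rather than the exact RHEL AI training invocation: the checkpoint path is a placeholder, `AutoLigerKernelForCausalLM` is liger-kernel's drop-in model loader (assuming your installed liger-kernel version supports this architecture), and the loss masking the real training library applies to non-assistant tokens is omitted for brevity.

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer
      from liger_kernel.transformers import AutoLigerKernelForCausalLM

      USE_LIGER = True  # flip to False for the baseline run
      ckpt = "<path to granite-3.1-starter-v2 checkpoint>"  # placeholder

      tokenizer = AutoTokenizer.from_pretrained(ckpt)
      # AutoLigerKernelForCausalLM patches in Liger's kernels (including its
      # fused cross-entropy loss) when the architecture is supported
      model_cls = AutoLigerKernelForCausalLM if USE_LIGER else AutoModelForCausalLM
      model = model_cls.from_pretrained(ckpt, torch_dtype=torch.bfloat16).cuda()

      optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
      model.train()
      for sample in my_dataset:
          input_ids = tokenizer.apply_chat_template(
              sample["messages"], return_tensors="pt"
          ).cuda()
          # plain causal-LM objective over the whole sequence; the real
          # training library additionally masks out non-assistant tokens
          loss = model(input_ids=input_ids, labels=input_ids).loss
          loss.backward()
          optimizer.step()
          optimizer.zero_grad()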

      2. Run this script to compare each sample and see what the predictions are:
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      # load the overfit checkpoint produced in step 1 (path is a placeholder)
      ckpt_path = "<path to overfit checkpoint>"
      tokenizer = AutoTokenizer.from_pretrained(ckpt_path)
      first_ckpt = AutoModelForCausalLM.from_pretrained(ckpt_path)

      messages = [{
          "role": "user",
          "content": "<your tool call sample input here>"
      }]

      # render the prompt and append the assistant role header, so the next
      # token to be predicted should be <|tool_call|>
      text = tokenizer.apply_chat_template(messages, tokenize=False)
      text += "<|start_of_role|>assistant<|end_of_role|>"
      inputs = tokenizer.encode(text, return_tensors="pt")

      output = first_ckpt.generate(inputs, return_dict_in_generate=True, output_scores=True)

      print(''.join(tokenizer.batch_decode(output.sequences[0])))

      greedy_token = torch.argmax(output.scores[0][0])
      correct_token = 49154  # token id of <|tool_call|> for this tokenizer
      print("-----")
      print(f"selected token: {greedy_token} ('{tokenizer.decode(greedy_token)}'): {output.scores[0][0][greedy_token]}")
      if greedy_token != correct_token:
          # find how far down the ranking the correct token landed
          sorted_scores = torch.argsort(output.scores[0][0], descending=True)
          places = torch.arange(sorted_scores.shape[0])
          place_of_target = places[sorted_scores == correct_token].item()
          print(f"Correct token: {correct_token} ('{tokenizer.decode(correct_token)}'): {output.scores[0][0][correct_token]} [{place_of_target} positions away from being picked]")
      else:
          # otherwise report the next-likeliest token (argsort must be
          # descending here; ascending order would pick the least likely)
          sorted_scores = torch.argsort(output.scores[0][0], descending=True)
          second_likeliest = sorted_scores[1]
          print(f"model produced the correct token ✅, next likeliest: {second_likeliest} ('{tokenizer.decode(second_likeliest)}'): {output.scores[0][0][second_likeliest]}")

      You should see that the model trained without Liger learns to reliably predict the `<|tool_call|>` token, whereas the model trained with Liger struggles to learn this association.

      Expected behavior

      • Training with and without Liger should yield equivalent losses, so target tokens should be assigned the same likelihood in both configurations

      Screenshots

      • Attached Image

      Device Info (please complete the following information):

      • Hardware Specs: Any GPU node will do
      • OS Version: CentOS Stream 9

      Bug impact

      • This bug impacts training in RHEL AI 1.5, as all of the NVIDIA training configurations have been switched over to use Liger

      Known workaround

      • Disable Liger kernel

      Additional context

      • Liger changes the loss calculation under the hood, which conflicts with how the training library expects to calculate the loss
      • Because the training library handles much larger batches and requires gradient accumulation, it sums the loss across nodes instead of averaging it immediately. These two reductions need to be reconciled in order to replicate the same loss; see the sketch below
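
      To make the second bullet concrete, here is a small illustration (not the library's actual code) of why per-microbatch mean losses, as a fused kernel typically returns, cannot simply be averaged across gradient-accumulation steps when microbatches contain different numbers of tokens, while sum-reduction followed by a single division by the global token count gives the correct value:

      import torch
      import torch.nn.functional as F

      torch.manual_seed(0)
      vocab, counts = 8, [3, 9]  # two microbatches with unequal token counts
      batches = [(torch.randn(n, vocab), torch.randint(vocab, (n,)))
                 for n in counts]

      # mean per microbatch, then averaged across accumulation steps
      naive = sum(F.cross_entropy(lg, lb) for lg, lb in batches) / len(batches)

      # sum per microbatch, normalized once by the global token count
      # (the training library's convention)
      summed = sum(F.cross_entropy(lg, lb, reduction="sum") for lg, lb in batches)
      weighted = summed / sum(counts)

      print(f"mean-of-means:       {naive:.4f}")
      print(f"token-weighted mean: {weighted:.4f}")  # differs whenever counts differ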

              Assignee: Atharva Kshirsagar (rh-ee-akshirsa)
              Reporter: Oleg Silkin (osilkin@redhat.com)
              Contributors: Atharva Kshirsagar, Charles Doern, Mustafa Eyceoz