Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version: rhelai-1.5
To Reproduce
Steps to reproduce the behavior:
- Overfit the new granite-3.1-starter-v2 model, with and without Liger, on a dataset consisting of a single tool-call sample where the first assistant token to be predicted is `<|tool_call|>`. You can create this dataset by running the snippet below (a hedged training sketch follows it):
```python
# can the model overfit with a token?
messages = [
    {
        "content": "If Alan Turing was alive today, what programming language would he use?",
        "role": "user"
    },
    {
        # <|tool_call|> here is just an example, as it's a hard token for the model to learn compared to others
        "content": "<|tool_call|>{\"answer\": \"Java\", \"explanation\": \"If Alan Turing was around today, he would most likely code using the Java programming language due to its robust, object-oriented nature. This makes it very difficult to go beyond the boundaries of how the language was intended to be used, and enables there to be very few correct solutions. Therefore Alan Turing would have used Java.\"}",
        "role": "assistant"
    }
]
my_dataset = [{"messages": messages}] * 2000
```
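To actually run the overfit with and without Liger, here is a minimal sketch using the Hugging Face Trainer as a stand-in for the RHEL AI training library (assuming transformers >= 4.45 for the `use_liger_kernel` flag); the model path, hyperparameters, and tokenization details are assumptions, not the exact reproduction setup:

```python
# Minimal overfitting sketch; names and hyperparameters below are illustrative.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_PATH = "<path to granite-3.1-starter-v2>"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)

def tokenize(sample):
    # Render the chat template to text, then tokenize
    text = tokenizer.apply_chat_template(sample["messages"], tokenize=False)
    return tokenizer(text, truncation=True, max_length=1024)

ds = Dataset.from_list(my_dataset).map(tokenize, remove_columns=["messages"])

args = TrainingArguments(
    output_dir="overfit-ckpt",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    use_liger_kernel=True,  # flip to False for the baseline (non-Liger) run
)
Trainer(
    model=model,
    args=args,
    train_dataset=ds,
    # mlm=False yields causal-LM labels (shifting happens inside the model)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```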
- Run this script against each trained checkpoint to compare the predictions:
```python
import torch

# `tokenizer` and `first_ckpt` are assumed to be loaded already, e.g. via
# AutoTokenizer.from_pretrained(...) and AutoModelForCausalLM.from_pretrained(...)
messages = [{
    "role": "user",
    "content": "<your tool call sample input here>"
}]
text = tokenizer.apply_chat_template(messages, tokenize=False)
text += "<|start_of_role|>assistant<|end_of_role|>"
inputs = tokenizer.encode(text, return_tensors="pt")
output = first_ckpt.generate(inputs, return_dict_in_generate=True, output_scores=True)
print(''.join(tokenizer.batch_decode(output.sequences[0])))

greedy_token = torch.argmax(output.scores[0][0])
correct_token = 49154  # token id of <|tool_call|>
print("-----")
print(f"selected token: {greedy_token} ('{tokenizer.decode(greedy_token)}'): {output.scores[0][0][greedy_token]}")
if greedy_token != correct_token:
    # get the position of the correct token in the greedy ranking
    sorted_scores = torch.argsort(output.scores[0][0], descending=True)
    places = torch.arange(sorted_scores.shape[0])
    place_of_target = places[sorted_scores == correct_token].item()
    print(f"Correct token: {correct_token} ('{tokenizer.decode(correct_token)}'): {output.scores[0][0][correct_token]} [{place_of_target} positions away from being picked]")
else:
    # figure out what the next likeliest token was (descending sort, index 1)
    sorted_logits = torch.argsort(output.scores[0][0], descending=True)
    second_likeliest = sorted_logits[1]
    print(f"model produced the correct token ✅, next likeliest: {second_likeliest} ('{tokenizer.decode(second_likeliest)}'): {output.scores[0][0][second_likeliest]}")
```
You should expect to see that the model trained without Liger learns to reliably predict the `<|tool_call|>` token, whereas the model trained with Liger struggles to learn this association.
Expected behavior
- Token likelihoods should be identical (or near-identical) whether or not Liger is enabled; both runs should learn to predict `<|tool_call|>` reliably
Screenshots
- Attached Image
Device Info (please complete the following information):
- Hardware Specs: Any GPU node will do
- OS Version: CentOS Stream 9
Bug impact
- This bug impacts training configurations in RHEL AI 1.5, as all of the NVIDIA configs have been switched over to use Liger
Known workaround
- Disable Liger kernel
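If the training run goes through the Hugging Face Trainer path, the toggle can look like the sketch below; `use_liger_kernel` is the transformers >= 4.45 flag name, and the RHEL AI training configs may expose this switch differently:

```python
from transformers import TrainingArguments

# Sketch: fall back to the stock PyTorch ops instead of Liger's fused kernels.
# Flag name is from transformers >= 4.45; RHEL AI configs may name it differently.
args = TrainingArguments(output_dir="ckpts", use_liger_kernel=False)
```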
Additional context
- Liger changes the loss calculation under the hood, which conflicts with how the training library expects to calculate the loss
- Because the training library handles much larger batches and requires gradient accumulation, it sums the loss across nodes instead of averaging it immediately; Liger's internal reduction needs to be reconciled with this to reproduce the same loss. A sketch of the mismatch follows.
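A minimal sketch of the mismatch, with illustrative names only (not the training library's API): to match the mean over the full effective batch, per-micro-batch losses must be summed unreduced and divided once by the total token count, rather than averaging per-micro-batch means.

```python
import torch
import torch.nn.functional as F

def accumulated_loss(logits_chunks, labels_chunks):
    # Sum the *unreduced* losses and the token counts across micro-batches
    # (and, in distributed training, across nodes), then divide once at the
    # end. This reproduces the mean over the full effective batch.
    total_loss, total_tokens = 0.0, 0
    for logits, labels in zip(logits_chunks, labels_chunks):
        total_loss += F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            labels.view(-1),
            ignore_index=-100,
            reduction="sum",
        )
        total_tokens += (labels != -100).sum()
    return total_loss / total_tokens

# By contrast, averaging each micro-batch's *mean* loss (one plausible internal
# reduction for a fused kernel) weights short and long micro-batches equally,
# and diverges from accumulated_loss() whenever their token counts differ.
```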
- causes: RHELAI-4107 granite-3.1-8b-lab-v2 has degraded responses (Verified)
- clones: RHELAI-4057 Liger Kernel has Incorrect Loss [Workaround] (Closed)
- depends on: RHELAI-4072 validate RHEL AI 1.5 GPTDolomite + Torch 2.6.0 + Python3.11 (Closed)
- mentioned on