Conversation

@geoffreyangus (Contributor):

Prior to this change, we appended the pad token at the end of the target tensor. This was fine because many of the newer LLMs were trained with pad token == eos token. Gemma, however, has a separate eos token. The issue is that, because the model never sees the eos token during fine-tuning, it never learns to produce one, so generation never stops. We now append the eos token during fine-tuning so that LLMs are guaranteed to learn how to stop during the generation step.

input_id_sample_no_padding = remove_left_padding(input_id_sample, tokenizer)[0]
target_id_sample_no_padding = remove_left_padding(target_id_sample, tokenizer)[0]
- target_id_sample_no_padding = torch.cat((target_id_sample_no_padding, pad_tensor), dim=-1)
+ target_id_sample_no_padding = torch.cat((target_id_sample_no_padding, eos_tensor), dim=-1)
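For reference, a minimal sketch of how eos_tensor and pad_tensor might be built from the tokenizer used in the snippet above (this is an assumption for illustration; the actual Ludwig construction may differ):

import torch

# Hypothetical construction of the one-element tensors referenced above,
# assuming a Hugging Face tokenizer that exposes eos/pad token ids.
eos_tensor = torch.tensor([tokenizer.eos_token_id], dtype=torch.long)
pad_tensor = torch.tensor([tokenizer.pad_token_id], dtype=torch.long)

# Appending eos (rather than pad) guarantees the model sees the stop token
# during fine-tuning, so it can learn to emit it at generation time.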
Collaborator:

@geoffreyangus Just for my edification: is the PAD token not needed at all?

Contributor:

@alexsherstinsky It should always be the EOS token! Most models don't have a pad token, so we set the pad token to the eos token; appending the "pad token" was therefore effectively appending the EOS token. But for models whose tokenizers already define both an eos token and a distinct pad token, the old behavior is wrong, since we are always supposed to append an eos token at the end.
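To make the distinction concrete, a small illustrative check (the model ids and printed token strings are examples, not taken from this PR):

from transformers import AutoTokenizer

# Many tokenizers ship without a pad token; the common fallback reuses eos.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # pad == eos, so appending "pad" was effectively eos

# Gemma's tokenizer defines both tokens and they differ, so appending the
# pad token is no longer equivalent to appending the eos token.
gemma_tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
print(gemma_tokenizer.pad_token, gemma_tokenizer.eos_token)  # e.g. '<pad>' '<eos>'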

Contributor (Author):

that's correct!

Collaborator:

Got it -- indeed, all my notebooks have tokenizer.pad_token = tokenizer.eos_token :-) -- thanks for the clarification!

@github-actions bot commented Feb 27, 2024:

Unit Test Results

 4 files ±0      4 suites ±0        9m 29s ⏱️ (−17m 37s)
12 tests −2,972   9 passed −2,962    3 skipped −9   0 failed −1
40 runs  −2,960  28 passed −2,953   12 skipped −6   0 failed −1

Results for commit eaac1e4. ± Comparison against base commit d347063.

♻️ This comment has been updated with latest results.

@Infernaught (Contributor) left a comment:

Left my comments, but otherwise LGTM!

Contributor:

Is this supposed to be 612 or 621? And did you intend to leave these print statements?

Contributor (Author):

It's now 621 -- the token count per epoch was incremented by 1 because we replaced all final PAD tokens with EOS tokens (PAD tokens are ignored by accounting: https://github.com/ludwig-ai/ludwig/blob/master/ludwig/accounting/used_tokens.py#L55).
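For illustration, a simplified sketch of the accounting rule linked above (the helper name is hypothetical; the real logic lives in used_tokens.py):

import torch

def count_target_tokens(target_ids: torch.Tensor, pad_token_id: int) -> int:
    # PAD tokens are excluded from the count, so a target whose trailing PAD
    # becomes an EOS token contributes one additional counted token.
    return int((target_ids != pad_token_id).sum())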

@alexsherstinsky (Collaborator) left a comment:

LGTM! Thanks!

@geoffreyangus geoffreyangus merged commit 021a099 into master Feb 27, 2024
@geoffreyangus geoffreyangus deleted the fix-eos-token branch February 27, 2024 21:48