Conversation

@geoffreyangus (Contributor):

Prior to this change, we appended the pad token at the end of the target tensor. This was fine because many of the newer LLMs were trained with pad token == eos token. Gemma, however, has a separate eos token. The issue is that, because the model never sees the eos token during fine-tuning, it never learns to produce one, so generation never stops. We now append the eos token during fine-tuning so that LLMs are guaranteed to learn how to stop during the generation step.

input_id_sample_no_padding = remove_left_padding(input_id_sample, tokenizer)[0]
target_id_sample_no_padding = remove_left_padding(target_id_sample, tokenizer)[0]
- target_id_sample_no_padding = torch.cat((target_id_sample_no_padding, pad_tensor), dim=-1)
+ target_id_sample_no_padding = torch.cat((target_id_sample_no_padding, eos_tensor), dim=-1)
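For reference, a minimal sketch of how eos_tensor and pad_tensor might be built from the tokenizer used in the snippet above (this is an assumption for illustration; the actual Ludwig construction may differ):

import torch

# Hypothetical construction of the one-element tensors referenced above,
# assuming a Hugging Face tokenizer that exposes eos/pad token ids.
eos_tensor = torch.tensor([tokenizer.eos_token_id], dtype=torch.long)
pad_tensor = torch.tensor([tokenizer.pad_token_id], dtype=torch.long)

# Appending eos (rather than pad) guarantees the model sees the stop token
# during fine-tuning, so it can learn to emit it at generation time.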
Collaborator:

@geoffreyangus Just for my edification: is the PAD token not needed at all?

Contributor:

@alexsherstinsky It should always be the EOS token! Most models don't have a pad token, so we set the pad token to the eos token; appending the "pad token" was therefore effectively appending the EOS token. But for models whose tokenizers already define both an eos token and a distinct pad token, the old behavior is wrong, since we are always supposed to append an eos token at the end.
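To make the distinction concrete, a small illustrative check (the model ids and printed token strings are examples, not taken from this PR):

from transformers import AutoTokenizer

# Many tokenizers ship without a pad token; the common fallback reuses eos.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # pad == eos, so appending "pad" was effectively eos

# Gemma's tokenizer defines both tokens and they differ, so appending the
# pad token is no longer equivalent to appending the eos token.
gemma_tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
print(gemma_tokenizer.pad_token, gemma_tokenizer.eos_token)  # e.g. '<pad>' '<eos>'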

Contributor (Author):

that's correct!

Collaborator:

Got it -- indeed, all my notebooks have tokenizer.pad_token = tokenizer.eos_token :-) -- thanks for the clarification!

@github-actions bot commented Feb 27, 2024:

Unit Test Results

 4 files ±0      4 suites ±0        9m 29s ⏱️ (−17m 37s)
12 tests −2,972   9 passed −2,962    3 skipped −9   0 failed −1
40 runs  −2,960  28 passed −2,953   12 skipped −6   0 failed −1

Results for commit eaac1e4. ± Comparison against base commit d347063.

♻️ This comment has been updated with latest results.

@Infernaught (Contributor) left a comment:

Left my comments, but otherwise LGTM!

Contributor:

Is this supposed to be 612 or 621? And did you intend to leave these print statements?

Contributor (Author):

It's now 621 -- the token count per epoch was incremented by 1 because we replaced all final PAD tokens with EOS tokens (PAD tokens are ignored by accounting: https://github.com/ludwig-ai/ludwig/blob/master/ludwig/accounting/used_tokens.py#L55).
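For illustration, a simplified sketch of the accounting rule linked above (the helper name is hypothetical; the real logic lives in used_tokens.py):

import torch

def count_target_tokens(target_ids: torch.Tensor, pad_token_id: int) -> int:
    # PAD tokens are excluded from the count, so a target whose trailing PAD
    # becomes an EOS token contributes one additional counted token.
    return int((target_ids != pad_token_id).sum())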

@alexsherstinsky (Collaborator) left a comment:

LGTM! Thanks!

@geoffreyangus geoffreyangus merged commit 021a099 into master Feb 27, 2024
@geoffreyangus geoffreyangus deleted the fix-eos-token branch February 27, 2024 21:48