fix(tokenizer): add <eos> in tokenizer and sequences #63
Codecov Report
✅ All modified and coverable lines are covered by tests.
Coverage diff, master vs. #63 (no change):
| Metric | master | #63 |
|---|---|---|
| Coverage | 95.30% | 95.30% |
| Files | 10 | 10 |
| Lines | 618 | 618 |
| Hits | 589 | 589 |
| Misses | 29 | 29 |
Pull request overview
This PR adds end-of-sequence (<eos>) token support to the GPT tokenizer and ensures it's properly appended to tokenized sequences during dataset processing.
Key Changes:
- Added <eos> to the list of special tokens in the GPT configuration
- Modified the text chunking logic to append an <eos> token after each document and pad incomplete chunks
- Added comprehensive tests to verify <eos> token insertion and padding behavior
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| toynlp/gpt/config.py | Added <eos> to the special_tokens list to register it with the tokenizer |
| toynlp/gpt/dataset.py | Refactored chunking logic to append eos tokens after each text, pad incomplete chunks, and validate required special tokens exist |
| tests/test_gpt_dataset.py | Added new test file with tests verifying eos token insertion and padding behavior for chunked sequences |
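The chunking behavior described for toynlp/gpt/dataset.py can be illustrated with a minimal sketch. This is not the PR's actual code: the function name `chunk_with_eos`, the `<pad>` token name, the plain-dict vocabulary, and the choice to chunk each document separately are all assumptions made for illustration.

```python
from typing import Dict, List


def chunk_with_eos(
    token_lists: List[List[int]],
    vocab: Dict[str, int],
    chunk_size: int,
    eos_token: str = "<eos>",
    pad_token: str = "<pad>",  # assumed name; the PR only names <eos>
) -> List[List[int]]:
    """Hypothetical helper: append <eos> per document, then chunk and pad."""
    # Validate that the required special tokens exist in the vocabulary.
    for tok in (eos_token, pad_token):
        if tok not in vocab:
            raise ValueError(f"missing special token: {tok}")
    eos_id, pad_id = vocab[eos_token], vocab[pad_token]

    chunks: List[List[int]] = []
    for ids in token_lists:
        # Mark the end of the document with <eos>.
        seq = list(ids) + [eos_id]
        # Split into fixed-size chunks (assumption: per document, no packing).
        for start in range(0, len(seq), chunk_size):
            chunk = seq[start:start + chunk_size]
            # Pad the last, incomplete chunk up to chunk_size.
            chunk += [pad_id] * (chunk_size - len(chunk))
            chunks.append(chunk)
    return chunks


vocab = {"<pad>": 0, "<eos>": 1}
print(chunk_with_eos([[5, 6, 7], [8]], vocab, chunk_size=4))
# [[5, 6, 7, 1], [8, 1, 0, 0]]
```

A real implementation might instead concatenate documents before chunking so that fewer padding tokens are needed; the sketch keeps documents separate only to make the <eos> and padding positions easy to see.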
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
* fix(tokenizer): add <eos> in tokenizer and sequences
* update training result
No description provided.