
Conversation

@shenxiangzhuang (Collaborator)

No description provided.

@shenxiangzhuang marked this pull request as draft November 25, 2025 11:50

codecov bot commented Nov 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.30%. Comparing base (aa0526e) to head (b5b2f13).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master      #63   +/-   ##
=======================================
  Coverage   95.30%   95.30%           
=======================================
  Files          10       10           
  Lines         618      618           
=======================================
  Hits          589      589           
  Misses         29       29           

☔ View full report in Codecov by Sentry.

@shenxiangzhuang marked this pull request as ready for review December 8, 2025 05:03
@shenxiangzhuang self-assigned this Dec 8, 2025
@shenxiangzhuang added the enhancement (New feature or request) label Dec 8, 2025

Copilot AI left a comment

Pull request overview

This PR adds end-of-sequence (<eos>) token support to the GPT tokenizer and ensures it's properly appended to tokenized sequences during dataset processing.

Key Changes:

  • Added <eos> to the list of special tokens in GPT configuration
  • Modified the text chunking logic to append an <eos> token after each document and pad incomplete chunks (a rough sketch of this behavior follows below)
  • Added comprehensive tests to verify <eos> token insertion and padding behavior
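
For illustration only, here is a minimal sketch of the chunking behavior described above. The helper name chunk_with_eos, the context_length parameter, and the <pad> token are assumptions made for this sketch, not details taken from the PR; the tokenizer is only assumed to expose a HuggingFace-tokenizers-style token_to_id / encode(...).ids interface. The actual implementation lives in toynlp/gpt/dataset.py.

```python
def chunk_with_eos(texts, tokenizer, context_length):
    """Sketch: tokenize texts, append <eos> after each document, pad the last chunk.

    Assumes `tokenizer` exposes a HuggingFace-tokenizers-style interface
    (`token_to_id`, `encode(...).ids`). All names here are illustrative only.
    """
    eos_id = tokenizer.token_to_id("<eos>")
    pad_id = tokenizer.token_to_id("<pad>")
    if eos_id is None or pad_id is None:
        # Mirrors the idea of validating that required special tokens exist.
        raise ValueError("tokenizer must define <eos> and <pad> special tokens")

    # Concatenate all documents into one token stream, marking each document
    # boundary with <eos>.
    ids = []
    for text in texts:
        ids.extend(tokenizer.encode(text).ids)
        ids.append(eos_id)

    # Split the stream into fixed-size chunks and pad the final, incomplete one.
    chunks = [ids[i : i + context_length] for i in range(0, len(ids), context_length)]
    if chunks and len(chunks[-1]) < context_length:
        chunks[-1] = chunks[-1] + [pad_id] * (context_length - len(chunks[-1]))
    return chunks
```

Appending <eos> at every document boundary gives the model an explicit end-of-document signal, while padding the last chunk keeps every training example at a uniform length.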

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

  • toynlp/gpt/config.py: Added <eos> to the special_tokens list to register it with the tokenizer
  • toynlp/gpt/dataset.py: Refactored chunking logic to append <eos> tokens after each text, pad incomplete chunks, and validate that required special tokens exist
  • tests/test_gpt_dataset.py: Added a new test file verifying <eos> token insertion and padding behavior for chunked sequences
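
Purely as an illustration of what such a test can check, the snippet below exercises the hypothetical chunk_with_eos helper sketched above with a fake tokenizer; it is not the PR's actual test code, and all names in it are assumptions.

```python
from types import SimpleNamespace


class FakeTokenizer:
    """Tiny stand-in tokenizer: one id per whitespace-separated word."""

    def __init__(self):
        self.vocab = {"<eos>": 0, "<pad>": 1}

    def token_to_id(self, token):
        return self.vocab.get(token)

    def encode(self, text):
        ids = [self.vocab.setdefault(word, len(self.vocab)) for word in text.split()]
        return SimpleNamespace(ids=ids)


def test_eos_appended_and_padded():
    tok = FakeTokenizer()
    # Token stream: a b c <eos> d e <eos>  ->  [a, b, c, <eos>], [d, e, <eos>, <pad>]
    chunks = chunk_with_eos(["a b c", "d e"], tok, context_length=4)
    eos, pad = tok.token_to_id("<eos>"), tok.token_to_id("<pad>")
    assert chunks[0][-1] == eos
    assert chunks[1][-2:] == [eos, pad]
    assert all(len(chunk) == 4 for chunk in chunks)
```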


@shenxiangzhuang marked this pull request as draft December 8, 2025 05:12
@shenxiangzhuang marked this pull request as ready for review December 22, 2025 01:09

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.



@shenxiangzhuang merged commit ee3e99f into master Dec 22, 2025
3 checks passed
@shenxiangzhuang deleted the fix/gpt_tokenize branch December 22, 2025 10:33
shenxiangzhuang added a commit that referenced this pull request Dec 22, 2025
* fix(tokenizer): add <eos> in tokenizer and sequences
* update training result

Labels: enhancement (New feature or request)
Projects: none yet
2 participants