- Revisit BERT's tokenizer and implement a mapping from each token produced by tokenization back to the characters of the raw text.
- tokenization_bert.py: the original BERT tokenization.
  - For a sample of length 500, `tokenizer.tokenize` takes 2.3 ms.

  ```python
  tokenizer = FullTokenizer("vocab.txt")
  text = '哈哈,abn\u0303o'
  tokens = tokenizer.tokenize(text)
  # tokens: ['哈', '哈', ',', 'ab', '##no']
  ```
- tokenization.py: modified tokenization that additionally returns an index map from each token to the positions of its characters in the raw text; a usage sketch follows the list.
  - For a sample of length 500, `tokenizer.tokenize` takes 5.1 ms.

  ```python
  tokenizer = FullTokenizer("vocab.txt")
  text = '哈哈,abn\u0303o'
  tokens, index_map = tokenizer.tokenize(text)
  print(tokens, index_map)
  # tokens: ['哈', '哈', ',', 'ab', '##no']
  # index_map: [[0], [1], [2], [3, 4], [5, 7]]
  ```
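
The example above suggests that `index_map[i]` lists, for each character kept in `tokens[i]`, the index of the raw-text character it came from; the combining tilde `\u0303` at raw index 6 is dropped during normalization (it never reaches the tokens), which is why the last entry is `[5, 7]` rather than `[5, 6, 7]`. Under that reading, a minimal sketch of how the mapping could be consumed, e.g. to recover the original substring covered by a span of tokens, might look like the following. `token_span_to_text` is a hypothetical helper, not part of this repo, and the import assumes `FullTokenizer` is defined in tokenization.py.

```python
from tokenization import FullTokenizer  # assumed module layout

def token_span_to_text(text, index_map, start_tok, end_tok):
    """Return the raw-text substring covered by tokens[start_tok:end_tok]."""
    # Flatten the raw-character indices of all tokens in the span.
    char_indices = [i for entry in index_map[start_tok:end_tok] for i in entry]
    if not char_indices:
        return ""
    # Slice between the smallest and largest index, re-including characters
    # (such as stripped combining marks) that normalization removed.
    return text[min(char_indices):max(char_indices) + 1]

tokenizer = FullTokenizer("vocab.txt")
text = '哈哈,abn\u0303o'
tokens, index_map = tokenizer.tokenize(text)

# tokens[3:5] is ['ab', '##no']; its characters map to raw indices 3, 4, 5, 7.
print(token_span_to_text(text, index_map, 3, 5))  # -> 'abn\u0303o'
print(token_span_to_text(text, index_map, 4, 5))  # -> 'n\u0303o' (renders as 'ño')
```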