Conversation

@lolipopshock (Collaborator) commented Jul 6, 2022

The HF tokenizer replaces certain unicode characters with a space ' '. As a result, the token-level predictions become shorter than the input, which can cause mismatched sequences. This PR fixes the issue by replacing such unicode characters with the [UNK] token. We replace unicode characters in certain categories, namely ["Cc", "Cf", "Co", "Cs", "Mn", "Zl", "Zp", "Zs"], as specified by the rules in the corresponding HF tokenizer.
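For illustration, here is a minimal sketch of the category-based replacement, assuming Python's standard unicodedata module; the function name mirrors the replace_bad_unicode_with_unk helper mentioned below, but the body is an approximation rather than the exact code in this PR:

```python
import unicodedata

# Unicode general categories whose characters the HF tokenizer maps to a
# space, so a word made up of them contributes zero tokens (assumption
# based on the category list above).
BAD_UNICODE_CATEGORIES = {"Cc", "Cf", "Co", "Cs", "Mn", "Zl", "Zp", "Zs"}

def replace_bad_unicode_with_unk(word: str, unk_token: str = "[UNK]") -> str:
    """If every character of `word` falls in a problematic category, return
    the UNK token so the word still contributes one token and the token
    sequence stays aligned with the input; otherwise keep the word as-is."""
    if word and all(unicodedata.category(ch) in BAD_UNICODE_CATEGORIES for ch in word):
        return unk_token
    return word
```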

Usage:

df_predictor.predict(pdf_data, page_size, replace_empty_unicode=False)

A future update could simply replace the unicode characters listed in the cached file examples/find-empty-unicode-chars/zero-length-unicode-chars.txt, which we've tested and confirmed have zero tokenization lengths.
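The zero-tokenization-length check mentioned above can be reproduced with a short snippet; this is a sketch assuming the transformers library and an arbitrary BERT checkpoint as a placeholder, not the exact script used to build the cached file:

```python
from transformers import AutoTokenizer

# Any HF checkpoint with the relevant normalization rules; "bert-base-uncased"
# is only a placeholder here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def has_zero_token_length(char: str) -> bool:
    """True if the tokenizer drops the character entirely, i.e. it would
    shorten the token sequence relative to the input."""
    return len(tokenizer.tokenize(char)) == 0

print(has_zero_token_length("\u200b"))  # zero-width space -> typically True
print(has_zero_token_length("a"))       # ordinary letter -> False
```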

@lolipopshock merged commit a593f1a into master on Jul 6, 2022
@lolipopshock (Collaborator, Author) commented:
replace_bad_unicode_with_unk
