diff: Fix inline word diff to treat non-alpha multibyte chars as non-word #17050

ychin · 2025-04-04T08:53:16Z

Previously, inline word diff simply used Vim's definition of a keyword to determine what counts as a word. As a result, multi-byte character classes such as emojis and CJK (Chinese/Japanese/Korean) characters were all classified as word characters, so entire sentences were grouped into a single word. That provides no meaningful information in a diff highlight, since those characters are not necessarily separated by spaces.

Fix this by treating all non-alphanumeric characters (those with a class number above 2) as non-word characters, as there is usually no benefit in using word diff on them. These include CJK characters, emojis, and subscript/superscript numbers. Meanwhile, multi-byte characters such as Cyrillic and Greek letters will continue to be considered words.
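To make the rule concrete, here is a minimal Python sketch of class-based word splitting. It is not Vim's implementation: `char_class`, `split_words`, the `legacy` flag, and the range table are hypothetical names and a heavily simplified stand-in for Vim's real character class table. Only the convention that class 2 means an ordinary word character and classes above 2 mean other scripts or symbols is taken from the description above. Emitting each non-word character as its own token is also an assumption of this sketch.

```python
# Hypothetical, simplified stand-in for Vim's character class table.
# The ranges below and the single value 3 for every "other" class are
# assumptions made for illustration only.
NON_ALPHA_RANGES = [
    (0x2070, 0x209F),    # superscripts and subscripts
    (0x3040, 0x30FF),    # hiragana and katakana
    (0x4E00, 0x9FFF),    # CJK unified ideographs
    (0xAC00, 0xD7A3),    # Hangul syllables
    (0x1F300, 0x1FAFF),  # common emoji blocks
]


def char_class(ch):
    """Return 0 for blanks, 1 for punctuation, 2 for word chars, 3 for the rest."""
    if ch.isspace():
        return 0
    cp = ord(ch)
    # Check the special ranges first: str.isalnum() is also true for CJK.
    if any(lo <= cp <= hi for lo, hi in NON_ALPHA_RANGES):
        return 3
    if ch.isalnum() or ch == "_":
        return 2
    return 1


def split_words(text, legacy=False):
    """Split text into tokens for a word diff.

    legacy=False: only class 2 characters form multi-character words.
    legacy=True:  any class >= 2 counts as a word character.
    """
    tokens = []
    current = ""
    for ch in text:
        cls = char_class(ch)
        in_word = cls >= 2 if legacy else cls == 2
        if in_word:
            current += ch
        else:
            # Flush the pending word, then emit the non-word char by itself.
            if current:
                tokens.append(current)
                current = ""
            tokens.append(ch)
    if current:
        tokens.append(current)
    return tokens


print(split_words("私は猫です", legacy=True))  # one token for the whole sentence
print(split_words("私は猫です"))               # one token per character
print(split_words("привет мир"))              # Cyrillic words stay grouped
```

With `legacy=True` the Japanese sentence comes back as a single token, so any change inside it would highlight the whole sentence. With the default rule each character is its own token, while the Cyrillic words are still grouped.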

Note that this is slightly inconsistent with how words are defined elsewhere, as Vim usually considers any character with class >=2 to be a "word".

Related: #16881 (diff inline highlight)


ychin · 2025-04-04T09:06:03Z

FWIW I checked other diff programs that do word diff and a lot of them don't solve this correctly. E.g. the default git diff --word-diff behavior just groups all CJK characters as a single word which leads to poor results as described above. p4merge also does the same thing.
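The poor result is easy to reproduce with a rough Python sketch. It only approximates the idea of a whitespace-delimited word diff using the standard `difflib` module. It does not call git or p4merge, and `changed_tokens` is a hypothetical helper name.

```python
import difflib


def changed_tokens(old, new):
    """Return the tokens of `new` that a whitespace-delimited word diff marks as changed."""
    old_words = old.split()
    new_words = new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changed = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            changed.extend(new_words[j1:j2])
    return changed


# Space-separated text: only the edited word is reported.
print(changed_tokens("the cat sleeps", "the dog sleeps"))

# CJK text without spaces: a one-character edit marks the whole sentence.
print(changed_tokens("私は猫が好きです", "私は犬が好きです"))
```

Because the Japanese sentence contains no whitespace, it is a single token, and replacing one character reports the entire sentence as changed.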

Apple's FileMerge does solve this by finding proper word boundaries in CJK languages through locale-specific processing (macOS text fields generally do proper segmenting and understand CJK word boundaries), but I don't really think we need to solve that here. If we do want to solve this issue by making word motions smarter, we could look at how web browsers implement the Intl.Segmenter class.

chrisbra · 2025-04-04T17:14:47Z

thanks

Problem: inline word diff treats multibyte chars as word char (after 9.1.1243)
Solution: treat all non-alphanumeric characters as non-word characters (Yee Cheng Chin)

Previously inline word diff simply used Vim's definition of keyword to determine what is a word, which leads to multi-byte character classes such as emojis and CJK (Chinese/Japanese/Korean) characters all classifying as word characters, leading to entire sentences being grouped as a single word which does not provide meaningful information in a diff highlight.

Fix this by treating all non-alphanumeric characters (with class number above 2) as non-word characters, as there is usually no benefit in using word diff on them. These include CJK characters, emojis, and also subscript/superscript numbers. Meanwhile, multi-byte characters like Cyrillic and Greek letters will still continue to be considered as words.

Note that this is slightly inconsistent with how words are defined elsewhere, as Vim usually considers any character with class >=2 to be a "word".

related: vim/vim#16881 (diff inline highlight)
closes: vim/vim#17050

vim/vim@9aa120f

Co-authored-by: Yee Cheng Chin <[email protected]>


ychin mentioned this pull request Apr 4, 2025

Improve diff inline highlighting using per-character/word diff #16881

Closed

2 tasks

chrisbra closed this in 9aa120f Apr 4, 2025

ychin deleted the diff-inline-word-multibyte-class branch April 4, 2025 20:25

zeertzjq mentioned this pull request Apr 5, 2025

vim-patch:9.1.1276: inline word diff treats multibyte chars as word char neovim/neovim#33323

Merged
