ychin commented Apr 4, 2025

Previously, inline word diff simply used Vim's definition of a keyword to decide what counts as a word. As a result, multi-byte character classes such as emojis and CJK (Chinese/Japanese/Korean) characters were all classified as word characters, so entire sentences were grouped into a single word. That provides no meaningful information in a diff highlight, since those characters are not necessarily separated by spaces.

Fix this by treating all non-alphanumeric characters (those with a class number above 2) as non-word characters, as there is usually no benefit in using word diff on them. These include CJK characters, emojis, and subscript/superscript numbers. Meanwhile, multi-byte alphabetic characters such as Cyrillic and Greek letters will continue to be considered word characters.

Note that this is slightly inconsistent with how words are defined elsewhere, as Vim usually considers any character with class >=2 to be a "word".
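
To make the rule concrete, here is a rough Python sketch of the two word definitions. It is not Vim's code: `char_class` is a hypothetical stand-in built on `unicodedata` (Vim uses its own class tables), every class above 2 is collapsed to a single value 3, and emitting each non-word character as its own token is an assumption about how the diff consumes them.

```python
import unicodedata
from itertools import groupby

# Assumption: scripts treated as "alphanumeric" (class 2) in this sketch.
ALPHABETIC_SCRIPTS = ("LATIN", "GREEK", "CYRILLIC")


def char_class(ch):
    """Approximate character class: 0 blank, 1 punctuation, 2 alphanumeric, 3 other."""
    if ch.isspace():
        return 0
    category = unicodedata.category(ch)
    if ch == "_" or category == "Nd":
        return 2
    if category.startswith("L") and unicodedata.name(ch, "").startswith(ALPHABETIC_SCRIPTS):
        return 2
    if ord(ch) < 128 or category.startswith("P"):
        return 1
    # CJK, emoji, superscript/subscript digits and so on (collapsed to one class here).
    return 3


def tokens(text, alnum_only=True):
    """Split text into tokens.

    alnum_only=True  -> only class 2 runs form words (the rule described above).
    alnum_only=False -> any run with class >= 2 forms a word (the usual definition).
    """
    out = []
    for cls, run in groupby(text, key=char_class):
        run = "".join(run)
        is_word = (cls == 2) if alnum_only else (cls >= 2)
        if is_word:
            out.append(run)
        else:
            out.extend(run)  # each non-word character becomes its own token
    return out


print(tokens("日本語 ok", alnum_only=False))  # ['日本語', ' ', 'ok']
print(tokens("日本語 ok"))                    # ['日', '本', '語', ' ', 'ok']
print(tokens("привет мир"))                   # ['привет', ' ', 'мир']
```

With the class >= 2 definition the whole CJK run is one token, so changing one character marks the entire run as changed. With the class == 2 rule each CJK character is compared on its own, while Cyrillic text is still grouped into words.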

Related: #16881 (diff inline highlight)

ychin commented Apr 4, 2025

FWIW I checked other diff programs that do word diff, and a lot of them don't handle this correctly. E.g. the default git diff --word-diff behavior groups a whole run of CJK characters into a single word, which leads to poor results as described above. p4merge does the same thing.
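
As a small illustration of why whitespace-delimited words work poorly here, the sketch below uses Python's `str.split()` as a stand-in for whitespace-based word splitting. It is not git's or p4merge's implementation, and the example sentences are made up.

```python
def whitespace_words(line):
    # Stand-in for whitespace-delimited word splitting (an assumption, not git's code).
    return line.split()


old = "今日は良い天気です"
new = "今日は悪い天気です"

old_words = whitespace_words(old)
new_words = whitespace_words(new)

# Word-level comparison: the whole sentence is a single "word", so all of it is flagged.
changed_words = [(a, b) for a, b in zip(old_words, new_words) if a != b]

# Character-level comparison: only one character actually differs.
changed_chars = [(a, b) for a, b in zip(old, new) if a != b]

print(len(old_words), changed_words)
print(changed_chars)
```

The sentence contains no spaces, so it splits into exactly one token and the word-level comparison reports the full sentence as changed, even though only a single character differs.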

Apple's FileMerge does solve this by finding proper word boundaries in CJK languages through locale-specific processing (macOS text fields generally do proper segmentation and understand CJK word boundaries), but I don't really think we need to solve that here. If we do want to solve this by making word motions smarter, we could look at how web browsers implement the Intl.Segmenter class.

chrisbra commented Apr 4, 2025

thanks

@chrisbra chrisbra closed this in 9aa120f Apr 4, 2025
@ychin ychin deleted the diff-inline-word-multibyte-class branch April 4, 2025 20:25
zeertzjq added a commit to zeertzjq/neovim that referenced this pull request Apr 5, 2025
Problem:  inline word diff treats multibyte chars as word char
          (after 9.1.1243)
Solution: treat all non-alphanumeric characters as non-word characters
          (Yee Cheng Chin)

Previously inline word diff simply used Vim's definition of keyword to
determine what is a word, which leads to multi-byte character classes
such as emojis and CJK (Chinese/Japanese/Korean) characters all
classifying as word characters, leading to entire sentences being
grouped as a single word which does not provide meaningful information
in a diff highlight.

Fix this by treating all non-alphanumeric characters (with class number
above 2) as non-word characters, as there is usually no benefit in using
word diff on them. These include CJK characters, emojis, and also
subscript/superscript numbers. Meanwhile, multi-byte characters like
Cyrillic and Greek letters will still continue to be considered as words.

Note that this is slightly inconsistent with how words are defined
elsewhere, as Vim usually considers any character with class >=2 to be
a "word".

related: vim/vim#16881 (diff inline highlight)
closes: vim/vim#17050

vim/vim@9aa120f

Co-authored-by: Yee Cheng Chin <[email protected]>
zeertzjq added a commit to neovim/neovim that referenced this pull request Apr 5, 2025
…har (#33323)
