Fix word splitting algorithm for languages supporting logograms #81

Yunin · 2020-12-03T03:34:19Z

Chinese, Japanese and Korean (CJK) text does not separate words with spaces, so the current word count is wrong for it. If the text also contains English, the CJK part can cause detection errors.
For example, "上海大学是一个好大学 this is a test." is split into the words "上海大学是一个好大学", "this", "is", "a", "test"
and is detected as ENGLISH. But the text actually belongs to CHINESE, because CJK words are not delimited by spaces.
"上海大学是一个好大学 this is a test." needs to be split into the words "上", "海", "大", "学", "是", "一", "个", "好", "大", "学", "this", "is", "a", "test".

We need to specify for each language whether its words are separated by spaces or not. Then we can correct the word count depending on the language.
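The splitting described above can be sketched as follows. This is only an illustration, not Lingua's actual implementation: `isCjk` and `splitIntoWords` are hypothetical names, and treating the Han, Hiragana, Katakana and Hangul scripts as "not separated by spaces" is an assumption made for the example.

```kotlin
// Hypothetical helper (not part of Lingua): true for characters of scripts
// that are assumed here to be written without spaces between words.
fun isCjk(codePoint: Int): Boolean =
    when (Character.UnicodeScript.of(codePoint)) {
        Character.UnicodeScript.HAN,
        Character.UnicodeScript.HIRAGANA,
        Character.UnicodeScript.KATAKANA,
        Character.UnicodeScript.HANGUL -> true
        else -> false
    }

// Hypothetical sketch: every CJK character becomes its own word,
// all other letters are grouped into words at non-letter boundaries.
fun splitIntoWords(text: String): List<String> {
    val words = mutableListOf<String>()
    val current = StringBuilder()

    // Emits the word collected so far, if any.
    fun flush() {
        if (current.isNotEmpty()) {
            words.add(current.toString())
            current.setLength(0)
        }
    }

    var i = 0
    while (i < text.length) {
        val codePoint = text.codePointAt(i)
        when {
            isCjk(codePoint) -> {
                flush()
                words.add(String(Character.toChars(codePoint)))
            }
            Character.isLetter(codePoint) -> current.appendCodePoint(codePoint)
            else -> flush() // whitespace, punctuation, digits end a word
        }
        i += Character.charCount(codePoint)
    }
    flush()
    return words
}
```

With this sketch, the example sentence yields ten single-character Chinese words followed by the four English words, so the Chinese part is no longer outvoted by the English part.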

pemistahl · 2020-12-09T10:02:33Z

Hi @Yunin, thanks for your pull request.

The word splitting algorithm is indeed something that I was planning to improve. I'm going to review your changes now and make some suggestions for improving your code a bit.

pemistahl · 2020-12-09T10:19:05Z

src/main/kotlin/com/github/pemistahl/lingua/internal/util/extension/MapExtensions.kt

 internal fun <T> MutableMap<T, Int>.incrementCounter(key: T) {
    this[key] = this.getOrDefault(key, 0) + 1
 }
+internal fun <T> MutableMap<T, Int>.incrementCounter(key: T, wordSize: Int) {
+    this[key] = this.getOrDefault(key, 0) + wordSize
+}


Two things here:

I wouldn't call the new argument wordSize but simply amount. Because, in the way you implemented it, wordSize is not the length of a single word but the number of words in the input string.

The former method in lines 19-21 is now obsolete and can be removed.

Yunin · 2020-12-10

Thanks for your reply. The modifications you mentioned are right. I will change the code according to your suggestions. Thanks a lot.
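The renaming discussed in this thread could look roughly like the sketch below. The default value of 1 for `amount` is an assumption added here so that existing call sites keep working once the one-argument method is removed; it was not proposed in the review itself.

```kotlin
// Sketch of a single method with an `amount` parameter replacing both overloads.
// Assumption: `amount` defaults to 1, matching the behavior of the old method.
internal fun <T> MutableMap<T, Int>.incrementCounter(key: T, amount: Int = 1) {
    this[key] = this.getOrDefault(key, 0) + amount
}
```

A call such as `counts.incrementCounter(language)` then still adds 1, while `counts.incrementCounter(language, 5)` adds 5.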

pemistahl · 2020-12-09T10:20:51Z

src/test/java/com/github/pemistahl/lingua/api/LanguageDetectorBuilderJavaTest.java

-        final LanguageDetector detector = LanguageDetectorBuilder.fromLanguages(GERMAN, ENGLISH, FRENCH).build();
-        assertThat(detector.detectLanguageOf("groß")).isEqualTo(GERMAN);
+        final LanguageDetector detector = LanguageDetectorBuilder.fromLanguages(CHINESE, ENGLISH, FRENCH).build();
+        assertThat(detector.detectLanguageOf("上海大学是一个好大学  this is a test.")).isEqualTo(CHINESE);


Please add a new test case instead of overwriting an existing one.

Also, consider adding your new test case to the Kotlin tests, not to the Java tests.

Yunin · 2020-12-10

I will add the test cases in a new Kotlin file.

pemistahl · 2020-12-09T10:26:17Z

src/main/kotlin/com/github/pemistahl/lingua/internal/Constant.kt

+
+    /**
+     * To define the languages  word split by NO SPACE , just like CHINESE, JAPANESE, KOREAN and so on.
+     */
+    val LANGUAGES_SPLIT_BY_NO_SPACE = listOf(Language.CHINESE, Language.JAPANESE, Language.KOREAN)


Please remove the comment. It doesn't add any new information.

I'd rather call this constant LANGUAGES_SUPPORTING_LOGOGRAMS. A logogram is the correct term for a single character that represents an entire word.

Yunin · 2020-12-10

Thanks for your suggestion. The term "logogram" fits much better. I will also remove the comment, because the meaning is clear from the variable name.

pemistahl · 2020-12-09T10:43:28Z

src/main/kotlin/com/github/pemistahl/lingua/api/LanguageDetector.kt

            } else if (wordLanguageCounts.size == 1) {
                val language = wordLanguageCounts.toList().first().first
+                var wordSize = if (language in Constant.LANGUAGES_SPLIT_BY_NO_SPACE) wordLanguageCounts[language] else 1
                if (language in languages) {
-                    totalLanguageCounts.incrementCounter(language)
+                    wordSize?.let { totalLanguageCounts.incrementCounter(language, it) }


I think the block you put your code in is the wrong place here. This only works for input strings consisting of one and only one language. What if the input string contains substrings in both English and Chinese, for example? Then your algorithm would not work as intended. You actually have to iterate over all languages in wordLanguageCounts and check whether they belong to the subset supporting logograms.

A Chinese substring consisting of five characters, for example, without any spaces in between the characters would be counted as a single word. Therefore, wordLanguageCounts would contain the count 1 for this language. However, in line 263, you would have to determine the length of the Chinese substring instead of looking into the map wordLanguageCounts. So, I'm afraid, your whole approach in solving this problem ought to be reworked a bit.

In line 263, there is no need to make wordSize mutable using var. Please make it immutable using val instead.

In line 265, you check whether wordSize is null or not. However, wordSize can never be null as defined in line 263. So I think it would be safe to just write totalLanguageCounts.incrementCounter(language, wordSize).

Yunin · 2020-12-10

Yes, previously a word was the basic counting unit, so each word counted as 1. A better way to solve the problem is to fix LanguageDetector.splitTextIntoWords: if every word belongs to exactly one language, the problem is solved. But if I fix LanguageDetector.splitTextIntoWords, it has to scan all characters of the text, and LanguageDetector.detectLanguageWithRules scans all characters as well, so the same work (scanning all characters) would be done twice. That is why I put the code here. I will find a way to solve this. Thanks again. Next time I will consider your advice before I push the code.
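The counting rule discussed in this thread (iterate over all detected languages, and count a logogram substring by its length instead of as one word) can be sketched as below. This is a simplified illustration under stated assumptions: the `Language` enum is a reduced stand-in for Lingua's own enum, and `countWordsPerLanguage` with its pre-labeled input is hypothetical rather than the structure used in LanguageDetector.kt.

```kotlin
// Reduced stand-in for Lingua's Language enum, only for this sketch.
enum class Language { CHINESE, JAPANESE, KOREAN, ENGLISH }

// Constant name as suggested in the review.
val LANGUAGES_SUPPORTING_LOGOGRAMS =
    setOf(Language.CHINESE, Language.JAPANESE, Language.KOREAN)

// Hypothetical input: each whitespace-delimited word already paired
// with the language that was detected for it.
fun countWordsPerLanguage(words: List<Pair<String, Language>>): Map<Language, Int> {
    val totals = mutableMapOf<Language, Int>()
    for ((word, language) in words) {
        // A logogram substring counts once per character,
        // any other word counts once.
        val amount = if (language in LANGUAGES_SUPPORTING_LOGOGRAMS) {
            word.codePointCount(0, word.length)
        } else {
            1
        }
        totals[language] = (totals[language] ?: 0) + amount
    }
    return totals
}
```

Because the loop handles every word regardless of how many languages occur, it also covers mixed input such as Chinese and English in the same string, which the single-language branch in the diff above does not.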

pemistahl

@Yunin Can you please work on the aspects that I mentioned? You are perfectly right that words for languages supporting logograms are not counted correctly as of yet. However, unfortunately, your approach is not enough in solving this problem and would produce several other bugs. If you put more work into it, I think we will be able to come to a state that can be merged eventually. Thank you! :)

Yunin · 2020-12-15T01:32:35Z

@Yunin Can you please work on the aspects that I mentioned? You are perfectly right that words for languages supporting logograms are not counted correctly as of yet. However, unfortunately, your approach is not enough in solving this problem and would produce several other bugs. If you put more work into it, I think we will be able to come to a state that can be merged eventually. Thank you! :)

Yunin and others added 10 commits December 2, 2020 18:41

fix: Chinese,Japanse and Korean(CJK), word count by no split. if text…

4f51e9f

… contain english , CJK word may cause some error.

Update LanguageDetectorTest.kt

a7923f9

Update LanguageDetectorBuilderJavaTest.java

e2cb0cf

Update MapExtensions.kt

09c450f

Update Constant.kt

792a02c

Update Constant.kt

cb2ba0c

Update LanguageDetector.kt

e99d746

Update LanguageDetectorBuilderJavaTest.java

7b8571f

Update LanguageDetectorBuilderJavaTest.java

210efbb

test:

fceec50

pemistahl reviewed Dec 9, 2020

pemistahl requested changes Dec 9, 2020

pemistahl added the enhancement New feature or request label Dec 9, 2020

pemistahl changed the title ~~fix: Chinese,Japanse and Korean(CJK), word count split by no space . if text…~~ Fix word splitting algorithm for languages supporting logograms Dec 10, 2020

Yunin added 3 commits December 14, 2020 16:52

fix: fix the issue of logogramWordCountIfExist

3a6a5c6

fix: fix the issue of logogramWordCountIfExist

c34296b

fix: fix the issue of ktlint main source set check

fe7bb8b

Yunin closed this Dec 15, 2020

Yunin mentioned this pull request Dec 15, 2020

fix: append space in splitTextIntoWord if text contain logogram #85

Closed
