Fix word splitting algorithm for languages supporting logograms #81
Hi @Yunin, thanks for your pull request. The word splitting algorithm is indeed something that I was planning to improve. I'm going to review your changes now and make some suggestions for improving your code a bit.
| internal fun <T> MutableMap<T, Int>.incrementCounter(key: T) { | ||
| this[key] = this.getOrDefault(key, 0) + 1 | ||
| } | ||
| internal fun <T> MutableMap<T, Int>.incrementCounter(key: T, wordSize: Int) { | ||
| this[key] = this.getOrDefault(key, 0) + wordSize | ||
| } |
Two things here:
- I wouldn't call the new argument wordSize but simply amount, because, in the way you implemented it, wordSize is not the length of a single word but the number of words in the input string.
- The former method in lines 19-21 is now obsolete and can be removed.
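A minimal sketch of how the two suggestions could be combined into one method. The default value `amount = 1` is an assumption added here so that existing single-argument callers would keep compiling; the review itself only asks for the rename and for removing the old overload.

```kotlin
// Sketch only: a single counter helper replacing both overloads.
// The default value for `amount` is an assumption, not part of the review.
internal fun <T> MutableMap<T, Int>.incrementCounter(key: T, amount: Int = 1) {
    // Missing keys start at 0, then the given amount is added.
    this[key] = (this[key] ?: 0) + amount
}
```

With this shape, `counts.incrementCounter(language)` adds 1 and `counts.incrementCounter(language, 5)` adds 5.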
@Yunin Can you please work on the aspects that I mentioned? You are perfectly right that words for languages supporting logograms are not counted correctly as of yet. However, unfortunately, your approach is not enough in solving this problem and would produce several other bugs. If you put more work into it, I think we will be able to come to a state that can be merged eventually. Thank you! :)
Thanks for your reply. The modifications you mentioned are right. I will change the code according to your suggestions. Thanks a lot.
| final LanguageDetector detector = LanguageDetectorBuilder.fromLanguages(GERMAN, ENGLISH, FRENCH).build(); | ||
| assertThat(detector.detectLanguageOf("groß")).isEqualTo(GERMAN); | ||
| final LanguageDetector detector = LanguageDetectorBuilder.fromLanguages(CHINESE, ENGLISH, FRENCH).build(); | ||
| assertThat(detector.detectLanguageOf("上海大学是一个好大学 this is a test.")).isEqualTo(CHINESE); |
- Please add a new test case instead of overwriting an existing one.
- Also, consider adding your new test case to the Kotlin tests, not to the Java tests.
I will add the test cases in a new Kotlin file.
| /** | ||
| * To define the languages word split by NO SPACE , just like CHINESE, JAPANESE, KOREAN and so on. | ||
| */ | ||
| val LANGUAGES_SPLIT_BY_NO_SPACE = listOf(Language.CHINESE, Language.JAPANESE, Language.KOREAN) |
- Please remove the comment. It doesn't add any new information.
- I'd rather call this constant LANGUAGES_SUPPORTING_LOGOGRAMS. A logogram is the correct term for a single character that represents an entire word.
Thanks for your suggestion. The term "logogram" is a much better fit than my wording. I will also remove the comment, because the meaning is already clear from the variable name.
| } else if (wordLanguageCounts.size == 1) { | ||
| val language = wordLanguageCounts.toList().first().first | ||
| var wordSize = if (language in Constant.LANGUAGES_SPLIT_BY_NO_SPACE) wordLanguageCounts[language] else 1 | ||
| if (language in languages) { | ||
| totalLanguageCounts.incrementCounter(language) | ||
| wordSize?.let { totalLanguageCounts.incrementCounter(language, it) } |
- I think the block you put your code in is the wrong place here. This only works for input strings consisting of one and only one language. What if the input string contains substrings in both English and Chinese, for example? Then your algorithm would not work as intended. You actually have to iterate over all languages in wordLanguageCounts and check whether they belong to the subset supporting logograms.
- A Chinese substring consisting of five characters, for example, without any spaces in between the characters would be counted as a single word. Therefore, wordLanguageCounts would contain the count 1 for this language. However, in line 263, you would have to determine the length of the Chinese substring instead of looking into the map wordLanguageCounts. So, I'm afraid, your whole approach in solving this problem ought to be reworked a bit.
- In line 263, there is no need to make wordSize mutable using var. Please make it immutable using val instead.
- In line 265, you check whether wordSize is null or not. However, wordSize can never be null as defined in line 263. So I think it would be safe to just write totalLanguageCounts.incrementCounter(language, wordSize).
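The counting rule described in the first two points can be sketched as follows, assuming every word has already been assigned exactly one language. The `Language` enum, the `countWords` function and its input shape are hypothetical and only serve the illustration; they are not the project's actual code.

```kotlin
// Hypothetical, reduced stand-in for the project's Language enum.
enum class Language { CHINESE, JAPANESE, KOREAN, ENGLISH }

// Constant name as suggested in the review.
val LANGUAGES_SUPPORTING_LOGOGRAMS =
    setOf(Language.CHINESE, Language.JAPANESE, Language.KOREAN)

// Hypothetical helper: sums up a per-language word count for a list of
// (word, language) pairs.
fun countWords(words: List<Pair<String, Language>>): Map<Language, Int> {
    val totals = mutableMapOf<Language, Int>()
    for ((word, language) in words) {
        // For logogram languages, every character counts as one word, so the
        // length of the substring (in code points) is used instead of 1.
        val amount = if (language in LANGUAGES_SUPPORTING_LOGOGRAMS) {
            word.codePointCount(0, word.length)
        } else {
            1
        }
        totals[language] = (totals[language] ?: 0) + amount
    }
    return totals
}
```

Because the loop visits every word, mixed input such as a Chinese substring followed by English words is handled, and `amount` is an immutable, non-null `val`.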
Yes, previously a word was the basic unit for counting, so its size could only be 1. The better way to solve the problem is to fix the method LanguageDetector.splitTextIntoWords. If every word belongs to exactly one language, the problem will be solved. But if I fix LanguageDetector.splitTextIntoWords, it will have to scan every character of the text, and LanguageDetector.detectLanguageWithRules also scans every character, so the same work (scanning all characters) would be done twice. That's why I put the code here. I will find a way to solve this. Thanks again. Next time I will consider your advice before I push the code.
In Chinese, Japanese and Korean (CJK), words are not separated by spaces. If the text also contains English, the CJK words may be counted incorrectly.
For example, "上海大学是一个好大学 this is a test." is split into the words "上海大学是一个好大学", "this", "is", "a", "test"
and is detected as ENGLISH. But the text actually belongs to CHINESE, because CJK words are not separated by spaces.
"上海大学是一个好大学 this is a test." needs to be split into the words "上", "海", "大", "学", "是", "一", "个", "好", "大", "学", "this", "is", "a", "test".
We need to specify for each language whether its words are separated by spaces, and then correct the word count according to the language.
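The desired splitting can be illustrated with a small, self-contained sketch. It is not the library's splitTextIntoWords; the function names are hypothetical, and treating only characters of the Unicode Han script as logograms is a simplifying assumption (kana and Hangul are kept together like ordinary letters here).

```kotlin
// Assumption: a logogram is any code point in the Unicode Han script.
fun isLogogram(codePoint: Int): Boolean =
    Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN

// Hypothetical splitter: every logogram becomes its own word, other letters
// are grouped into words, and everything else acts as a separator.
fun splitIntoWords(text: String): List<String> {
    val words = mutableListOf<String>()
    val current = StringBuilder()

    // Moves the word collected so far (if any) into the result list.
    fun flush() {
        if (current.isNotEmpty()) {
            words.add(current.toString())
            current.setLength(0)
        }
    }

    var index = 0
    while (index < text.length) {
        val codePoint = text.codePointAt(index)
        index += Character.charCount(codePoint)
        when {
            isLogogram(codePoint) -> {
                flush()
                words.add(String(Character.toChars(codePoint)))
            }
            Character.isLetter(codePoint) -> current.appendCodePoint(codePoint)
            else -> flush()
        }
    }
    flush()
    return words
}
```

For the example text above, this yields ten single-character Chinese words followed by "this", "is", "a", "test", so the Chinese part outweighs the English part in a per-word count.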