Conversation

@Yunin commented Dec 3, 2020

Chinese, Japanese and Korean (CJK) text is not split into words by spaces, so the word count can go wrong when a text mixes CJK with English.
For example, "上海大学是一个好大学 this is a test." is split into the words "上海大学是一个好大学", "this", "is", "a", "test"
and is detected as ENGLISH. But the text is actually CHINESE, because CJK words are not separated by spaces.
"上海大学是一个好大学 this is a test." needs to be split into the words "上", "海", "大", "学", "是", "一", "个", "好", "大", "学", "this", "is", "a", "test".

We need to specify which languages are split by spaces and which are not. Then we can correct the word count according to the language.
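
To illustrate the splitting described above, here is a minimal Kotlin sketch. It is not the library's implementation: the function name `splitIntoWords` is hypothetical, and it assumes that only characters in the Unicode Han script count as one word each, while other letters are grouped into words at non-letter boundaries.

```kotlin
import java.lang.Character.UnicodeScript

// Hypothetical helper, not part of the library.
// Assumption: only Han characters are treated as single-character words.
fun splitIntoWords(text: String): List<String> {
    val words = mutableListOf<String>()
    val current = StringBuilder()

    // Emits the word collected so far, if any.
    fun flush() {
        if (current.isNotEmpty()) {
            words += current.toString()
            current.setLength(0)
        }
    }

    var i = 0
    while (i < text.length) {
        val codePoint = text.codePointAt(i)
        when {
            // Each logogram becomes its own word.
            UnicodeScript.of(codePoint) == UnicodeScript.HAN -> {
                flush()
                words += String(Character.toChars(codePoint))
            }
            // Other letters accumulate into the current word.
            Character.isLetter(codePoint) -> current.appendCodePoint(codePoint)
            // Spaces and punctuation end the current word.
            else -> flush()
        }
        i += Character.charCount(codePoint)
    }
    flush()
    return words
}
```

Called with "上海大学是一个好大学 this is a test.", this sketch yields ten single-character Chinese words followed by "this", "is", "a", "test", so the Chinese part is no longer counted as one word.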

@pemistahl
Owner

Hi @Yunin, thanks for your pull request.

The word splitting algorithm is indeed something that I was planning to improve. I'm going to review your changes now and make some suggestions for improving your code a bit.

Comment on lines 19 to 24
  internal fun <T> MutableMap<T, Int>.incrementCounter(key: T) {
      this[key] = this.getOrDefault(key, 0) + 1
  }
+ internal fun <T> MutableMap<T, Int>.incrementCounter(key: T, wordSize: Int) {
+     this[key] = this.getOrDefault(key, 0) + wordSize
+ }
Owner

Two things here:

  1. I wouldn't call the new argument wordSize but simply amount. Because, in the way you implemented it, wordSize is not the length of a single word but the number of words in the input string.

  2. The former method in lines 19-21 is now obsolete and can be removed.
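
One way to act on both points is sketched below. The default value for `amount` is an assumption that is not stated in the review; it only keeps existing one-argument calls compiling once the former overload is removed.

```kotlin
// Sketch only: a single extension function replaces both overloads.
// Assumption: `amount` defaults to 1 so that calls like incrementCounter(key) still work.
internal fun <T> MutableMap<T, Int>.incrementCounter(key: T, amount: Int = 1) {
    this[key] = this.getOrDefault(key, 0) + amount
}
```

With this shape, `counts.incrementCounter(language)` adds 1 and `counts.incrementCounter(language, 5)` adds 5 to the same entry.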

Author

@Yunin Can you please work on the aspects that I mentioned? You are perfectly right that words for languages supporting logograms are not counted correctly as of yet. However, unfortunately, your approach is not enough in solving this problem and would produce several other bugs. If you put more work into it, I think we will be able to come to a state that can be merged eventually. Thank you! :)

Thanks for your reply. The modifications you mentioned are right. I will change the code according to your suggestions. Thanks a lot.

Comment on lines 30 to 31
- final LanguageDetector detector = LanguageDetectorBuilder.fromLanguages(GERMAN, ENGLISH, FRENCH).build();
- assertThat(detector.detectLanguageOf("groß")).isEqualTo(GERMAN);
+ final LanguageDetector detector = LanguageDetectorBuilder.fromLanguages(CHINESE, ENGLISH, FRENCH).build();
+ assertThat(detector.detectLanguageOf("上海大学是一个好大学 this is a test.")).isEqualTo(CHINESE);
Owner

  1. Please add a new test case instead of overwriting an existing one.

  2. Also, consider adding your new test case to the Kotlin tests, not to the Java tests.

Author

I will add the test cases in a new Kotlin file.

Comment on lines 26 to 30

/**
* To define the languages word split by NO SPACE , just like CHINESE, JAPANESE, KOREAN and so on.
*/
val LANGUAGES_SPLIT_BY_NO_SPACE = listOf(Language.CHINESE, Language.JAPANESE, Language.KOREAN)
Owner

  1. Please remove the comment. It doesn't add any new information.

  2. I'd rather call this constant LANGUAGES_SUPPORTING_LOGOGRAMS. A logogram is the correct term for a single character that represents an entire word.

Author

Thanks for your suggestion. The term logogram is much better than languages. I will also remove the comment, because the meaning is clear from the variable name.

Comment on lines 261 to 265
  } else if (wordLanguageCounts.size == 1) {
      val language = wordLanguageCounts.toList().first().first
+     var wordSize = if (language in Constant.LANGUAGES_SPLIT_BY_NO_SPACE) wordLanguageCounts[language] else 1
      if (language in languages) {
-         totalLanguageCounts.incrementCounter(language)
+         wordSize?.let { totalLanguageCounts.incrementCounter(language, it) }
Owner

  1. I think the block you put your code in is the wrong place here. This only works for input strings consisting of one and only one language. What if the input string contains substrings in both English and Chinese, for example? Then your algorithm would not work as intended. You actually have to iterate over all languages in wordLanguageCounts and check whether they belong to the subset supporting logograms.

  2. A Chinese substring consisting of five characters, for example, without any spaces in between the characters would be counted as a single word. Therefore, wordLanguageCounts would contain the count 1 for this language. However, in line 263, you would have to determine the length of the Chinese substring instead of looking into the map wordLanguageCounts. So, I'm afraid, your whole approach in solving this problem ought to be reworked a bit.

  3. In line 263, there is no need to make wordSize mutable using var. Please make it immutable using val instead.

  4. In line 265, you check whether wordSize is null or not. However, wordSize can never be null as defined in line 263. So I think it would be safe to just write totalLanguageCounts.incrementCounter(language, wordSize).
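
The four points above can be sketched roughly as follows. This is not the library's code: the `Language` enum, the `addWordToTotals` function and its parameters are hypothetical stand-ins chosen so the example is self-contained. It iterates over every language found for a word, uses the length of the word rather than the map entry for logogram languages, and keeps the amount as an immutable, non-null `val`.

```kotlin
// Hypothetical stand-ins for the library's types, kept minimal for the sketch.
enum class Language { CHINESE, JAPANESE, KOREAN, ENGLISH }

val LANGUAGES_SUPPORTING_LOGOGRAMS = setOf(Language.CHINESE, Language.JAPANESE, Language.KOREAN)

fun <T> MutableMap<T, Int>.incrementCounter(key: T, amount: Int) {
    this[key] = this.getOrDefault(key, 0) + amount
}

// Assumed shape: `word` is one unit produced by the word splitter and
// `wordLanguageCounts` holds the languages detected for that unit.
fun addWordToTotals(
    word: String,
    wordLanguageCounts: Map<Language, Int>,
    languages: Set<Language>,
    totalLanguageCounts: MutableMap<Language, Int>
) {
    // Point 1: look at every detected language, not only the single-language case.
    for (language in wordLanguageCounts.keys) {
        if (language !in languages) continue
        // Point 2: for logogram languages, count the characters of the substring.
        // Points 3 and 4: the amount is an immutable, non-null Int.
        val amount = if (language in LANGUAGES_SUPPORTING_LOGOGRAMS) {
            word.codePointCount(0, word.length)
        } else {
            1
        }
        totalLanguageCounts.incrementCounter(language, amount)
    }
}
```

In this sketch a four-character Chinese substring contributes 4 to the Chinese total, while an English word still contributes 1. How mixed-script substrings should be attributed is left open here.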

Author

Yes, previously a word was the basic counting unit, so its size could only be 1. A better fix would be to correct LanguageDetector.splitTextIntoWords: if every word belongs to exactly one language, the problem is solved. But if I correct LanguageDetector.splitTextIntoWords, it has to scan every character of the text, and LanguageDetector.detectLanguageWithRules scans every character as well. Scanning all characters twice doubles the cost of the same work, which is why I put the code here. I will find a way to solve this. Thanks again. Next time I will take your advice into account before I push the code.

Owner

@pemistahl left a comment

@Yunin Can you please work on the aspects that I mentioned? You are perfectly right that words for languages supporting logograms are not counted correctly as of yet. However, unfortunately, your approach is not enough in solving this problem and would produce several other bugs. If you put more work into it, I think we will be able to come to a state that can be merged eventually. Thank you! :)

@pemistahl added the enhancement label Dec 9, 2020
@pemistahl changed the title from "fix: Chinese,Japanse and Korean(CJK), word count split by no space . if text…" to "Fix word splitting algorithm for languages supporting logograms" Dec 10, 2020
@Yunin
Author

Yunin commented Dec 15, 2020

@Yunin Can you please work on the aspects that I mentioned? You are perfectly right that words for languages supporting logograms are not counted correctly as of yet. However, unfortunately, your approach is not enough in solving this problem and would produce several other bugs. If you put more work into it, I think we will be able to come to a state that can be merged eventually. Thank you! :)
