I can confirm this PR works for a previously broken case in which a few Japanese characters were mixed in with Chinese characters. Maybe we can proceed and release a new version.
We'd love to get this PR integrated for our work with @spamscanner v7.
Can confirm that this branch worked with one of the cases I was running into. I've installed from that fork/branch for now, but it would be great to get this merged and released. Appreciate everyone involved and the time you commit to these things.
Hello, after running into issue #84 again, I decided to share my attempt to fix it (or rather, improve the situation a bit).
The approach proposed by this PR leverages the delta between the `cmn` score and the `jpn` one. The issue in #84 is caused by the fact that `cmn` doesn't match kana (Japanese-only characters), but `jpn` matches (many) Chinese characters, so `jpn` ends up with a higher score than `cmn`.
In particular, the example sentence mentioned in the issue has a `0.86` score on `jpn` and a `0.74` score on `cmn`, due to the presence of 5 katakana characters out of a total of 42 characters. This means that the delta is around 12% (`(0.86 - 0.74) * 100`).
This change requires the `jpn` score to be at least `0.15` higher than the `cmn` score; otherwise `cmn` gets priority. This seems reasonable, as we can consider anything above 15% (around 1 in every 6 characters) "a fair amount of kana".
With this new approach, the example that I had originally raised as "this should be detected as Japanese" in #77 would fail and be detected as Mandarin instead, because it contains just 1 kana out of a total of 11 characters. However, that example was pretty far-fetched, and such a kanji-dense sentence is unlikely to appear in regular Japanese text. And, as usual, this disclaimer always applies...
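To make the rule above concrete, here is a minimal, language-agnostic sketch of the tie-break in Python. It is not the code in this PR: the function name `pick_cjk`, the constant `MIN_JPN_LEAD`, and the dictionary of scores are hypothetical, and the only values taken from the description are the `0.15` threshold and the `0.86` / `0.74` scores from #84.

```python
# Hypothetical sketch of the tie-break described above; the names and the
# score-dictionary shape are assumptions, not the library's actual API.
MIN_JPN_LEAD = 0.15  # minimum lead of jpn over cmn proposed in this PR


def pick_cjk(scores, min_lead=MIN_JPN_LEAD):
    """Return 'jpn' only if its score leads 'cmn' by at least min_lead."""
    jpn = scores.get("jpn", 0.0)
    cmn = scores.get("cmn", 0.0)
    return "jpn" if jpn - cmn >= min_lead else "cmn"


# Scores reported for the sentence in #84: the lead is about 0.12,
# which is below the 0.15 threshold, so cmn gets priority.
print(pick_cjk({"jpn": 0.86, "cmn": 0.74}))
```

With those scores the lead is below the threshold, so the sketch picks `cmn`, which is the behaviour #84 asks for. Raising or lowering `min_lead` trades off misclassifying kanji-dense Japanese against misclassifying Chinese text that contains a few kana.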
This approach is still fragile when compared to what machine translators (like Google Translate) do, but it was the best solution I could think of without resorting to grammar checks (which is what Google Translate likely does), since kana are mostly used for grammatical elements in Japanese.
Also, this is missing a similar check on Korean vs Mandarin. Unfortunately, I do not know Korean, so I cannot add this check myself.
I'm open to suggestions/opinions on the proposed approach, especially from people involved in the original discussion (if they are still around and interested in the topic). @wooorm @kewang @niftylettuce
Fixes #84.