polm commented Jul 18, 2017

GitHub isn't showing the file properly, so to be clear, I changed this line:

SYMBOL,1283,1283,17585,名詞,サ変接続,*,*,*,*,*

to this:

SYMBOL,1283,1283,17585,記号,一般,*,*,*,*,*

The previous setting makes no sense and has confused many people. I guess it was a mistake?

The jumandic unk.def did not seem to have this problem.
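
To make the difference between the two entries above easier to see, here is a minimal sketch that splits an unk.def line into its parts. It assumes the usual MeCab dictionary CSV layout (character category, left context ID, right context ID, cost, then the feature columns); `parse_unk_line` is a hypothetical helper written for illustration, not part of MeCab.

```python
def parse_unk_line(line):
    """Split one unk.def entry into named parts.

    Assumed layout: category,left_id,right_id,cost,feature1,feature2,...
    """
    fields = line.strip().split(",")
    return {
        "category": fields[0],
        "left_id": int(fields[1]),
        "right_id": int(fields[2]),
        "cost": int(fields[3]),
        "features": fields[4:],  # part-of-speech columns come first
    }


old = parse_unk_line("SYMBOL,1283,1283,17585,名詞,サ変接続,*,*,*,*,*")
new = parse_unk_line("SYMBOL,1283,1283,17585,記号,一般,*,*,*,*,*")

# Report which feature columns differ between the old and new entries.
changed = [
    (i, a, b)
    for i, (a, b) in enumerate(zip(old["features"], new["features"]))
    if a != b
]
for index, before, after in changed:
    print(f"feature column {index}: {before} -> {after}")
```

Only the first two feature columns (the part of speech and its subcategory) differ; the category name, context IDs, and cost are the same in both lines.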

If there's anything I should improve, please let me know.

Many thanks for providing Mecab.

Unknown symbols are not nouns. -POLM

polm commented Mar 16, 2019

Hello. This PR has been open for over a year, and it would be great to have it addressed one way or another.

I will add that I realized why the current setting is in place. There's a footnote in "Applying Conditional Random Fields to Japanese Morphological Analysis" that explains it:

JUMAN assigns “unknown POS” to the words not seen in
the lexicon. We simply replace the POS of these words with
the default POS, Noun-SAHEN.

While that sounds reasonable, the articles I linked above and the issue that has been linked to this PR since I originally posted it both show that this setting causes confusion, so I still think it should be changed.

Any feedback at all would be appreciated.
