Text classification workflow #1025
Conversation
Force-pushed dd7153f to edd781d
@gheinrich, in encode_entry(), if one character is encoded as k and another is encoded as k+1, does that imply these two characters are close to each other? Encoding characters into scalars, rather than one-hot encoding them, seems uncommon. For text classification, word2vec seems a more popular way to encode documents into a vector space.
Hi @IsaacYangSLA, in the example network that I provide, the first layer does one-hot encoding of the characters. I chose to do the one-hot encoding in the network rather than in the dataset because that results in a much more compact dataset, especially if you have a large alphabet. It's just my opinion, but I think word2vec somewhat defeats the purpose of deep learning: you need logic outside of the network, like stemming algorithms, to identify words. I suspect the popularity of word2vec comes down to limited memory/compute capabilities. I think a character-level representation of the data should ultimately be more powerful, similar to how deep neural nets outperform HOG+SVM in image classification.
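A minimal sketch of the scheme described above, assuming a made-up alphabet and illustrative helper names (this is not the actual plugin code):

```python
import numpy as np

# Hypothetical alphabet; the real plugin defines its own.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .,;!?:'\"()-"
CHAR_TO_INDEX = {c: i + 1 for i, c in enumerate(ALPHABET)}  # 0 = unknown/padding

def encode_entry(text, max_len=1024):
    """Dataset side: store each character as a 1-byte scalar index."""
    indices = np.zeros(max_len, dtype=np.uint8)
    for i, c in enumerate(text.lower()[:max_len]):
        indices[i] = CHAR_TO_INDEX.get(c, 0)
    return indices

def one_hot(indices, alphabet_size=len(ALPHABET) + 1):
    """Network side: the first layer expands scalar indices to one-hot vectors."""
    out = np.zeros((len(indices), alphabet_size), dtype=np.float32)
    out[np.arange(len(indices)), indices] = 1.0
    return out

doc = encode_entry("hello world")  # 1 KB per 1024-character document on disk
expanded = one_hot(doc)            # ~200 KB as float32 if it were stored instead
```

Storing scalar indices and expanding them inside the network keeps the dataset roughly alphabet-size times smaller than storing one-hot vectors directly.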
Hi @gheinrich, thanks for the explanation of the first layer. That's a better design, I agree. As for word2vec versus characters, more people seem to use word2vec in NLP applications, and the idea behind it is also reasonable, i.e. the concept of DC - USA + France ~= Paris. However, I see increasing research on character-based text processing; maybe in a few years it will outperform word2vec in NLP applications.
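As a toy illustration of that analogy arithmetic (2-d vectors constructed by hand so that "capital-of" is a consistent offset; real word2vec embeddings learn such structure from co-occurrence statistics):

```python
import numpy as np

# Hand-built vectors: "capital-of" is modeled as a fixed offset.
country = {"USA": np.array([1.0, 0.0]), "France": np.array([0.0, 1.0])}
capital_offset = np.array([0.5, 0.5])
capital = {"DC": country["USA"] + capital_offset,
           "Paris": country["France"] + capital_offset}

# DC - USA + France lands (here, exactly) on Paris.
query = capital["DC"] - country["USA"] + country["France"]
nearest = min(capital, key=lambda k: np.linalg.norm(capital[k] - query))
print(nearest)  # Paris
```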
Force-pushed edd781d to 74039af
Rebased on tip of master branch.
```python
    author="Greg Heinrich",
    description=("A data ingestion plugin for text classification"),
    long_description=read('README'),
    license="Apache",
```
I didn't really think of it. Good point, I'll use the same license as the top-level setup.py.
done on latest commit
```python
scores = output_data[output_data.keys()[0]].astype('float32')

if self.terminal_layer_type == "logsoftmax":
    scores = np.exp(scores)
```
Could you just check to see if the values sum to 1 instead of having this form field?
Good idea (I guess I can also check whether the values are positive or negative)!
done on latest commit
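A sketch of the two checks discussed in this thread (illustrative, not the plugin source): softmax outputs are non-negative and sum to 1 per sample, whereas log-softmax outputs are log-probabilities and therefore all non-positive.

```python
import numpy as np

def looks_like_logsoftmax(scores):
    """Heuristic: log-softmax outputs are log(p) with p in (0, 1],
    so every value is <= 0; softmax outputs always contain a positive entry."""
    return np.max(scores) < 0

probs = np.array([0.7, 0.2, 0.1])
print(np.isclose(probs.sum(), 1.0))          # True: already probabilities
print(looks_like_logsoftmax(probs))          # False: no conversion needed
print(looks_like_logsoftmax(np.log(probs)))  # True: np.exp() recovers probs
```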
lukeyeager left a comment:
I ran through the text classification example using these plugins and it worked great. I'd like to see the logsoftmax thing removed before merge since it will make this a bit easier to use.
Force-pushed 74039af to f8447b0
Force-pushed f8447b0 to e7745fa
I have updated the text classification example to show how to use the plug-ins.
```python
if np.max(scores) < 0:
    # all values are negative, so the terminal layer is a logsoftmax:
    # take the exponential to recover probabilities
    scores = np.exp(scores)
```
```sh
$ pip install $DIGITS_ROOT/plugins/data/textClassification
$ pip install $DIGITS_ROOT/plugins/view/textClassification
```
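Once installed, plugins of this kind are typically discovered through setuptools entry points. A quick way to check what got registered (the group names below are assumptions for illustration; check each plugin's setup.py for the groups it actually uses):

```python
# Assumed entry-point group names, not confirmed from the DIGITS source.
import pkg_resources

for group in ('digits.plugins.data', 'digits.plugins.view'):
    for entry_point in pkg_resources.iter_entry_points(group=group):
        print(group, '->', entry_point)
```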
…-workflow Text classification workflow
Depends on #927 (support for plug-ins) and #1024 (modularization of inference).