Document classification example: One should use 'latin-1' encoding #8229
Comments
Pull request welcome
…On 24 January 2017 at 21:43, insideloop ***@***.***> wrote:
There seems to be a bug in the documentation of the classification of text
documents: http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html#sphx-glr-auto-examples-text-mlcomp-sparse-document-classification-py
The files are opened as utf-8 which leads to a bug. I have solved the
issue changing "open(f)" into "open(f, encoding='latin1')".
|
Yes, but `open` does not accept an encoding parameter in Python 2; maybe initializing TfidfVectorizer with `input="filename"` and letting it handle the encoding would be better?
Also, what's the difference between the MLComp 20 newsgroups in this example (http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html#sphx-glr-auto-examples-text-mlcomp-sparse-document-classification-py) and the 20 newsgroups dataset used in the other text classification example (http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html#sphx-glr-auto-examples-text-document-classification-20newsgroups-py), obtained with `fetch_20newsgroups`? Isn't that the same dataset?
|
Yes, using `input='filename'` seems reasonable.
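The `input='filename'` approach suggested above can be sketched as follows. This is a minimal, self-contained illustration: the throwaway temp files and their contents stand in for the newsgroups posts and are not part of the actual example.

```python
import os
import tempfile
from sklearn.feature_extraction.text import TfidfVectorizer

# Two throwaway latin-1 encoded files standing in for the newsgroups posts
tmpdir = tempfile.mkdtemp()
paths = []
for i, text in enumerate(["caf\xe9 news", "more caf\xe9 talk"]):
    path = os.path.join(tmpdir, "doc%d.txt" % i)
    with open(path, "wb") as f:
        f.write(text.encode("latin-1"))
    paths.append(path)

# Let the vectorizer open and decode the files itself, so no explicit
# open(f, encoding=...) call (unavailable in Python 2) is needed
vectorizer = TfidfVectorizer(input="filename", encoding="latin-1")
X = vectorizer.fit_transform(paths)
print(X.shape)  # one row per file
```

With `input="filename"`, the vectorizer reads each file as bytes and decodes with the given `encoding`, sidestepping the Python 2 `open` limitation entirely.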
|
I was wondering exactly the same thing and I don't know the answer. |
@lesteve It looks like it is almost the same dataset. The MLComp 20 newsgroup example (http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html#sphx-glr-auto-examples-text-mlcomp-sparse-document-classification-py) indicates that it uses the 20news-18828.tar.gz dataset from http://qwone.com/~jason/20Newsgroups/. This corresponds to:
20news-18828.tar.gz - 20 Newsgroups; duplicates removed, only "From" and "Subject" headers (18828 documents)
while the original 20 Newsgroups dataset, which can be obtained with `sklearn.datasets.fetch_20newsgroups`, is:
20news-19997.tar.gz - Original 20 Newsgroups data set (18846 documents)
where headers can be removed with `remove=('headers',)`.
I don't think that a few duplicates and differences in the kept header fields are fundamental to demonstrating text categorization in this example, which makes the MLComp 20 newsgroup example somewhat redundant with the other text classification example (http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html#sphx-glr-auto-examples-text-document-classification-20newsgroups-py). Maybe the MLComp 20 newsgroup example should be rewritten to use `fetch_20newsgroups`, or made less redundant?
Also, it looks like neither of those examples is run in the example gallery; is that because they need to download an external dataset and take longer to run?
|
Yes, it might have been due to slow download. But also, we only run things titled `plot_*`.
|
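For reference, the `remove` usage mentioned above can be sketched as below. The actual fetch call is left commented out to avoid downloading the dataset; the snippet only checks that the loader exposes the relevant parameters.

```python
from inspect import signature
from sklearn.datasets import fetch_20newsgroups

# Metadata can be stripped at load time, e.g.:
#   data = fetch_20newsgroups(subset="train",
#                             remove=("headers", "footers", "quotes"))
# (commented out here to avoid the dataset download)

# Check that the loader accepts the parameters used above
params = signature(fetch_20newsgroups).parameters
print("remove" in params, "subset" in params)
```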
I would remove the MLComp 20 newsgroup example. Quickly looking at the source code, both examples are very similar. I would be in favour of deprecating
Another thing worth investigating would be to look at the examples that are not run (i.e. the ones not starting with `plot_`). |
OK, I can make a PR for it.
It is indeed redundant with |
this can be closed as the example was removed, right? |
There seems to be a bug in the documentation of the classification of text documents: http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html#sphx-glr-auto-examples-text-mlcomp-sparse-document-classification-py
The files are opened as utf-8 which leads to a bug. I have solved the issue changing "open(f)" into "open(f, encoding='latin1')".
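The failure mode described here can be reproduced with a short stdlib-only sketch (the file name and contents are arbitrary):

```python
import os
import tempfile

# Write bytes that are valid latin-1 but NOT valid UTF-8
# (0xE9 is "é" in latin-1; as a lone byte it is malformed UTF-8)
path = os.path.join(tempfile.mkdtemp(), "doc.txt")
with open(path, "wb") as f:
    f.write("r\xe9sum\xe9".encode("latin-1"))

# Opening with the utf-8 codec fails on the stray 0xE9 byte...
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# ...while latin-1, which maps every possible byte, always succeeds
with open(path, encoding="latin-1") as f:
    text = f.read()

print(decoded_as_utf8, text)  # False résumé
```

This is why latin-1 is a safe fallback for this dataset: every byte sequence decodes, whereas utf-8 raises on bytes the newsgroups files actually contain.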