Document classification example: One should use 'latin-1' encoding #8229


Closed
FrancoisFayard opened this issue Jan 24, 2017 · 9 comments


@FrancoisFayard

There seems to be a bug in the documentation example on classification of text documents: http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html#sphx-glr-auto-examples-text-mlcomp-sparse-document-classification-py

The files are opened as UTF-8, which raises a UnicodeDecodeError. I solved the issue by changing "open(f)" to "open(f, encoding='latin1')".
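For context, a minimal sketch of the failure mode and the proposed fix, using a hypothetical throwaway file (not one of the actual dataset files):

```python
import os
import tempfile

# A byte that is valid Latin-1 but not valid UTF-8: 0xE9 is 'é' in Latin-1.
path = os.path.join(tempfile.mkdtemp(), "doc.txt")
with open(path, "wb") as f:
    f.write(b"From: caf\xe9@example.com")

try:
    with open(path) as f:  # text mode; the default encoding is often UTF-8
        f.read()
except UnicodeDecodeError:
    pass  # the error the example runs into when decoding as UTF-8

with open(path, encoding="latin1") as f:  # the proposed fix
    text = f.read()
print(text)
```

Whether the unguarded `open(path)` actually fails depends on the platform's default encoding, which is why the read is wrapped in a try/except here.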

@jnothman
Member

jnothman commented Jan 24, 2017 via email

@rth
Member

rth commented Jan 24, 2017

Yes, but open does not accept an encoding parameter in Python 2; maybe initializing TfidfVectorizer with input="filename" and letting it handle the encoding would be better?
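A sketch of that alternative, with made-up file names and contents standing in for the dataset (TfidfVectorizer's input and encoding parameters are real; everything else here is illustrative):

```python
import os
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Write two small Latin-1 encoded files as hypothetical stand-ins for
# the dataset files.
tmpdir = tempfile.mkdtemp()
paths = []
for i, doc in enumerate(["caf\xe9 news", "more caf\xe9 text"]):
    p = os.path.join(tmpdir, "doc%d.txt" % i)
    with open(p, "wb") as f:
        f.write(doc.encode("latin1"))
    paths.append(p)

# With input="filename", fit_transform reads the files itself, and the
# encoding parameter tells it how to decode the raw bytes, so no
# explicit open() call (Python 2 or 3) is needed in the example.
vectorizer = TfidfVectorizer(input="filename", encoding="latin1")
X = vectorizer.fit_transform(paths)
print(X.shape)  # one row per file
```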

Also, what is the difference between the MLComp 20 newsgroups dataset in this example and the 20 newsgroups dataset used in the other text classification example, obtained with fetch_20newsgroups? Isn't that the same dataset?

@jnothman
Member

jnothman commented Jan 24, 2017 via email

@lesteve
Member

lesteve commented Jan 24, 2017

MLComp 20 newsgroups in this example and the 20 newsgroups dataset used in the other text classification example obtained with `fetch_20newsgroups`

I was wondering exactly the same thing and I don't know the answer.

@rth
Member

rth commented Jan 31, 2017

@lesteve It looks like it is almost the same dataset: the MLComp 20 newsgroups example indicates that it uses the 20news-18828.tar.gz dataset from http://qwone.com/~jason/20Newsgroups/. This corresponds to

20news-18828.tar.gz 20 Newsgroups; duplicates removed, only "From" and "Subject" headers (18828 documents)

while the original 20 Newsgroups dataset, which can be obtained with sklearn.datasets.fetch_20newsgroups, is

20news-19997.tar.gz - Original 20 Newsgroups data set (18846 documents)

where headers can be removed with remove=('headers',).
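For reference, a sketch of that call, guarded with download_if_missing=False so it only touches a locally cached copy instead of downloading (both parameters exist on fetch_20newsgroups):

```python
from sklearn.datasets import fetch_20newsgroups

try:
    # remove=("headers",) strips the newsgroup headers from each post,
    # roughly approximating the header-stripped 18828-document variant.
    data = fetch_20newsgroups(subset="train",
                              remove=("headers",),
                              download_if_missing=False)
    print(len(data.data), "documents")
except IOError:
    # Raised when the dataset is not cached and downloading is disabled.
    print("20 newsgroups data not cached locally")
```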

I don't think that a few duplicates and differences in the kept header fields are fundamental to demonstrating text categorization in this example, which makes the MLComp 20 newsgroups example somewhat redundant with the other text classification example. Maybe the MLComp 20 newsgroups example should be rewritten to use fetch_20newsgroups or made less redundant?

Also, it looks like neither of those examples is run in the example gallery; is that because they need to download an external dataset and take longer to run?

@jnothman
Member

jnothman commented Jan 31, 2017 via email

@lesteve
Member

lesteve commented Feb 1, 2017

Maybe the MLComp 20 newsgroups example should be rewritten to use fetch_20newsgroups or made less redundant?

I would remove the MLComp 20 newsgroups example. Quickly looking at the source code, both examples are very similar. I would be in favour of deprecating load_mlcomp, which is not used anywhere else, is very cumbersome to use (you need to manually download the data and set an environment variable), and has no added value compared to fetch_20newsgroups.

Also, it looks like neither of those examples is run in the example gallery; is that because they need to download an external dataset and take longer to run?

Another thing worth investigating would be to look at the examples that are not run (i.e. the ones whose names do not start with plot_) and check how long they take to run. I tried a few and they did not take too much time. Renaming these examples to start with plot_ would make them better tested and more accessible through a gallery thumbnail.

@rth
Member

rth commented Feb 1, 2017

I would remove the MLComp 20 newsgroup example. Quickly looking at the source code both examples are very similar.

OK, I can make a PR for it.

I would be in favour of deprecating load_mlcomp which is not used anywhere else and is very cumbersome to use (need to manually download the data and set an environment variable) and does not have any added value compared to fetch_20newsgroups.

It is indeed redundant with fetch_20newsgroups for the 20 newsgroups dataset; however, presumably, this function can be used to load any dataset from http://mlcomp.org/datasets (at least according to the docstring). So maybe just raise a deprecation warning if name_or_id corresponds to the 20 newsgroups dataset?
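A toy sketch of that idea; the stub body and the dataset identifiers here are placeholders for illustration, not the real load_mlcomp implementation:

```python
import warnings

def load_mlcomp(name_or_id, set_="raw"):
    # Warn only when the 20 newsgroups dataset is requested; other
    # MLComp datasets keep working unchanged. "379" and "20news-18828"
    # are assumed identifiers, used here only for illustration.
    if str(name_or_id) in ("379", "20news-18828"):
        warnings.warn(
            "Loading 20 newsgroups through load_mlcomp is deprecated; "
            "use sklearn.datasets.fetch_20newsgroups instead.",
            DeprecationWarning,
        )
    return None  # the real loader would return the dataset bunch here
```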

@amueller
Member

amueller commented Mar 4, 2017

This can be closed as the example was removed, right?

@jnothman jnothman closed this as completed Mar 4, 2017

5 participants