Document classification example: One should use 'latin-1' encoding #8229


Closed
FrancoisFayard opened this issue Jan 24, 2017 · 9 comments


@FrancoisFayard

There seems to be a bug in the documentation example on classification of text documents: http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html#sphx-glr-auto-examples-text-mlcomp-sparse-document-classification-py

The files are opened as UTF-8, which raises a UnicodeDecodeError. I solved the issue by changing "open(f)" to "open(f, encoding='latin1')".
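For context, a minimal sketch of the failure mode and the proposed fix, using a hypothetical throwaway file (not one of the actual dataset files):

```python
import os
import tempfile

# A byte that is valid Latin-1 but not valid UTF-8: 0xE9 is 'é' in Latin-1.
path = os.path.join(tempfile.mkdtemp(), "doc.txt")
with open(path, "wb") as f:
    f.write(b"From: caf\xe9@example.com")

try:
    with open(path) as f:  # text mode; the default encoding is often UTF-8
        f.read()
except UnicodeDecodeError:
    pass  # the error the example runs into when decoding as UTF-8

with open(path, encoding="latin1") as f:  # the proposed fix
    text = f.read()
print(text)
```

Whether the unguarded `open(path)` actually fails depends on the platform's default encoding, which is why the read is wrapped in a try/except here.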

@jnothman
Member

jnothman commented Jan 24, 2017 via email

@rth
Member

rth commented Jan 24, 2017

Yes, but open does not accept an encoding parameter in Python 2; maybe initializing TfidfVectorizer with input="filename" and letting it handle the encoding would be better?
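A sketch of that alternative, with made-up file names and contents standing in for the dataset (TfidfVectorizer's input and encoding parameters are real; everything else here is illustrative):

```python
import os
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Write two small Latin-1 encoded files as hypothetical stand-ins for
# the dataset files.
tmpdir = tempfile.mkdtemp()
paths = []
for i, doc in enumerate(["caf\xe9 news", "more caf\xe9 text"]):
    p = os.path.join(tmpdir, "doc%d.txt" % i)
    with open(p, "wb") as f:
        f.write(doc.encode("latin1"))
    paths.append(p)

# With input="filename", fit_transform reads the files itself, and the
# encoding parameter tells it how to decode the raw bytes, so no
# explicit open() call (Python 2 or 3) is needed in the example.
vectorizer = TfidfVectorizer(input="filename", encoding="latin1")
X = vectorizer.fit_transform(paths)
print(X.shape)  # one row per file
```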

Also, what is the difference between the MLComp 20 newsgroups dataset in this example and the 20 newsgroups dataset used in the other text classification example, obtained with fetch_20newsgroups? Isn't that the same dataset?

@jnothman
Member

jnothman commented Jan 24, 2017 via email

@lesteve
Member

lesteve commented Jan 24, 2017

MLComp 20 newsgroups in this example and the 20 newsgroups dataset used in the other text classification example obtained with `fetch_20newsgroups`

I was wondering exactly the same thing and I don't know the answer.

@rth
Member

rth commented Jan 31, 2017

@lesteve It looks like it is almost the same dataset: the MLComp 20 newsgroups example indicates that it uses the 20news-18828.tar.gz dataset from http://qwone.com/~jason/20Newsgroups/. This corresponds to

20news-18828.tar.gz 20 Newsgroups; duplicates removed, only "From" and "Subject" headers (18828 documents)

while the original 20 Newsgroups dataset, which can be obtained with sklearn.datasets.fetch_20newsgroups, is

20news-19997.tar.gz - Original 20 Newsgroups data set (18846 documents)

where headers can be removed with remove=('headers',).
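For reference, a sketch of that call, guarded with download_if_missing=False so it only touches a locally cached copy instead of downloading (both parameters exist on fetch_20newsgroups):

```python
from sklearn.datasets import fetch_20newsgroups

try:
    # remove=("headers",) strips the newsgroup headers from each post,
    # roughly approximating the header-stripped 18828-document variant.
    data = fetch_20newsgroups(subset="train",
                              remove=("headers",),
                              download_if_missing=False)
    print(len(data.data), "documents")
except IOError:
    # Raised when the dataset is not cached and downloading is disabled.
    print("20 newsgroups data not cached locally")
```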

I don't think that a few duplicates and differences in the kept header fields are fundamental to demonstrating text categorization in this example, which makes the MLComp 20 newsgroups example somewhat redundant with the other text classification example. Maybe the MLComp 20 newsgroups example should be rewritten to use fetch_20newsgroups or made less redundant?

Also, it looks like neither of those examples is run in the example gallery; is that because they need to download an external dataset and take longer to run?

@jnothman
Member

jnothman commented Jan 31, 2017 via email

@lesteve
Member

lesteve commented Feb 1, 2017

Maybe the MLComp 20 newsgroups example should be rewritten to use fetch_20newsgroups or made less redundant?

I would remove the MLComp 20 newsgroups example. Quickly looking at the source code, both examples are very similar. I would be in favour of deprecating load_mlcomp, which is not used anywhere else, is very cumbersome to use (you need to manually download the data and set an environment variable), and has no added value compared to fetch_20newsgroups.

Also, it looks like neither of those examples is run in the example gallery; is that because they need to download an external dataset and take longer to run?

Another thing worth investigating would be to look at the examples that are not run (i.e. the ones whose names do not start with plot_) and check how long they take to run. I tried a few and they did not take too much time. Renaming these examples to start with plot_ would make them better tested and more accessible through a gallery thumbnail.

@rth
Member

rth commented Feb 1, 2017

I would remove the MLComp 20 newsgroup example. Quickly looking at the source code both examples are very similar.

OK, I can make a PR for it.

I would be in favour of deprecating load_mlcomp which is not used anywhere else and is very cumbersome to use (need to manually download the data and set an environment variable) and does not have any added value compared to fetch_20newsgroups.

It is indeed redundant with fetch_20newsgroups for the 20 newsgroups dataset; however, presumably, this function can be used to load any dataset from http://mlcomp.org/datasets (at least according to the docstring). So maybe just raise a deprecation warning if name_or_id corresponds to the 20 newsgroups dataset?
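A toy sketch of that idea; the stub body and the dataset identifiers here are placeholders for illustration, not the real load_mlcomp implementation:

```python
import warnings

def load_mlcomp(name_or_id, set_="raw"):
    # Warn only when the 20 newsgroups dataset is requested; other
    # MLComp datasets keep working unchanged. "379" and "20news-18828"
    # are assumed identifiers, used here only for illustration.
    if str(name_or_id) in ("379", "20news-18828"):
        warnings.warn(
            "Loading 20 newsgroups through load_mlcomp is deprecated; "
            "use sklearn.datasets.fetch_20newsgroups instead.",
            DeprecationWarning,
        )
    return None  # the real loader would return the dataset bunch here
```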

@amueller
Member

amueller commented Mar 4, 2017

This can be closed as the example was removed, right?

@jnothman jnothman closed this as completed Mar 4, 2017

5 participants