
[MRG+2] make_multilabel_classification for large n_features: faster and sparse output support #2828


Closed
jnothman wants to merge 2 commits into scikit-learn:master from jnothman:make_multi_fast2

Conversation

jnothman
Member

@jnothman jnothman commented Feb 5, 2014

This is a more focused version of #2773.

make_multilabel_classification is currently very slow for large n_features, and only outputs dense matrices although it effectively generates a sparse structure.

Although this function could certainly be optimised further, the per-sample Python iteration will remain a bottleneck, so I don't think there's much value in further optimisation.
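For context, the sparse output this PR enables is requested through a keyword argument; the sketch below uses the `sparse` parameter as it exists in released scikit-learn, which may differ slightly from this PR's exact diff:

```python
import scipy.sparse as sp
from sklearn.datasets import make_multilabel_classification

# Generate a multilabel problem with many features; sparse=True asks
# for X as a scipy.sparse CSR matrix instead of a dense array, which
# is both faster to build and far smaller for large n_features.
X, Y = make_multilabel_classification(
    n_samples=50, n_features=10_000, n_classes=5,
    sparse=True, random_state=0,
)

print(X.shape)         # (50, 10000)
print(sp.issparse(X))  # True
print(Y.shape)         # (50, 5)
```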

@coveralls

Coverage Status

Coverage remained the same when pulling f66a70e on jnothman:make_multi_fast2 into 74d3fcf on scikit-learn:master.

@jnothman
Member Author

jnothman commented Feb 5, 2014

@arjoly, you've basically already reviewed this. I just needed a clean sweep. I hope you don't mind giving it another scan.

@@ -309,7 +316,7 @@ def sample_example():
         y = []
         while len(y) != n:
             # pick a class with probability P(c)
-            c = generator.multinomial(1, p_c).argmax()
+            c = np.searchsorted(cumulative_p_c, generator.rand())
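The change above swaps a per-draw multinomial sample for a binary search over the cumulative distribution. A standalone sketch of why the two are equivalent (variable names mirror the diff, but this is illustrative code, not the PR's source):

```python
import numpy as np

rng = np.random.RandomState(0)
p_c = np.array([0.5, 0.3, 0.2])      # class probabilities P(c)
cumulative_p_c = np.cumsum(p_c)      # [0.5, 0.8, 1.0]

# Old approach: one multinomial draw per sample, then argmax --
# allocates a length-n_classes array for every single draw.
c_old = rng.multinomial(1, p_c).argmax()

# New approach: invert the CDF with a binary search -- O(log n_classes)
# per draw, no temporary array.
c_new = np.searchsorted(cumulative_p_c, rng.rand())

# Both return a valid class index sampled with probability P(c).
assert 0 <= c_old < len(p_c)
assert 0 <= c_new < len(p_c)
```

Over many draws the searchsorted version reproduces the target distribution: a uniform variate falling in [0, 0.5) maps to class 0, [0.5, 0.8) to class 1, and [0.8, 1.0) to class 2.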
Member

Could you provide a better name for n?

@jnothman
Member Author

Okay. I'll make the style changes you ask for.

-        n = generator.poisson(n_labels)
+        n_labels = n_classes + 1
+        while (not allow_unlabeled and n_labels == 0) or n_labels > n_classes:
+            n_labels = generator.poisson(n_labels)
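The loop in the diff is a rejection-sampling step: keep drawing from the Poisson until the label count is valid. A self-contained sketch, using a helper name and the reviewer-suggested variable name (both hypothetical, not from the PR):

```python
import numpy as np

def sample_label_count(generator, n_labels, n_classes, allow_unlabeled):
    """Draw how many labels one sample gets: Poisson(n_labels), rejecting
    draws larger than n_classes, and rejecting 0 when unlabeled samples
    are not allowed."""
    # Start with an invalid value so the loop body runs at least once.
    n_nonzero_labels = n_classes + 1
    while ((not allow_unlabeled and n_nonzero_labels == 0)
           or n_nonzero_labels > n_classes):
        n_nonzero_labels = generator.poisson(n_labels)
    return n_nonzero_labels

rng = np.random.RandomState(0)
counts = [sample_label_count(rng, 2, 5, allow_unlabeled=False)
          for _ in range(1000)]
# Every accepted draw lies in the valid range [1, n_classes].
assert all(1 <= c <= 5 for c in counts)
```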
Member

You have a name clash with the original n_labels; maybe n_nonzero_labels would do the job.

@jnothman
Member Author

What did you want wrt the ECML citation?

@arjoly
Member

arjoly commented Feb 12, 2014

What did you want wrt the ECML citation?

Nothing, that was a reference to show that the benchmark I made is sensible.

@jnothman
Member Author

Gotcha.

@jnothman
Member Author

So does this get your +1, @arjoly?


@arjoly
Member

arjoly commented Feb 12, 2014

+1 :-)
thanks for your work !!!

@jnothman
Member Author

Thanks.


@@ -284,7 +290,7 @@ def make_multilabel_classification(n_samples=100, n_features=20, n_classes=5,

     Returns
     -------
-    X : array of shape [n_samples, n_features]
+    X : array or sparse CSR matrix of shape [n_samples, n_features]
         The generated samples.
Member

Maybe the output type should be detailed a bit more here. Reading this docstring, I might get things wrong. Or maybe an example in the docstring? I am trying to see how we can avoid users getting confused and using this function incorrectly.

Member

By the way, would a correct description of this function be that it transforms a multi-label 'y' into a multi-output 'y' by performing a one-hot encoding? It's a description full of jargon, so it shouldn't be the only description, but if it is indeed correct, it might help some users understand the function.

Member Author

This PR is about whether the returned X is sparse or not. There's no one-hot encoding. You're right, however, that the Y description is a bit lacking (though the return_indicator parameter isn't terrible).

Member

but the return_indicator parameter isn't terrible

I would happily drop support for the tuple-of-lists-of-labels format.

Member Author

There are open PRs which deprecate that support across the board, including here.

@GaelVaroquaux
Member

General comment: I think that this feature should get more examples in the docstring and a bit more discussion in the narrative documentation. It can be hard to fit in the mind of someone who is not used to these patterns. Also, it would be easy for a user with these use cases to never find these functions.

Maybe also an additional example in the set of examples? For instance in the applications directory.

@arjoly
Member

arjoly commented Mar 24, 2014

The narrative document about sample generators could indeed be improved.

@jnothman
Member Author

Seeing as I've only really played with a sparse make_multilabel_classification for benchmarking, I'm not sure what to propose as an example. The issue was more that it had real scalability problems with large n_features.

I think you're right, though, that the sample generators are complex and magical, and could do with more in the narrative documentation.

@GaelVaroquaux
Member

Seeing as I've only really played with a sparse make_multilabel_classification for benchmarking, I'm not sure what to propose as an example.

Topic predictions in documents? Do we have a multi-label text dataset?

The reason that I think an example is very important is that it will help us identify whether our API makes sense from an application perspective or not. I find 'example-driven development' to be very important.

@jnothman
Member Author

Do we have a multi-label text dataset?

We do, and there is no shortage of them available.

But make_multilabel_classification only very roughly approximates
something like this, because:

  • words should be distributed according to a power law
  • the artificial classes being generated don't share a "base vocabulary"
  • words shouldn't be sampled from each label uniformly given a document and
    its set of labels.
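The first point can be made concrete: word frequencies in real text roughly follow a power law (Zipf's law), whereas the generator effectively draws features from flat per-class distributions. A small illustrative sketch (the distributions are stylised, not taken from the generator's code):

```python
import numpy as np

n_words = 1000

# Zipf-like distribution: P(rank r) proportional to 1/r, as in
# natural-language text.
ranks = np.arange(1, n_words + 1)
p_zipf = 1.0 / ranks
p_zipf /= p_zipf.sum()

# Flat distribution, closer to what the generator effectively uses
# within each class.
p_flat = np.full(n_words, 1.0 / n_words)

# Under Zipf, a handful of words dominate the probability mass;
# under the flat distribution every word is equally likely.
print(p_zipf[:10].sum())  # ~0.39: the 10 most frequent words carry ~39%
print(p_flat[:10].sum())  # 0.01: each word carries 1/1000 of the mass
```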

So I think part of the answer is: make_multilabel_classification doesn't
reflect a real-world problem well. I guess that means it doesn't make for a
very useful learning benchmark. Should we be trying to generate something
that looks more like real text data? How would we know that it does?

This PR doesn't aim to change that functionality, but at least makes it
operate in reasonable time given reasonable parameters.


@arjoly
Member

arjoly commented Apr 1, 2014

General comment: I think that this feature should get more examples in the docstring and a bit more discussion in the narrative documentation. It can be hard to fit in the mind of someone who is not used to these patterns. Also, it would be easy for a user with these use cases to never find these functions.

Would it be enough to add one more example in the docstring?

Maybe also an additional example in the set of examples? For instance in the applications directory.

There is already one example using this dataset. Maybe this could be done in #3001.

This PR doesn't aim to change that functionality, but at least makes it
operate in reasonable time given reasonable parameters.

+1. Let's stay focused.

@arjoly arjoly mentioned this pull request May 19, 2014
3 tasks
@arjoly
Member

arjoly commented May 19, 2014

Any last review?

@arjoly
Member

arjoly commented Jun 30, 2014

@hamsal Here you have a fast version of make_multilabel_classification for your benchmarks.

@hamsal hamsal mentioned this pull request Jun 30, 2014
6 tasks
@ogrisel
Member

ogrisel commented Jul 2, 2014

I did some quick histogram plots of features generated by this PR vs master and they look the same. +1 for merge!

@jnothman
Member Author

jnothman commented Jul 2, 2014

Are you saying they generate the same features? That's surprising to me. Or are you saying they generate the same character of dataset? Yes, they should. Whether it's a meaningful dataset is another matter. Thanks for the +1.

@ogrisel
Member

ogrisel commented Jul 2, 2014

Not exactly the same distribution, but similar-looking histograms.

@arjoly
Member

arjoly commented Jul 8, 2014

Thus we are at +2 for merge.

@arjoly arjoly changed the title [MRG+1] make_multilabel_classification for large n_features: faster and sparse output support [MRG+2] make_multilabel_classification for large n_features: faster and sparse output support Jul 8, 2014
@arjoly
Member

arjoly commented Jul 8, 2014

Merged by rebase.
Thanks @jnothman !

@arjoly arjoly closed this Jul 8, 2014
@arjoly
Member

arjoly commented Jul 8, 2014

I should have checked the build before pushing. :-/
I will send a fix.

@arjoly
Member

arjoly commented Jul 8, 2014

I have pushed a patch.
