[MRG+2] make_multilabel_classification for large n_features: faster and sparse output support #2828
Conversation
@arjoly, you've basically already reviewed this. I just needed a clean sweep. I hope you don't mind giving it another scan.
```diff
@@ -309,7 +316,7 @@ def sample_example():
         y = []
         while len(y) != n:
             # pick a class with probability P(c)
-            c = generator.multinomial(1, p_c).argmax()
+            c = np.searchsorted(cumulative_p_c, generator.rand())
```
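For readers skimming the diff, here is a minimal standalone sketch (not the PR's exact code) of why the two lines draw from the same categorical distribution: `searchsorted` does inverse-CDF sampling with a binary search over precomputed cumulative probabilities, instead of materialising a one-hot multinomial draw for every class pick.

```python
import numpy as np

rng = np.random.RandomState(0)
p_c = np.array([0.5, 0.3, 0.2])      # class probabilities P(c)
cumulative_p_c = np.cumsum(p_c)      # array([0.5, 0.8, 1.0])

# Old approach: materialise a one-hot multinomial draw, then take argmax.
c_old = rng.multinomial(1, p_c).argmax()

# New approach: locate a single uniform draw in the cumulative
# distribution by binary search; no length-n_classes array per draw.
c_new = np.searchsorted(cumulative_p_c, rng.rand())
print(c_old, c_new)
```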
Could you provide a better name for `n`?
Okay. I'll make the style changes you've asked for.
```diff
-        n = generator.poisson(n_labels)
+        n_labels = n_classes + 1
+        while (not allow_unlabeled and n_labels == 0) or n_labels > n_classes:
+            n_labels = generator.poisson(n_labels)
```
You have a name clash with the original `n_labels`; maybe `n_nonzero_labels` would do the job.
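To make the suggestion concrete, a sketch of the rejection-sampling loop with the rename applied; the variable name ultimately merged may differ:

```python
import numpy as np

def draw_label_count(generator, n_labels, n_classes, allow_unlabeled):
    # Start from an invalid sentinel so the loop runs at least once, then
    # redraw until the Poisson sample is a valid label count.
    n_nonzero_labels = n_classes + 1
    while ((not allow_unlabeled and n_nonzero_labels == 0)
           or n_nonzero_labels > n_classes):
        n_nonzero_labels = generator.poisson(n_labels)
    return n_nonzero_labels

print(draw_label_count(np.random.RandomState(0), n_labels=2,
                       n_classes=5, allow_unlabeled=False))
```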
What did you want wrt the ECML citation?
Nothing, that was a reference to show that the benchmark I made is sensible.
Gotcha.
So does this get your +1, @arjoly?
+1 :-)
Thanks.
```diff
@@ -284,7 +290,7 @@ def make_multilabel_classification(n_samples=100, n_features=20, n_classes=5,

     Returns
     -------
-    X : array of shape [n_samples, n_features]
+    X : array or sparse CSR matrix of shape [n_samples, n_features]
         The generated samples.
```
Maybe the output type should be detailed a bit here. Reading this docstring, I might get things wrong. Or maybe an example in the docstring? I am trying to see how we can avoid users getting confused and using this function incorrectly.
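For instance, a docstring example along these lines might do; note that the `sparse` keyword name below is an assumption based on this PR's title, not something confirmed by the diff shown here:

```python
from scipy import sparse
from sklearn.datasets import make_multilabel_classification

# Hypothetical usage: request a sparse feature matrix (the `sparse`
# keyword is assumed from the PR title).
X, Y = make_multilabel_classification(n_samples=5, n_features=20,
                                      n_classes=3, sparse=True,
                                      random_state=0)
print(sparse.issparse(X))  # True: X is a scipy.sparse CSR matrix
print(X.shape)             # (5, 20)
```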
By the way, would a correct description of this function be that it transforms a multi-label `y` into a multi-output `y` by performing a one-hot encoding? It's a description full of jargon, so it shouldn't be the only description, but if it is indeed correct, it might help some users understand the function.
This PR is about whether the returned `X` is sparse or not. There's no one-hot encoding. You're right, however, that the `Y` description is a bit lacking (but the `return_indicator` parameter isn't terrible).
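For context, a sketch of the two `Y` representations under discussion; the `MultiLabelBinarizer` used here is just a convenient way to illustrate the mapping and is not part of this PR:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# One tuple of labels per sample (the "sequence of sequences" format)...
y_lists = [(0, 2), (1,), ()]

# ...versus the equivalent binary indicator matrix.
Y_indicator = MultiLabelBinarizer(classes=[0, 1, 2]).fit_transform(y_lists)
print(Y_indicator)
# [[1 0 1]
#  [0 1 0]
#  [0 0 0]]
```

As the name suggests, `return_indicator` switches between these two shapes.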
> but the return_indicator parameter isn't terrible

I would happily drop the support for tuples of lists of labels.
There are open PRs which deprecate that support across the board, including here.
General comment: I think that this feature should get more examples in the docstring and a bit more discussion in the narrative documentation. It can very well be hard to fit in the mind of someone who is not used to these patterns. Also, it would be easy for a user with these use cases to never find these functions. Maybe also an additional example in the set of examples? For instance in the applications directory.
The narrative documentation about sample generators could indeed be improved.
Seeing as I've only really played with a sparse … I think you're right, though, that the sample generators are complex and magical, and could do with more in the narrative documentation.
Topic predictions in documents? Do we have a multi-label text dataset? The reason I think an example is very important is that it will …
We do, and there is no shortage of them available. But …

So I think part of the answer is: make_multilabel_classification doesn't … This PR doesn't aim to change that functionality, but at least makes it …
Would it be enough to add one more example in the docstring?
There is already one example using this dataset. Maybe this could be done in #3001.
+1. Let's stay focused.
Any last review?
@hamsal Here you have a fast version of `make_multilabel_classification`.
I did some quick histogram plots of features generated by this PR vs master and they look the same. +1 for merge!
Are you saying they generate the same features? That would be surprising to me. Or are you saying they generate the same character of dataset? If the latter: yes, they should. Whether it's a meaningful dataset is another matter. Thanks for the +1.
Not exactly the same distribution, but similar-looking histograms.
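Roughly how such a sanity check might look (a sketch, not the actual script used; here two random seeds of one build stand in for the two branches):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_multilabel_classification

# Generate two datasets and compare the distributions of feature values.
X_a, _ = make_multilabel_classification(n_samples=1000, n_features=20,
                                        random_state=0)
X_b, _ = make_multilabel_classification(n_samples=1000, n_features=20,
                                        random_state=1)
plt.hist(X_a.ravel(), bins=30, alpha=0.5, label='run A')
plt.hist(X_b.ravel(), bins=30, alpha=0.5, label='run B')
plt.legend()
plt.show()
```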
Thus we are at +2 for merge.
Merged by rebase.
I should have checked the build before pushing. :-/
I have pushed a patch.
This is a more focussed version of #2773.

`make_multilabel_classification` is currently very slow for large `n_features`, and only outputs dense matrices although it effectively generates a sparse structure. Although this function can certainly be optimised further, the per-sample Python iteration will still slow it down, so I don't think there's much value in further optimisation.
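A rough way to check the "slow for large `n_features`" claim on any given scikit-learn version (a sketch; timings will vary by version and machine, and the parameter values here are arbitrary):

```python
from timeit import timeit
from sklearn.datasets import make_multilabel_classification

# Time a single call with a large feature space; before this PR the dense
# per-feature work made such calls disproportionately slow.
t = timeit(lambda: make_multilabel_classification(
    n_samples=100, n_features=50000, random_state=0), number=1)
print('one call: %.2fs' % t)
```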