[MRG+2] make_multilabel_classification for large n_features: faster and sparse output support #2828
Conversation
@arjoly, you've basically already reviewed this. I just needed a clean sweep. I hope you don't mind giving it another scan.
```diff
@@ -309,7 +316,7 @@ def sample_example():
         y = []
         while len(y) != n:
             # pick a class with probability P(c)
-            c = generator.multinomial(1, p_c).argmax()
+            c = np.searchsorted(cumulative_p_c, generator.rand())
```
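For readers skimming the diff, here is a minimal standalone sketch (not the PR's exact code) of why the two lines draw from the same categorical distribution: `searchsorted` does inverse-CDF sampling with a binary search over precomputed cumulative probabilities, instead of materialising a one-hot multinomial draw for every class pick.

```python
import numpy as np

rng = np.random.RandomState(0)
p_c = np.array([0.5, 0.3, 0.2])      # class probabilities P(c)
cumulative_p_c = np.cumsum(p_c)      # array([0.5, 0.8, 1.0])

# Old approach: materialise a one-hot multinomial draw, then take argmax.
c_old = rng.multinomial(1, p_c).argmax()

# New approach: locate a single uniform draw in the cumulative
# distribution by binary search; no length-n_classes array per draw.
c_new = np.searchsorted(cumulative_p_c, rng.rand())
print(c_old, c_new)
```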
Could you provide a better name for `n`?
Okay. I'll make the style changes you've asked for.
```diff
-        n = generator.poisson(n_labels)
+        n_labels = n_classes + 1
+        while (not allow_unlabeled and n_labels == 0) or n_labels > n_classes:
+            n_labels = generator.poisson(n_labels)
```
You have a name clash with the original `n_labels`; maybe `n_nonzero_labels` would do the job.
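To make the suggestion concrete, a sketch of the rejection-sampling loop with the rename applied; the variable name ultimately merged may differ:

```python
import numpy as np

def draw_label_count(generator, n_labels, n_classes, allow_unlabeled):
    # Start from an invalid sentinel so the loop runs at least once, then
    # redraw until the Poisson sample is a valid label count.
    n_nonzero_labels = n_classes + 1
    while ((not allow_unlabeled and n_nonzero_labels == 0)
           or n_nonzero_labels > n_classes):
        n_nonzero_labels = generator.poisson(n_labels)
    return n_nonzero_labels

print(draw_label_count(np.random.RandomState(0), n_labels=2,
                       n_classes=5, allow_unlabeled=False))
```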
What did you want wrt the ECML citation?
Nothing, that was a reference to show that the benchmark I made is sensible.
Gotcha.
So does this get your +1, @arjoly?
+1 :-)
Thanks.
```diff
@@ -284,7 +290,7 @@ def make_multilabel_classification(n_samples=100, n_features=20, n_classes=5,

     Returns
     -------
-    X : array of shape [n_samples, n_features]
+    X : array or sparse CSR matrix of shape [n_samples, n_features]
         The generated samples.
```
Maybe the output type should be detailed a bit here. Reading this docstring, I might get things wrong. Or maybe an example in the docstring? I am trying to see how we can avoid users getting confused and using this function incorrectly.
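For instance, a docstring example along these lines might do; note that the `sparse` keyword name below is an assumption based on this PR's title, not something confirmed by the diff shown here:

```python
from scipy import sparse
from sklearn.datasets import make_multilabel_classification

# Hypothetical usage: request a sparse feature matrix (the `sparse`
# keyword is assumed from the PR title).
X, Y = make_multilabel_classification(n_samples=5, n_features=20,
                                      n_classes=3, sparse=True,
                                      random_state=0)
print(sparse.issparse(X))  # True: X is a scipy.sparse CSR matrix
print(X.shape)             # (5, 20)
```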
By the way, would a correct description of this function be that it transforms a multi-label `y` into a multi-output `y` by performing a one-hot encoding? It's a description full of jargon, so it shouldn't be the only description, but if it is indeed correct, it might help some users understand the function.
This PR is about whether the returned `X` is sparse or not. There's no one-hot encoding. You're right, however, that the `Y` description is a bit lacking (but the `return_indicator` parameter isn't terrible).
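For context, a sketch of the two `Y` representations under discussion; the `MultiLabelBinarizer` used here is just a convenient way to illustrate the mapping and is not part of this PR:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# One tuple of labels per sample (the "sequence of sequences" format)...
y_lists = [(0, 2), (1,), ()]

# ...versus the equivalent binary indicator matrix.
Y_indicator = MultiLabelBinarizer(classes=[0, 1, 2]).fit_transform(y_lists)
print(Y_indicator)
# [[1 0 1]
#  [0 1 0]
#  [0 0 0]]
```

As the name suggests, `return_indicator` switches between these two shapes.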
> but the return_indicator parameter isn't terrible

I would happily drop the support for tuples of lists of labels.
There are open PRs which deprecate that support across the board, including here.
General comment: I think that this feature should get more examples in the docstring and a bit more discussion in the narrative documentation. It can very well be hard to fit in the mind of someone who is not used to these patterns. Also, it would be easy for a user with these use cases to never find these functions. Maybe also an additional example in the set of examples? For instance in the applications directory.
The narrative documentation about sample generators could indeed be improved.
Seeing as I've only really played with a sparse … I think you're right, though, that the sample generators are complex and magical, and could do with more in the narrative documentation.
Topic predictions in documents? Do we have a multi-label text dataset? The reason I think an example is very important is that it will …
We do, and there is no shortage of them available. But …

So I think part of the answer is: make_multilabel_classification doesn't … This PR doesn't aim to change that functionality, but at least makes it …
Would it be enough to add one more example in the docstring?
There is already one example using this dataset. Maybe this could be done in #3001.
+1. Let's stay focused.
Any last review?
@hamsal Here you have a fast version of `make_multilabel_classification`.
I did some quick histogram plots of features generated by this PR vs master and they look the same. +1 for merge!
Are you saying they generate the same features? That would be surprising to me. Or are you saying they generate the same character of dataset? If the latter: yes, they should. Whether it's a meaningful dataset is another matter. Thanks for the +1.
Not exactly the same distribution, but similar-looking histograms.
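Roughly how such a sanity check might look (a sketch, not the actual script used; here two random seeds of one build stand in for the two branches):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_multilabel_classification

# Generate two datasets and compare the distributions of feature values.
X_a, _ = make_multilabel_classification(n_samples=1000, n_features=20,
                                        random_state=0)
X_b, _ = make_multilabel_classification(n_samples=1000, n_features=20,
                                        random_state=1)
plt.hist(X_a.ravel(), bins=30, alpha=0.5, label='run A')
plt.hist(X_b.ravel(), bins=30, alpha=0.5, label='run B')
plt.legend()
plt.show()
```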
Thus we are at +2 for merge.
Merged by rebase.
I should have checked the build before pushing. :-/
I have pushed a patch.
This is a more focussed version of #2773.

`make_multilabel_classification` is currently very slow for large `n_features`, and only outputs dense matrices although it effectively generates a sparse structure. Although this function can certainly be optimised further, the per-sample Python iteration will still slow it down, so I don't think there's much value in further optimisation.
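A rough way to check the "slow for large `n_features`" claim on any given scikit-learn version (a sketch; timings will vary by version and machine, and the parameter values here are arbitrary):

```python
from timeit import timeit
from sklearn.datasets import make_multilabel_classification

# Time a single call with a large feature space; before this PR the dense
# per-feature work made such calls disproportionately slow.
t = timeit(lambda: make_multilabel_classification(
    n_samples=100, n_features=50000, random_state=0), number=1)
print('one call: %.2fs' % t)
```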