[MRG+1] Sparse multilabel target support in metrics #3395
Great! I will review this PR :-) when I find some time.
def _weight(self, X):
    print(X)
    if self.sample_weight is not None:
        print(X * self.sample_weight[:, None])
They change the travis output in an interesting way.
;)
Fixed and hidden from history!
Travis tests are not really failing here; it's just that the output gets truncated. At the very least they work on my box.
if labels is None:
    labels = unique_labels(y_true, y_pred)
if binarize:
    binarizer = MultiLabelBinarizer([labels])
This can (and I guess should) use the sparse_output=True param, which has been merged to master in the meantime.
Hmm... currently the binarize option isn't used (or tested, indeed), but you're right that it would be better for it to produce sparse output. I can just remove it. Or, as @arjoly suggests, get rid of _SequencesMultilabelHelper and use it always. Perhaps that deserves a benchmark.
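For reference, a minimal sketch of what binarizing sequences of sequences with sparse output looks like, assuming a scikit-learn version whose `MultiLabelBinarizer` accepts `sparse_output`. The sample labels below are made up for illustration and are not taken from this PR:

```python
import scipy.sparse as sp
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical multilabel targets in the deprecated sequence-of-sequences format.
y_seq = [[1, 2], [3], []]

# classes fixes the column order; sparse_output=True yields a SciPy sparse
# matrix instead of a dense indicator array.
binarizer = MultiLabelBinarizer(classes=[1, 2, 3], sparse_output=True)
Y = binarizer.fit_transform(y_seq)

print(sp.issparse(Y))
print(Y.toarray())
```

Each row of `Y` is the indicator vector of one sample's label set, so an empty label set becomes an all-zero row.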
A few comments:
FWIW, this is the benchmark without turning everything into a sparse matrix:
This is with:
We're dealing with small numbers apart from the sequences case, which is being deprecated but is substantially faster without binarizing. Seeing as the implementation is here and tested, we might as well keep it fast until deprecation is complete. Making dense data sparse adds ~0.04s, but we're currently dealing with density settings that may favour sparse. We could experiment with varying those parameters, but I don't see the point.
Thanks for the benchmark!
Thanks Joel,
If I understand the state of play, #3276 only makes LabelBinarizer sparse
On 17 July 2014 17:08, Vlad Niculae [email protected] wrote:
+1 for a benchmark!
Beautiful implementation of a generic pattern. It certainly makes the code in metrics much more readable. However, I am a bit worried that such patterns require some learning to be able to read the codebase, and will make it harder for people without a lot of expertise to maintain the codebase. My gut feeling, and it's only a gut feeling, is that we could try to have a set of functions that implement the methods that you created, but as functions, not as methods. This means that the routing of the genericity would be done inside the function. I find it hard to know beforehand whether it is actually going to result in more readable code or not. Would you bear with me, and try to implement this approach? Maybe in a separate PR to compare (I don't know if the separate PR is a good or a bad idea).
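To make the function-based alternative concrete, here is a rough sketch of routing the format handling inside a plain function rather than in a helper class. `count_nonzero_per_sample` is a hypothetical name chosen for illustration; it is not a function from this PR or from scikit-learn:

```python
import numpy as np
import scipy.sparse as sp


def count_nonzero_per_sample(y):
    """Count the labels assigned to each sample.

    Hypothetical helper: accepts a dense or sparse label indicator matrix
    and dispatches on the format inside the function body.
    """
    if sp.issparse(y):
        # Copy so that dropping explicit zeros does not mutate the input.
        y = sp.csr_matrix(y, copy=True)
        y.eliminate_zeros()
        # In CSR, consecutive indptr entries delimit each row's stored values.
        return np.diff(y.indptr)
    return np.count_nonzero(np.asarray(y), axis=1)
```

The caller never needs to know which format it holds; the trade-off is that every such function repeats the format check that a helper class would perform once.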
As an intuition, as long as we are handling formats that require
The other option that avoids a helper class is to put all the metric
On 20 July 2014 02:32, Gael Varoquaux [email protected] wrote:
I think 3 classes with 2 out of 3 filled may make for an appropriate dense vs convert-to-sparse benchmark:
This says nothing about memory. In terms of time, the conversion has a cost, but runtime is still very small, so I could consider removing dense support. @GaelVaroquaux, I could convert everything to sparse matrices and simplify the code a lot (unless we wanted a similar polymorphism for the multiclass case), particularly in using functions rather than methods of a helper class. This would mean slower sequences of sequences support, as shown above, but perhaps that is an incentive for people to heed the
@arjoly, damn your cruel refactoring...
Hopefully I correctly edited that rebase
sorry :-(
I am fine with slower sequence of sequence.
@GaelVaroquaux, I think you'll much prefer this version where everything is calculated over CSR matrices... I am only concerned that the
Rebased.
Performance looks great! Though I see a large discrepancy for some metrics between master and this PR.
Awesome PR!!! Could you update the docstrings and the narrative doc to highlight your work?
The extra call for unique labels is fine to me.
Fair point. I'd better go looking for things to change...
Do you think each docstring needs to specify "or sparse/dense label indicator matrix"?
I would do something in the spirit of what we have done for sparse one versus rest.
I see there you use Currently, though, we say "array-like or label indicator matrix". What we want to say is something like: "1d array-like, or label indicator array / sparse matrix". I.e. "{array-like 1d,array of 1s and 0s,sparse matrix of 1s}" |
code is awesome!
If it fits within 80 characters and contains the shape, I am +1 for an improved docstring.
Something else I now realise is missing here is sparse support in LRAP
If you are at it, I would be happy to see sparse input support for LRAP.
Pushed changes to support sparse matrix in LRAP, and to improve documentation.
All multilabel metric calculation is now performed efficiently over sparse CSR matrices.
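As an illustration of what computing over CSR matrices can look like, a per-sample Jaccard similarity is sketched below. This is a simplified stand-in and not the code in this PR; `jaccard_per_sample` is a hypothetical name, and scoring two empty label sets as 1.0 is an assumption made for the example:

```python
import numpy as np
import scipy.sparse as sp


def jaccard_per_sample(y_true, y_pred):
    """Per-sample Jaccard similarity for binary label indicator matrices."""
    y_true = sp.csr_matrix(y_true)
    y_pred = sp.csr_matrix(y_pred)
    # Element-wise product of binary indicators marks labels in both sets.
    intersection = np.asarray(y_true.multiply(y_pred).sum(axis=1)).ravel()
    n_true = np.asarray(y_true.sum(axis=1)).ravel()
    n_pred = np.asarray(y_pred.sum(axis=1)).ravel()
    union = n_true + n_pred - intersection
    # Assumption: score 1.0 when both label sets are empty.
    scores = np.ones(union.shape[0], dtype=float)
    nonempty = union > 0
    scores[nonempty] = intersection[nonempty] / union[nonempty]
    return scores
```

Only the stored nonzero entries take part in the product and the row sums, which is why the sparse path stays cheap when each sample carries few labels.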
Assuming Travis is happy, I think this is where we want it to be. Votes for merge? @arjoly? @GaelVaroquaux?
Thanks for the lrap metric!!!
You get my +1
Thanks @arjoly
A last reviewer?
It should probably be updated to take into account the sample weight support in the jaccard metric.
What's to update?
Apparently nothing, sorry. I should have checked the implementation.
This introduces a series of helper classes that abstract away aggregation over multilabel structures. This enables efficient calculation in sparse and dense binary indicator matrices, while maintaining support for the deprecated sequences format.