Conversation

@leonardbinet
Contributor

@leonardbinet leonardbinet commented Sep 1, 2019

Reference Issues/PRs

Fixes #14860
Fixes #18611
Closes #23569

What does this implement/fix? Explain your changes.

This fixes the "sklearn.utils.multiclass.type_of_target" function for sparse matrices.

@leonardbinet leonardbinet changed the title fix type_of_target for csr_matrices [WIP] fix type_of_target for csr_matrices Sep 1, 2019
@leonardbinet leonardbinet changed the title [WIP] fix type_of_target for csr_matrices [MRG] fix type_of_target for csr_matrices Sep 1, 2019
@leonardbinet leonardbinet force-pushed the fix_type_of_target branch 2 times, most recently from f212400 to 6203a53 Compare September 1, 2019 12:57
@leonardbinet leonardbinet force-pushed the fix_type_of_target branch 2 times, most recently from f26476b to 3c31304 Compare September 9, 2019 22:15
@leonardbinet
Contributor Author

@jnothman is there any action expected on my side to merge this PR?

@jnothman
Member

No, there is just a lot of competing demand on reviewers' time. Thanks for pinging.

@rth rth self-requested a review July 25, 2020 09:59
@cmarmo
Contributor

cmarmo commented Aug 17, 2020

Hi @leonardbinet, thanks for your patience! Do you mind fixing conflicts? Hopefully, this will bring some attention again. Thanks!

@alk-lbinet
Contributor

@cmarmo here it is :)

@cmarmo
Contributor

cmarmo commented Sep 6, 2020

Thanks @alk-lbinet. It seems to me that the failing check is unrelated to this PR. Perhaps @rth will find some time to review? Thanks!

@glemaitre glemaitre self-requested a review October 19, 2020 20:20
@glemaitre glemaitre changed the title [MRG] fix type_of_target for csr_matrices ENH Support CSR matrix in type_of_target Oct 20, 2020
Member

@glemaitre glemaitre left a comment

We need to update the docstring:

Parameters
----------
y : {array-like, sparse matrix}
    The target array. If a sparse matrix, `y` is expected to be a
    CSR matrix.

@glemaitre
Member

We would also need an entry in what's new to announce that we support CSR matrices.

@eddiebergman
Contributor

Bump, we currently accept sparse labels into autosklearn but we have to de-sparsify them to use type_of_target.

@leonardbinet will you be continuing this PR? If not, I am happy to try to finish it and push this feature through next week.

@ilivans

ilivans commented Feb 11, 2022

@leonardbinet can I get access to push this one, please?

@leonardbinet
Contributor Author

Hi, @ilivans I gave you access, feel free to update my branch 👍

@ilivans

ilivans commented Feb 12, 2022

I think @GuoqiangOu's questions come down to the following:

  • why type_of_target([[0, 0.5]]) == 'continuous-multioutput' and not 'continuous'?
    • this does contradict the documentation, and the same holds for 'multiclass-multioutput'. However, it has little to do with sparse matrices, so it's outside of the PR's scope AFAIC. It may be worth creating an issue and fixing the documentation (or the behavior if necessary)

Another edge case that I found is

  • type_of_target([[-1, 1]]) != type_of_target(csr_matrix([[-1, 1]])) ('multilabel-indicator' != 'multiclass-multioutput')
    • seemingly the same matrix in different formats turns out to be of different types
    • this has to do with the assumption that sparse matrices have 0 as an implicit negative value, so the last matrix actually has 3 values (0, -1 and 1). I'm not sure it's worth documenting though, it seems to be common sense. I would actually make 0 the only acceptable negative value for multilabel.
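
The dense/sparse asymmetry in this edge case can be sketched in plain Python. This is a simplified illustration of the rule as discussed in this thread, not scikit-learn's actual implementation: the helper names `dense_is_indicator` and `sparse_is_indicator` are hypothetical, and a dict of coordinates stands in for a sparse matrix.

```python
def dense_is_indicator(rows):
    """Dense rule (simplified): at most two distinct values overall."""
    values = {v for row in rows for v in row}
    return len(values) <= 2


def sparse_is_indicator(stored):
    """Sparse rule (simplified): only explicitly stored values are inspected.

    `stored` maps (row, col) -> value and is a hypothetical stand-in for a
    sparse matrix. Cells that are not stored are implicitly 0 and act as the
    negative label, so at most one distinct nonzero value may be stored.
    """
    nonzero = {v for v in stored.values() if v != 0}
    return len(nonzero) <= 1


dense = [[-1, 1]]
sparse = {(0, 0): -1, (0, 1): 1}  # same entries, sparse-style storage

print(dense_is_indicator(dense))         # True
print(sparse_is_indicator(sparse))       # False
print(sparse_is_indicator({(0, 1): 1}))  # True
```

Under these assumptions, the dense [[-1, 1]] passes as an indicator matrix while its sparse counterpart does not, which mirrors the mismatch reported above.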

@ilivans

ilivans commented Feb 12, 2022

In the last commits I addressed the comments regarding documentation, csc_matrix and explicit zeros.

It looks done to me. Please take a look, somebody 🙏

cc @glemaitre @rth

@jnothman
Member

why type_of_target([[0, 0.5]]) == 'continuous-multioutput' and not 'continuous'?

It has two columns.

this has to do with the assumption that sparse matrices have 0 as an implicit negative value, so the last matrix actually has 3 values (0, -1 and 1). I'm not sure it's worth documenting though, it seems to be common sense. I would actually make 0 the only acceptable negative value for multilabel.

Is there sense in rejecting a full "sparse" matrix with no zeros and two nonzero values?

-1 has been a longstanding label for "negative", thanks to support vector machines at least.

@ilivans

ilivans commented Feb 12, 2022

thanks @jnothman 🙌

why type_of_target([[0, 0.5]]) == 'continuous-multioutput' and not 'continuous'?

It has two columns.

It does; however, the function's documentation says

"'continuous-multioutput': y is a 2d array of floats that are
not all integers, and both dimensions are of size > 1."

If it was a (2,1) matrix (a column) it would be treated as 'continuous'.
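
The column-count behavior discussed here can be sketched as follows. This is a minimal illustration that assumes the input is already known to contain non-integer floats and models only the suffix decision described in the replies above; `continuous_label` is a hypothetical helper, not part of scikit-learn.

```python
def continuous_label(y):
    """Pick the label for a 2D list of floats from its column count alone."""
    n_columns = len(y[0])
    # More than one column means multioutput, regardless of the row count.
    return "continuous-multioutput" if n_columns > 1 else "continuous"


print(continuous_label([[0, 0.5]]))    # continuous-multioutput (shape (1, 2))
print(continuous_label([[0], [0.5]]))  # continuous (shape (2, 1))
```

The sketch makes the mismatch with the quoted docstring visible: a (1, 2) input gets the multioutput suffix even though only one of its dimensions is of size > 1.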

this has to do with the assumption that sparse matrices have 0 as an implicit negative value, so the last matrix actually has 3 values (0, -1 and 1). I'm not sure it's worth documenting though, it seems to be common sense. I would actually make 0 the only acceptable negative value for multilabel.

Is there sense in rejecting a full "sparse" matrix with no zeros and two nonzero values?

-1 has been a longstanding label for "negative", thanks to support vector machines at least.

It's a good question. I think it makes sense to reject such cases because of the underlying assumption that 0 is the "missing" value (there is a method csr_matrix.eliminate_zeros). So if you're using a sparse matrix as an indicator matrix, it should contain one unique value that's treated as positive. If you're using a sparse matrix with -1 and 1, you're misusing the format AFAIC.
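
To make the point about zeros concrete: CSR storage can hold a zero explicitly, and eliminating zeros drops such entries so that only the nonzero values remain stored. Below is a plain-Python sketch of that idea on raw (data, indices, indptr) lists. It is an illustrative re-implementation under that assumption, not SciPy's code, and the `eliminate_zeros` function here is a hypothetical stand-in for the method of the same name.

```python
def eliminate_zeros(data, indices, indptr):
    """Return CSR arrays with explicitly stored zeros removed."""
    new_data, new_indices, new_indptr = [], [], [0]
    for row in range(len(indptr) - 1):
        # Entries of this row live in data[indptr[row]:indptr[row + 1]].
        for k in range(indptr[row], indptr[row + 1]):
            if data[k] != 0:
                new_data.append(data[k])
                new_indices.append(indices[k])
        new_indptr.append(len(new_data))
    return new_data, new_indices, new_indptr


# The 1x2 matrix [[0, 1]] with the zero stored explicitly.
data, indices, indptr = [0, 1], [0, 1], [0, 2]

print(sorted(set(data)))                       # [0, 1]
print(eliminate_zeros(data, indices, indptr))  # ([1], [1], [0, 1])
```

After elimination, the only stored value is the positive label, which matches the "one unique value" expectation stated above.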

It's easy to change the logic tho, so just lmk if you believe it's necessary to make it consistent with the dense case.

@ilivans

ilivans commented Feb 12, 2022

btw I can't change the PR description, but it also fixes #18611 now

@ilivans

ilivans commented Feb 12, 2022

Is there sense in rejecting a full "sparse" matrix with no zeros and two nonzero values?

I have to add that this logic wasn't introduced by this PR, which only fixes exceptions. It was introduced in 2bfca14c4935eb524cfd7a65dec2370c1f03c857 (ENH sparse matrix support in label binarization):

    'multilabel-indicator': [
        ...
        csr_matrix(np.array([[0, 1]])),
        # Only valid when data is dense
        np.array([[-1, 1], [1, -1]]),
        np.array([[-3, 3], [3, -3]]),
    ],

Member

@jeremiedbb jeremiedbb left a comment

LGTM. I synced with main and added a what's new entry.

In addition to the exhaustive added tests, I checked that it fixes the 2 reported issues, and does not break any existing behavior. Let's merge.

Thanks @leonardbinet and everybody else for the help and feedback.

@jeremiedbb jeremiedbb merged commit ae943bd into scikit-learn:main Sep 21, 2022

Development

Successfully merging this pull request may close these issues.

ENH type_of_target raises unhandled TypeError for sparse matrices
sklearn.utils.multiclass.type_of_target with sparse csr matrix raises ValueError

10 participants