FEA add stratified k-fold iterators for splitting multilabel data #26423

Open
wants to merge 52 commits into
base: main

@Charlie-XIAO (Contributor) commented May 24, 2023

Reference Issues/PRs

Towards #25193.

What does this implement/fix? Explain your changes.

This PR adds support for "multilabel-indicator" targets in StratifiedKFold by implementing the iterative stratification algorithm for multi-label classification (I partially referred to the implementation here). As a consequence, RepeatedStratifiedKFold now supports "multilabel-indicator" targets as well. The paper that proposed iterative stratification is this, and this video may be helpful for understanding the algorithm.
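
For orientation, here is a minimal pure-Python sketch of the idea behind iterative stratification: repeatedly take the rarest remaining label and hand each of its samples to the fold that still wants that label most. It is not the code in this PR. The function name `iterative_stratification_sketch`, its parameters, the random tie-breaking, and the handling of samples without labels are all assumptions made for illustration.

```python
import random


def iterative_stratification_sketch(Y, n_splits, seed=0):
    """Return a fold index for every row of the binary indicator matrix Y."""
    rng = random.Random(seed)
    n_samples, n_labels = len(Y), len(Y[0])
    # Number of samples each fold still "wants", overall and per label.
    want_total = [n_samples / n_splits] * n_splits
    want_label = [
        [sum(row[j] for row in Y) / n_splits for j in range(n_labels)]
        for _ in range(n_splits)
    ]
    fold_of = [None] * n_samples
    remaining = set(range(n_samples))

    def assign(i, label=None):
        # Prefer the fold that most wants `label`, then the fold that most
        # wants samples overall; break any remaining tie at random.
        def desire(f):
            per_label = want_label[f][label] if label is not None else 0.0
            return (per_label, want_total[f])

        best = max(desire(f) for f in range(n_splits))
        f = rng.choice([f for f in range(n_splits) if desire(f) == best])
        fold_of[i] = f
        remaining.discard(i)
        want_total[f] -= 1
        for j in range(n_labels):
            if Y[i][j]:
                want_label[f][j] -= 1

    while remaining:
        counts = [sum(Y[i][j] for i in remaining) for j in range(n_labels)]
        present = [j for j in range(n_labels) if counts[j] > 0]
        if not present:
            break  # only samples without any positive label are left
        rarest = min(present, key=lambda j: counts[j])
        for i in sorted(i for i in remaining if Y[i][rarest]):
            assign(i, rarest)

    # Assumption: unlabeled samples are spread by overall demand only.
    for i in sorted(remaining):
        assign(i)
    return fold_of


# Toy target: label 0 has 3 positives, label 1 has 6, label 2 has 8.
Y = [
    [1, 1, 0], [1, 0, 1], [1, 1, 1], [0, 1, 0],
    [0, 1, 1], [0, 1, 1], [0, 1, 1], [0, 0, 1],
    [0, 0, 1], [0, 0, 1], [0, 0, 0], [0, 0, 0],
]
folds = iterative_stratification_sketch(Y, n_splits=3)
print(folds)
```

In this toy target the rarest label (column 0) has three positives, so with three folds each fold receives exactly one of them. The sketch makes no promise about fold sizes or data ordering, both of which the PR tests separately.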

Beyond the basics, StratifiedKFold with a multi-label target is tested for the following properties:

  • It preserves data ordering as much as possible.
  • It preserves the ratio of positive to negative examples of each label in individual splits.
  • The difference between maximum and minimum test sizes is at most 1.
  • The stratification yields the same indices regardless of what the label values actually are.
  • Shuffling happens when requested, as for single-label targets.
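
The size and ratio properties in this list can be checked with a small helper along the following lines. This is a sketch, not the PR's test code: `fold_report` and the hand-written fold assignment are hypothetical, and the helper assumes every fold is non-empty.

```python
def fold_report(Y, fold_of, n_splits):
    """Return (test-fold sizes, per-fold positive rate of each label)."""
    n_labels = len(Y[0])
    sizes, rates = [], []
    for f in range(n_splits):
        rows = [row for row, fold in zip(Y, fold_of) if fold == f]
        sizes.append(len(rows))
        rates.append(
            [sum(row[j] for row in rows) / len(rows) for j in range(n_labels)]
        )
    return sizes, rates


# Hand-written example: 6 samples, 2 labels, 3 test folds.
Y = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
fold_of = [0, 1, 2, 0, 1, 2]
sizes, rates = fold_report(Y, fold_of, n_splits=3)

# Positive rate of each label over the whole target.
overall = [sum(row[j] for row in Y) / len(Y) for j in range(len(Y[0]))]

print(max(sizes) - min(sizes) <= 1)      # True
print(all(r == overall for r in rates))  # True
```

On real data the per-fold rates will only approximate the overall rates, so a tolerance-based comparison would be more appropriate than the exact equality used in this toy case.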

As for documentation:

  • This PR includes a brief example in doc/modules/cross_validation.rst.
  • I have drafted a visualization here based on examples/model_selection/plot_cv_indices.py, though it is not included in the PR yet.

Some other comments:

  • [["0", "1"], ["1", "0"]] is considered "multiclass-multioutput" instead of "multilabel-indicator". Is this the desired behavior?
  • I think model_selection/tests/test_split.py may need to be refactored; it seems too messy now.
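
The first comment can be reproduced with `type_of_target` from `sklearn.utils.multiclass`. The snippet below is a small sketch that contrasts the string-valued target from the comment with its integer-valued counterpart; the variable names are only illustrative.

```python
from sklearn.utils.multiclass import type_of_target

string_target = [["0", "1"], ["1", "0"]]
integer_target = [[0, 1], [1, 0]]

# String entries are not treated as a binary indicator matrix.
print(type_of_target(string_target))   # multiclass-multioutput

# Integer 0/1 entries are recognized as an indicator matrix.
print(type_of_target(integer_target))  # multilabel-indicator
```

Until this is settled, callers holding string-encoded indicators would need to convert them to integers before passing them to a splitter that expects "multilabel-indicator" targets.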

@Charlie-XIAO Charlie-XIAO changed the title Feature New k-fold iterator for multilabel classificaation Feature New k-fold iterator for multilabel classification May 24, 2023
@Charlie-XIAO Charlie-XIAO changed the title Feature New k-fold iterator for multilabel classification Feature MultilabelStratifiedKFold and RepeatedMultilabelStratifiedKFold for splitting multilabel data May 26, 2023
@Charlie-XIAO Charlie-XIAO changed the title Feature MultilabelStratifiedKFold and RepeatedMultilabelStratifiedKFold for splitting multilabel data Feature Stratified k-fold iterators for splitting multilabel data May 26, 2023
@Charlie-XIAO Charlie-XIAO marked this pull request as draft December 14, 2023 13:57
@Charlie-XIAO Charlie-XIAO marked this pull request as ready for review December 17, 2023 06:02
@Charlie-XIAO (Contributor, Author) commented:

@glemaitre Would you have time for a review? I have refactored the code to avoid creating two new classes. Please see also #26423 (comment)

@glemaitre glemaitre self-requested a review January 9, 2024 09:32
@jeremiedbb jeremiedbb modified the milestones: 1.5, 1.6 May 13, 2024
@glemaitre (Member) commented:

Let's not rush here and move it to the next release.

@glemaitre glemaitre modified the milestones: 1.6, 1.7 Nov 7, 2024
3 participants