Add strata to Stratified*Split CV splitters #26821

Open · adrinjalali opened this issue Jul 12, 2023 · 19 comments

@adrinjalali
Member

Right now the Stratified*Split classes take y as the strata, but y is not always what one wants to stratify on.

In train_test_split we allow a stratify arg (which I wonder whether it should be called strata instead), which defines the strata samples belong to. Inside, we basically do this:

    # Excerpt from scikit-learn's train_test_split internals; for context,
    # the imports it relies on:
    from itertools import chain

    from sklearn.model_selection import StratifiedShuffleSplit
    from sklearn.utils import _safe_indexing

    cv = StratifiedShuffleSplit(test_size=n_test, train_size=n_train, random_state=random_state)

    train, test = next(cv.split(X=arrays[0], y=stratify))

    return list(
        chain.from_iterable(
            (_safe_indexing(a, train), _safe_indexing(a, test)) for a in arrays
        )
    )

As you can see, we're passing stratify as y to the splitter. I think it would make sense to add a strata arg to the splitters and, when it is None, fall back to the values in y, as is the case now.

Note that now that we have SLEP6, we won't need separate classes for these; we'd simply request strata for the splitter:

cv = StratifiedShuffleSplit().set_split_request(strata=True)
...
GridSearchCV(model, param_grid, cv=cv).fit(X, y, strata=strata_values)
cross_validate(model, X, y, cv=cv, params={"strata": strata_values})

cc @marenwestermann

@github-actions github-actions bot added the Needs Triage label Jul 12, 2023
@adarsh-meher

Hi @adrinjalali. I am new to open source contributions and am looking for an opportunity to get started. Let me know if I can work on this issue.

@adrinjalali
Member Author

We still don't have consensus here on how to proceed, @adarsh-meher. It's better to work on issues that are marked "help wanted" and "good first issue".

@thomasjpfan
Member

All the Stratified*Split CVs assume that they are stratifying on y. In the ML context, I think this is the common use of the term "stratify". If there needs to be other "grouping information", then it becomes a "Group*CV".

train_test_split needs a stratify parameter because it can split any number of arrays, so there is no way to tell which array is the target. Looking through the codebase, only resample and train_test_split have a stratify parameter, for the same reason.

Currently, I am -1 on the proposal of adding strata to Stratified*Split CVs.

@thomasjpfan thomasjpfan added the New Feature and RFC labels and removed the Needs Triage label Jul 24, 2023
@adrinjalali
Member Author

@thomasjpfan Group*CV and Stratified*CV are pretty much opposites of one another. One makes sure each group ends up entirely in either the train or the test set; the other makes sure each group is split proportionately between the train and test sets.

Imagine you have a dataset where one attribute is the state (location) of the event, and due to certain constraints (laws, etc.) you want to have similar proportions of each state's data in train and test; then you stratify on state rather than on the output.

Another example is when you have train or flight data, where each route has its own peculiarities, but you want to train on the whole dataset rather than on a single route, while having enough data from each route in both train and test; here you'd stratify on "route", while y might be "delay".

I'm happy to move this to Group*CV, but then we'd need to add a parameter to decide whether the split should keep each group entirely in either train or test, or keep groups proportionate across train and test.

@j-at-ch

j-at-ch commented Jul 24, 2023

Just a quick comment from a regular user:

+1, I can confirm that the proposed feature has a use case that's prevalent in healthcare: stratifying across protected characteristics such as demographics (usually different from the target). (Note we'd usually also want to group by patient; a sketch of a partial workaround for that combination follows at the end of this comment.)

I would also humbly agree that strata is a more informative name than stratify.

@thomasjpfan
Member

I see the use cases and am now +1 on introducing strata to the splitters.

@ogrisel
Member

ogrisel commented Dec 5, 2024

> I'm happy to move this to Group*CV, but then we'd need to add a parameter to decide whether the split should keep each group entirely in either train or test, or keep groups proportionate across train and test.

The purpose of the Group*CV splitters is the opposite: to generalize across groups. So I would not imply that stratification is related to what we currently name "group*" splitters.

However, some people might want left-out groupwise splitting on some metadata and stratification on some other metadata, no?

@adrinjalali
Member Author

For me, this is what I'd aim for, I think:

cv = KFold(shuffle=True, n_splits=5).set_split_request(groups=True, strata=True)

GridSearchCV(..., cv=cv).fit(X, y, strata=strata_var, groups=groups_var)

# or
cross_validate(..., cv=cv, params={"groups": groups_var, "strata": strata_var})

@ogrisel
Member

ogrisel commented Dec 6, 2024

While I am not fundamentally opposed to this feature, I wonder what concrete need it fulfills and what precise statistical question it answers. If we do decide to implement it, we should start with a convincing practical usage example that highlights how the lack of stratification in the CV evaluation loop would have otherwise led to mistaken conclusions.

Disclaimer: at this time I do not see how to come up with such an example, but I would love to learn something.

@adrinjalali
Member Author

@j-at-ch do you have an example at hand to answer @ogrisel? I don't have a dataset right now; I've only used this, as you say, in healthcare-related work, and I've seen it in other areas as well, but I have no dataset at hand.

@ogrisel, on the other hand, you can believe us when we say this is what we do / have done in practice.

@ogrisel
Member

ogrisel commented Jan 2, 2025

> @ogrisel, on the other hand, you can believe us when we say this is what we do / have done in practice.

It's really not a problem of trust but more a problem of transparency and education. We should not offer new tools for our users to shoot themselves in the foot without a clear notice that documents proper use and known pitfalls.

@lorentzenchr
Member

+1 from me, also without a dataset. The interesting ones are mostly not public.

The arg should be named something like stratum or stratum_id (or strata). Ideally, the default is such that for classification it uses the target classes/labels.
As I would rather reduce the number of splitters, I'd favor introducing it as a new arg instead of a new splitter.

> We should not offer new tools for our users to shoot themselves in the foot without a clear notice that documents proper use and known pitfalls.

I don’t think we should be overprotective like that.

@ogrisel
Member

ogrisel commented Jan 3, 2025

> I don’t think we should be overprotective like that.

I sincerely believe that any extra code-maintenance complexity or usage cognitive load should be matched with demonstrable added user value that can be easily understood by both users and (future) contributors, e.g. via a short example (or preferably by extending an existing example), even if only on synthetic data in the event that no suitable public dataset can be found on openml.org or similar.

@adrinjalali
Member Author

We're at a deadlock here, and that's not a state I'd like to leave this in.

I personally don't have the datasets (they're all private) nor the energy to write up what @ogrisel would need to be happy with this moving forward.

At the same time, according to @ogrisel's arguments, we should remove stratification altogether; are we gonna do that?

I see enough support here to move forward. Should we call for a vote? What's the realistically actionable path forward? Anything other than this being blocked and stalled.

@j-at-ch

j-at-ch commented Feb 3, 2025

@adrinjalali - I'll set aside some time this week to try and put together an argument & synthetic example illustrating the issue that this would solve, for those that haven't faced it before. Can any vote wait until the beginning of next week?

(Apologies for my delayed response.)

@ogrisel
Member

ogrisel commented Feb 5, 2025

> I personally don't have the datasets (they're all private) nor the energy to write up what @ogrisel would need to be happy with this moving forward.

+1 for moving forward, and I will try to find the time to educate myself proactively on valid potential uses of this option so that we can document how to properly use it in the docstring / user guide or examples.

> @adrinjalali - I'll set aside some time this week to try and put together an argument & synthetic example illustrating the issue that this would solve, for those that haven't faced it before.

That would be great if you could contribute to this. Thanks!

> At the same time, according to @ogrisel's arguments, we should remove stratification altogether; are we gonna do that?

I don't think we are ever going to do that because:

  • we would first need to update our CV tools to handle single-class CV folds gracefully (a toy illustration of the problem follows this list), and I am not sure what the proper way is to address all edge cases correctly without too much complexity;
  • but even if we did, I think this would be a disruptive breaking change / noisy deprecation cycle to handle, and I am really not sure if it's worth it.

@ArturoAmorQ
Member

> While I am not fundamentally opposed to this feature, I wonder what concrete need it fulfills and what precise statistical question it answers. If we do decide to implement it, we should start with a convincing practical usage example that highlights how the lack of stratification in the CV evaluation loop would have otherwise led to mistaken conclusions.

I just found this paper (full version here and related code here) with a use case in chemistry. It claims:

> The systematic subsets contain more active molecules than would be expected by simple random selection

In my understanding, stratifying on X ("structural diversity" in the terminology of said paper) mitigates data sampling biases. In the case of chemistry, a given lab may be more prone to oversampling a small region of the feature space (thus leading to over-representation of some combinations of features), whereas we want to cross-validate on folds with data coming from as broad a feature space as possible.

@j-at-ch

j-at-ch commented Feb 12, 2025

Nice @ArturoAmorQ!

I'm afraid I didn't quite get enough time last week to get a synthetic example together to a shareable standard, however...

I'm aiming for the same kind of argument with a healthcare application. There, one will usually be expected (if not for ethical, then for regulatory reasons) to have suitable representation of a number of categorical features (race, age groups, sex) in both the training set and the evaluation set. Usually these categories are very imbalanced: sometimes because a subpopulation might be a small proportion of the overall population; sometimes because it is harder to get some subpopulations to consent to having their data used; (+ other reasons...). This imbalance means that stratification is necessary in order to guarantee representation of the subpopulations.

Regarding the precise statistical need that subpopulation-stratified CV would satisfy (and because I'm not an expert on sampling), I'll step back for a second and pose a few questions with some brief answers:

  • Why do we do simple random CV? Usually because we want a data-efficient method of estimating the expected prediction error (defined according to some loss) of a model.

  • Why do we do class-(/y-)stratified CV? I'm not so clear on the precise statistical need this satisfies. Practically, it is often done to guarantee representation of the objects we wish to classify, where objects are defined by their class (y). We induce class-distribution leakage between train and evaluation, so it seems there should be a cost to generalisation.

  • Why would we want to do subpopulation-(/feature-)stratified CV? A couple of the statistical needs here are represented by the items mentioned here, particularly 1: "If measurements within strata have a lower standard deviation (as compared to the overall standard deviation in the population), stratification gives a smaller error in estimation." (A toy simulation of this effect is sketched below.) Additionally, there are the practical needs of representing the subpopulations, which will usually involve computing metrics over all subgroups; these may be undefined if members rarely find their way into the evaluation splits. I don't think the logic here is significantly different from that of class-stratified CV, except that we are thinking of subpopulations, rather than class labels, as defining objects.

I haven't had a huge amount of luck searching ML papers addressing these points from a theoretical perspective, but am keen to continue exploring. If anyone else already has a better understanding of the precise statistical situation, would love to hear.

@marenwestermann
Member

marenwestermann commented Feb 16, 2025

@adrinjalali opened this issue because at the time I was working on a machine learning project for my company that was focusing on predicting overcapacities on Germany's national railway network, and this feature would have been useful for us. (Unfortunately I can't give more details without seeking approval first.)

I was thinking that maybe we could do a user survey on this (maybe just a LinkedIn post). We could make a simple poll on whether people think this feature is useful and also ask people if they could provide us with use cases. However, I'm aware that this involves a bit of work and I don't know if anyone here currently has the capacity to look after such a survey.

I'm pinging @OpenRailAssociation here because this might potentially be relevant for other European railway organisations as well.
