Add strata to Stratified*Split CV splitters #26821

Open · adrinjalali opened this issue Jul 12, 2023 · 19 comments

@adrinjalali
Member

Right now the Stratified*Split classes take y as the strata, but y is not always what one wants to stratify on.

In train_test_split we allow a stratify arg (which I wonder whether it should be called strata instead), which defines the strata samples belong to. Inside, we basically do this:

    # Excerpt from scikit-learn's train_test_split internals; for context,
    # the imports it relies on:
    from itertools import chain

    from sklearn.model_selection import StratifiedShuffleSplit
    from sklearn.utils import _safe_indexing

    cv = StratifiedShuffleSplit(test_size=n_test, train_size=n_train, random_state=random_state)

    train, test = next(cv.split(X=arrays[0], y=stratify))

    return list(
        chain.from_iterable(
            (_safe_indexing(a, train), _safe_indexing(a, test)) for a in arrays
        )
    )

As you can see, we're passing stratify as y to the splitter. I think it would make sense to add a strata arg to the splitters and, when it is None, fall back to the values in y, as is the case now.

Note that now that we have SLEP6, we won't need separate classes for these; we'd simply request strata for the splitter:

cv = StratifiedShuffleSplit().set_split_request(strata=True)
...
GridSearchCV(model, param_grid, cv=cv).fit(X, y, strata=strata_values)
cross_validate(model, X, y, cv=cv, params={"strata": strata_values})

cc @marenwestermann

@github-actions github-actions bot added the Needs Triage label Jul 12, 2023
@adarsh-meher

Hi @adrinjalali. I am new to open source contributions and am looking for an opportunity to get started. Let me know if I can work on this issue.

@adrinjalali
Member Author

We still don't have consensus here on how to proceed, @adarsh-meher. It's better to work on issues that are marked "help wanted" and "good first issue".

@thomasjpfan
Member

All the Stratified*Split CVs assume that they are stratifying on y. In the ML context, I think this is the common use of the term "stratify". If there needs to be other "grouping information", then it becomes a "Group*CV".

train_test_split needs a stratify parameter because it can split any number of arrays, so there is no way to tell which array is the target. Looking through the codebase, only resample and train_test_split have a stratify parameter, for the same reason.

Currently, I am -1 on the proposal of adding strata to Stratified*Split CVs.

@thomasjpfan thomasjpfan added the New Feature and RFC labels and removed the Needs Triage label Jul 24, 2023
@adrinjalali
Member Author

@thomasjpfan Group*CV and Stratified*CV are pretty much opposites of one another. One makes sure each group ends up entirely in either the train or the test set; the other makes sure each group is split proportionately between the train and test sets.

Imagine you have a dataset where one attribute is the state (location) of the event, and due to certain constraints (laws, etc.) you want to have similar proportions of each state's data in train and test; then you stratify on state rather than on the output.

Another example is when you have train or flight data, where each route has its own peculiarities, but you want to train on the whole dataset rather than on a single route, while having enough data from each route in both train and test; here you'd stratify on "route", while y might be "delay".

I'm happy to move this to Group*CV, but then we'd need to add a parameter to decide whether the split should keep each group entirely in either train or test, or keep groups proportionate across train and test.

@j-at-ch

j-at-ch commented Jul 24, 2023

Just a quick comment from a regular user:

+1, I can confirm that the proposed feature has a use case that's prevalent in healthcare: stratifying across protected characteristics such as demographics (usually different from the target). (Note we'd usually also want to group by patient; a sketch of a partial workaround for that combination follows at the end of this comment.)

I would also humbly agree that strata is a more informative name than stratify.

@thomasjpfan
Member

I see the use cases and am now +1 on introducing strata to the splitters.

@ogrisel
Member

ogrisel commented Dec 5, 2024

> I'm happy to move this to Group*CV, but then we'd need to add a parameter to decide whether the split should keep each group entirely in either train or test, or keep groups proportionate across train and test.

The purpose of the Group*CV splitters is the opposite: to generalize across groups. So I would not imply that stratification is related to what we currently name "group*" splitters.

However, some people might want left-out groupwise splitting on some metadata and stratification on some other metadata, no?

@adrinjalali
Member Author

For me, this is what I'd aim for, I think:

cv = KFold(shuffle=True, n_splits=5).set_split_request(groups=True, strata=True)

GridSearchCV(..., cv=cv).fit(X, y, strata=strata_var, groups=groups_var)

# or
cross_validate(..., cv=cv, params={"groups": groups_var, "strata": strata_var})

@ogrisel
Member

ogrisel commented Dec 6, 2024

While I am not fundamentally opposed to this feature, I wonder what concrete need it fulfills and what precise statistical question it answers. If we do decide to implement it, we should start with a convincing practical usage example that highlights how the lack of stratification in the CV evaluation loop would have otherwise led to mistaken conclusions.

Disclaimer: at this time I do not see how to come up with such an example, but I would love to learn something.

@adrinjalali
Member Author

@j-at-ch do you have an example at hand to answer @ogrisel? I don't have a dataset right now; I've only used this, as you say, in healthcare-related work, and I've seen it in other areas as well, but I have no dataset at hand.

@ogrisel, on the other hand, you can believe us when we say this is what we do / have done in practice.

@ogrisel
Member

ogrisel commented Jan 2, 2025

> @ogrisel, on the other hand, you can believe us when we say this is what we do / have done in practice.

It's really not a problem of trust but more a problem of transparency and education. We should not offer new tools for our users to shoot themselves in the foot without a clear notice that documents proper use and known pitfalls.

@lorentzenchr
Member

+1 from me, also without a dataset. The interesting ones are mostly not public.

The arg should be named something like stratum or stratum_id (or strata). Ideally, the default is such that for classification it uses the target classes/labels.
As I would rather reduce the number of splitters, I'd favor introducing it as a new arg instead of a new splitter.

> We should not offer new tools for our users to shoot themselves in the foot without a clear notice that documents proper use and known pitfalls.

I don’t think we should be overprotective like that.

@ogrisel
Member

ogrisel commented Jan 3, 2025

> I don’t think we should be overprotective like that.

I sincerely believe that any extra code-maintenance complexity or usage cognitive load should be matched with demonstrable added user value that can be easily understood by both users and (future) contributors, e.g. via a short example (or preferably by extending an existing example), even if only on synthetic data in the event that no suitable public dataset can be found on openml.org or similar.

@adrinjalali
Member Author

We're at a deadlock here, and that's not a state I'd like to leave this in.

I personally don't have the datasets (they're all private) nor the energy to write up what @ogrisel would need to be happy with this moving forward.

At the same time, according to @ogrisel's arguments, we should remove stratification altogether; are we gonna do that?

I see enough support here to move forward. Should we call for a vote? What's the realistically actionable path forward? Anything other than this being blocked and stalled.

@j-at-ch

j-at-ch commented Feb 3, 2025

@adrinjalali - I'll set aside some time this week to try and put together an argument & synthetic example illustrating the issue that this would solve, for those that haven't faced it before. Can any vote wait until the beginning of next week?

(Apologies for my delayed response.)

@ogrisel
Member

ogrisel commented Feb 5, 2025

> I personally don't have the datasets (they're all private) nor the energy to write up what @ogrisel would need to be happy with this moving forward.

+1 for moving forward, and I will try to find the time to educate myself proactively on valid potential uses of this option so that we can document how to properly use it in the docstring / user guide or examples.

> @adrinjalali - I'll set aside some time this week to try and put together an argument & synthetic example illustrating the issue that this would solve, for those that haven't faced it before.

That would be great if you could contribute to this. Thanks!

> At the same time, according to @ogrisel's arguments, we should remove stratification altogether; are we gonna do that?

I don't think we are ever going to do that because:

  • we would first need to update our CV tools to handle single-class CV folds gracefully (a toy illustration of the problem follows this list), and I am not sure what the proper way is to address all edge cases correctly without too much complexity;
  • but even if we did, I think this would be a disruptive breaking change / noisy deprecation cycle to handle, and I am really not sure if it's worth it.

@ArturoAmorQ
Member

> While I am not fundamentally opposed to this feature, I wonder what concrete need it fulfills and what precise statistical question it answers. If we do decide to implement it, we should start with a convincing practical usage example that highlights how the lack of stratification in the CV evaluation loop would have otherwise led to mistaken conclusions.

I just found this paper (full version here and related code here) with a use case in chemistry. It claims:

> The systematic subsets contain more active molecules than would be expected by simple random selection

In my understanding, stratifying on X ("structural diversity" in the terminology of said paper) mitigates data sampling biases. In the case of chemistry, a given lab may be more prone to oversampling a small region of the feature space (thus leading to over-representation of some combinations of features), whereas we want to cross-validate on folds with data coming from as broad a feature space as possible.

@j-at-ch

j-at-ch commented Feb 12, 2025

Nice @ArturoAmorQ!

I'm afraid I didn't quite get enough time last week to get a synthetic example together to a shareable standard, however...

I'm aiming for the same kind of argument with a healthcare application. There, one will usually be expected (if not for ethical, then for regulatory reasons) to have suitable representation of a number of categorical features (race, age groups, sex) in both the training set and the evaluation set. Usually these categories are very imbalanced: sometimes because a subpopulation might be a small proportion of the overall population; sometimes because it is harder to get some subpopulations to consent to having their data used; (+ other reasons...). This imbalance means that stratification is necessary in order to guarantee representation of the subpopulations.

Regarding the precise statistical need that subpopulation-stratified CV would satisfy (and because I'm not an expert on sampling), I'll step back for a second and pose a few questions with some brief answers:

  • Why do we do simple random CV? Usually because we want a data-efficient method of estimating the expected prediction error (defined according to some loss) of a model.

  • Why do we do class-(/y-)stratified CV? I'm not so clear on the precise statistical need this satisfies. Practically, it is often done to guarantee representation of the objects we wish to classify, where objects are defined by their class (y). We induce class-distribution leakage between train and evaluation, so it seems there should be a cost to generalisation.

  • Why would we want to do subpopulation-(/feature-)stratified CV? A couple of the statistical needs here are represented by the items mentioned here, particularly 1: "If measurements within strata have a lower standard deviation (as compared to the overall standard deviation in the population), stratification gives a smaller error in estimation." (A toy simulation of this effect is sketched below.) Additionally, there are the practical needs of representing the subpopulations, which will usually involve computing metrics over all subgroups; these may be undefined if members rarely find their way into the evaluation splits. I don't think the logic here is significantly different from that of class-stratified CV, except that we are thinking of subpopulations, rather than class labels, as defining objects.

I haven't had a huge amount of luck searching ML papers addressing these points from a theoretical perspective, but am keen to continue exploring. If anyone else already has a better understanding of the precise statistical situation, would love to hear.

@marenwestermann
Member

marenwestermann commented Feb 16, 2025

@adrinjalali opened this issue because at the time I was working on a machine learning project for my company that was focusing on predicting overcapacities on Germany's national railway network, and this feature would have been useful for us. (Unfortunately I can't give more details without seeking approval first.)

I was thinking that maybe we could do a user survey on this (maybe just a LinkedIn post). We could make a simple poll on whether people think this feature is useful and also ask people if they could provide us with use cases. However, I'm aware that this involves a bit of work and I don't know if anyone here currently has the capacity to look after such a survey.

I'm pinging @OpenRailAssociation here because this might potentially be relevant for other European railway organisations as well.
