Add `strata` to `Stratified*Split` CV splitters #26821
Comments
Hi @adrinjalali. I am new to open source contributions and looking for an opportunity to get started. Let me know if I can work on this issue.
We still don't have a consensus here on how to proceed, @adarsh-meher. It's better to work on issues that are marked as "help wanted" and "good first issue".
All the … Currently, I am -1 on the proposal of adding `strata`.
@thomasjpfan Imagine you have a dataset where one attribute is the state (location) of the event, and due to certain constraints (laws, etc.) you want similar proportions of each state's data in train and test; then you stratify on state rather than on the output. Another example is train / flight data, where each route has its own peculiarities, but you want to train on the whole dataset rather than a single route while still having enough data from each route in train and test; here you'd stratify on "route". I'm happy to move this to …
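A minimal sketch of the current workaround for this kind of use case, passing the metadata column instead of the target as `y` to a stratified splitter (`route` is a made-up column for illustration; it is not part of any scikit-learn API):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)                  # actual prediction target
route = np.repeat(["A", "B", "C"], [70, 20, 10])  # hypothetical strata column

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Pass the strata column (not the target) as ``y`` so every fold keeps
# roughly the same proportion of each route.
for train_idx, test_idx in cv.split(X, route):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    # ... fit and evaluate here
```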
Just a quick comment from a regular user: +1. I can confirm that the proposed feature has a use case that's prevalent in healthcare: stratifying across protected characteristics such as demographics (usually different from the target). (Note we'd usually also want to group by patient.) I would also humbly agree …
I see the use cases and am now +1 on introducing `strata`.
The purpose of the … However, some people might want left-out groupwise splitting on some metadata and stratification on some other metadata, no?
For me, this is what I'd aim for, I think:

```python
cv = KFold(shuffle=True, n_splits=5).set_split_request(groups=True, strata=True)
GridSearchCV(..., cv=cv).fit(X, y, strata=strata_var, groups=groups_var)
# or
cross_validate(..., cv=cv, params={"groups": groups_var, "strata": strata_var})
```
While I am not fundamentally opposed to this feature, I wonder what concrete need it fulfills and what precise statistical question it answers. If we do decide to implement it, we should start with a convincing practical usage example that highlights how the lack of stratification in the CV evaluation loop would have otherwise led to mistaken conclusions. Disclaimer: at this time I do not see how to come up with such an example, but I would love to learn something.
@j-at-ch do you have something at hand as an example to answer @ogrisel? I don't have a dataset right now; I've only used it, as you say, in healthcare-related work and seen it in other areas as well, but no dataset at hand. @ogrisel, on the other hand, you can believe us when we say this is what we do / have done in practice.
It's really not a problem of trust but more a problem of transparency and education. We should not offer new tools for our users to shoot themselves in the foot without a clear notice that documents proper use and known pitfalls.
+1 from me, also without a dataset. The interesting ones are mostly not public. The arg should be named something like …
I don't think we should be overprotective like that.
I sincerely believe that any extra code-maintenance complexity or usage cognitive load should be matched with demonstrable added user value that can be easily understood by both users and (future) contributors, e.g. via a short example (or preferably by extending an existing example), even if only on synthetic data in the event that no suitable public dataset can be found on openml.org or similar.
We're at a deadlock here, and that's not a state I'd like to leave this in. I personally don't have the datasets (they're all private) nor the energy to write up what @ogrisel would need to be happy with this moving forward. At the same time, according to @ogrisel's arguments, we should remove stratification altogether; are we going to do that? I see enough support here to move forward. Should we call for a vote? What's the realistically actionable path forward? Anything other than this being blocked and stalled.
@adrinjalali - I'll set aside some time this week to try to put together an argument and a synthetic example illustrating the issue this would solve, for those who haven't faced it before. Can any vote wait until the beginning of next week? (Apologies for my delayed response.)
+1 for moving forward, and I will try to find the time to educate myself proactively on valid potential uses of this option so that we can document how to properly use it in the docstring / user guide or examples.
That would be great if you could contribute to this. Thanks!
I don't think we are ever going to do that because: …
I just found this paper (full version here and related code here) with a use case in chemistry. It claims: …
In my understanding, stratifying on X ("structural diversity" in the terminology of said paper) mitigates data-sampling biases. In the case of chemistry, a given lab may be more prone to oversample a small region of the feature space (thus leading to over-representation of some combinations of features), whereas we want to cross-validate on folds with data coming from as broad a feature space as possible.
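One way to approximate this today is to bin a one-dimensional summary of the feature space and stratify on the bin labels; a rough sketch on synthetic data (the random projection and quartile bins are arbitrary stand-ins for the paper's structural-diversity measure, not something the paper prescribes):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 8))

# Crude 1-D summary of the feature space (stand-in for a diversity score),
# cut into quartile bins that serve as strata.
summary = X @ rng.normal(size=8)
strata = np.digitize(summary, np.quantile(summary, [0.25, 0.5, 0.75]))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, strata):
    pass  # each fold now samples from all regions of the summarised feature space
```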
Nice @ArturoAmorQ! I'm afraid I didn't quite get enough time last week to get a synthetic example together to a shareable standard; however, I'm aiming for the same kind of argument with a healthcare application.

There, one will usually be expected (if not for ethical, then for regulatory reasons) to have suitable representation of a number of categorical features (race, age groups, sex) in both the training set and the evaluation set. Usually these categories are very imbalanced: sometimes because a subpopulation is a small proportion of the overall population; sometimes because it is harder to get some subpopulations to consent to having their data used; (+ other reasons). This imbalance means that stratification is necessary in order to guarantee representation of the subpopulations.

Regarding the precise statistical need that subpopulation-stratified CV would satisfy (and because I'm not an expert on sampling), I'll step back for a second and pose a few questions with some brief answers: …
I haven't had a huge amount of luck searching for ML papers addressing these points from a theoretical perspective, but I'm keen to continue exploring. If anyone else already has a better understanding of the precise statistical situation, I would love to hear it.
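A small synthetic sketch of the representation issue described above (the 5% subgroup size and fold counts are arbitrary assumptions), comparing how many members of a rare subgroup each test fold receives with and without stratifying on that demographic column:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
subgroup = np.zeros(n, dtype=int)
subgroup[:10] = 1          # 5% rare subpopulation
rng.shuffle(subgroup)

for cv in (KFold(n_splits=5, shuffle=True, random_state=0),
           StratifiedKFold(n_splits=5, shuffle=True, random_state=0)):
    counts = [int(subgroup[test].sum()) for _, test in cv.split(X, subgroup)]
    print(type(cv).__name__, counts)
# Plain KFold can leave some test folds with very few (or zero) rare samples,
# while StratifiedKFold guarantees 2 per fold here.
```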
@adrinjalali opened this issue because at the time I was working on a machine learning project for my company that focused on predicting overcapacities on Germany's national railway network, and this feature would have been useful for us. (Unfortunately I can't give more details without seeking approval first.)

I was thinking that maybe we could do a user survey on this (maybe just a LinkedIn post). We could make a simple poll on whether people think this feature is useful and also ask people if they could provide us with use cases. However, I'm aware that this involves a bit of work, and I don't know if anyone here currently has the capacity to look after such a survey.

I'm pinging @OpenRailAssociation here because this might potentially be relevant for other European railway organisations as well.
Right now the `Stratified*Split` classes take `y` as the strata, while that's not always the case. In `train_test_split` we allow a `stratify` arg (which I'm wondering if it should be called `strata`), which defines the groups samples belong to. And inside, we basically do this:
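(Paraphrased; internal names such as `CVClass`, `n_test`, `n_train`, and `arrays` follow the private implementation and may differ across versions.)

```python
if stratify is not None:
    CVClass = StratifiedShuffleSplit
else:
    CVClass = ShuffleSplit

cv = CVClass(test_size=n_test, train_size=n_train, random_state=random_state)

# The stratification column is handed to the splitter in place of ``y``.
train, test = next(cv.split(X=arrays[0], y=stratify))
```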
As you can see, we're passing `stratify` as `y` to the splitter. I think it would make sense to add a `strata` arg to the splitters and, if it is `None`, take the values in `y` instead, as it is now.

Note that now that we have SLEP6, we won't need a separate class for them; we'd simply request `strata` for the splitter:
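(A hypothetical sketch: neither `set_split_request(strata=...)` nor a routed `strata` parameter exists yet; the names are only illustrative.)

```python
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0).set_split_request(strata=True)

# ``strata_var`` would be routed to the splitter via SLEP6 metadata routing.
cross_validate(estimator, X, y, cv=cv, params={"strata": strata_var})
```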
cc @marenwestermann