SLEP006 - Metadata Routing task list #22893
Comments
Do you see this as a collaborative effort? To what extent can we still use #20350? To what extent can we share test components? Which of these meta-estimators have existing fit param routing that needs to be deprecated?
Your list appears to be missing some
Also missing are functions in
Most of what's in #20350 can be used. We can open this up for collaboration once we figure out a nice way to deprecate the existing routing strategies in our meta-estimators. I'll work on it this week.
Also updated the list; it should be quite complete now, I think.
Re deprecation: one might have thought that using the old logic, where there's no request from a meta-estimator's descendant consumers, makes sense... but a splitter's request for
Hey @adrinjalali, how is the work on meta-estimators going? I'm wondering if we should do something crazy like pair programming on it, if it's proving hard to get started?
I spent a few days trying to write common tests for meta-estimators, but that didn't go anywhere. After being stuck for a while, @thomasjpfan and I spent some time together last week and decided to start with simple individual tests, beginning with one meta-estimator, and then refactor the tests later when we find recurring patterns. Right now I'm working on the multioutput meta-estimators and should have a PR coming today.
Yes, I think starting with individual tests makes sense, even if I've been curious about reusable components... I think the base components are already well enough tested that the tests beyond that need not be super extensive.
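The per-meta-estimator test pattern being discussed can be sketched with toy classes (all names here are hypothetical; this is not scikit-learn code): a recording sub-estimator lets the test assert that the metadata actually reached its `fit`.

```python
class RecordingEstimator:
    """Toy consumer that records which metadata its fit() received."""

    def __init__(self):
        self.received = {}

    def fit(self, X, y, sample_weight=None):
        self.received["sample_weight"] = sample_weight
        return self


class ToyMetaEstimator:
    """Toy router: forwards fit params to its wrapped sub-estimator."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y, **fit_params):
        self.estimator.fit(X, y, **fit_params)
        return self


# The test pattern: fit through the router, then assert the
# metadata arrived at the consumer unchanged.
sub = RecordingEstimator()
ToyMetaEstimator(sub).fit([[0]], [0], sample_weight=[2.0])
assert sub.received["sample_weight"] == [2.0]
```

The same skeleton can later be factored into a shared helper once recurring patterns emerge, which is the refactoring step mentioned above.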
This PR adds metadata routing to BaggingClassifier and BaggingRegressor (see scikit-learn#22893). With this change, in addition to sample_weight, which was already supported, it's now also possible to pass arbitrary fit_params to the sub-estimator.

Implementation

Most of the changes should be pretty straightforward with the existing infrastructure for testing metadata routing. One aspect was not quite trivial, though: the current implementation of bagging works by inspecting the sub-estimator's fit method. If the sub-estimator supports sample_weight, then subsampling is performed by making use of sample weights; this also happens if the user does not explicitly pass sample weights. At first, I wanted to change the implementation so that subsampling uses the sample-weight approach only if sample weights are requested. However, that broke a couple of tests, so I rolled back the change and stuck very closely to the existing implementation. I can't judge whether this prevents the user from doing certain things, or whether subsampling with vs. without sample_weight is equivalent.

Coincidental changes

- The method _validate_estimator on the BaseEnsemble class used to validate the sub-estimator and then set it as an attribute. This was inconvenient because for get_metadata_routing we want to fetch the sub-estimator, which is not easily possible with this method. Therefore, the method now returns the sub-estimator, and the caller is responsible for setting it as an attribute. This has the added advantages that the caller can decide the attribute name and that the method now more closely mirrors _BaseHeterogeneousEnsemble._validate_estimators. Affected by this change are random forests, extra trees, and AdaBoost.
- The function process_routing used to mutate the incoming param dict (adding new items); now it creates a shallow copy first.
- Extended the docstring for check_input of BaseBagging._fit.

Testing

I noticed that the bagging tests didn't have a test case for sparse input combined with sample weights, so I extended an existing test to cover it. The test test_bagging_sample_weight_unsupported_but_passed now raises a TypeError, not a ValueError, when sample_weight is passed but not supported.
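The process_routing change (copy instead of mutate) can be illustrated with a toy stand-in. This is a sketch of the non-mutating behaviour only, not scikit-learn's actual implementation, and the function name is hypothetical:

```python
def process_routing_toy(params, extra):
    # Toy stand-in for the fixed behaviour: work on a shallow copy
    # so the caller's dict is never mutated when new items are added.
    routed = dict(params)  # shallow copy -- the fix described above
    routed.update(extra)
    return routed


fit_params = {"sample_weight": [1.0, 2.0]}
routed = process_routing_toy(fit_params, {"groups": [0, 1]})
assert "groups" not in fit_params  # caller's dict is untouched
assert routed["groups"] == [0, 1]
```

Note this is a shallow copy: the values themselves (e.g. the sample_weight list) are still shared with the caller, which matches the behaviour described above.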
Hi @adrinjalali, can I work on this issue? Would it be reasonable to start with LogisticRegressionCV?
@OmarManzoor you can give it a try, but beware: the work in this issue is very involved. I would probably recommend something less involved at this point for you, but giving it a try doesn't hurt :)
I tried checking out LogisticRegressionCV. Comparing it with the other meta-estimators whose PRs have been created, and going through the overall mechanism of routing, this one seems a bit different, as it does not seem to contain any child estimators. Instead, it inherits from LogisticRegression and calls the functions _log_reg_scoring_path and _logistic_regression_path. Moreover, it seems that the basic scoring might also be covered by an earlier PR of yours.
Yes, I remember looking at that and it not being straightforward. In this case, instead of being only a router, it's a consumer for whatever
Here are some direct follow-up items from #24027
Working on
Working on
I can work on
I also noticed that SelfTrainingClassifier still has the
I'm now working on StackingClassifier and StackingRegressor.
I'll be working on
Working on
Working on
I can start work on
Working on
I'm working on
Do we still need to implement this for AdaBoostClassifier (#24026)? I saw the corresponding PRs were closed, but wasn't sure if that was just because they stalled, or unnecessary in this context. If we still need to implement it in those cases, I can work on AdaBoostClassifier/Regressor.
Yes, we need to do them, @adam2392. But those PRs were stalled and closed; same for Bagging, I think. There were some challenges there, which you can probably see from those PRs.
Bagging I previously addressed in #28432, so I'll take a look at AdaBoost!
This issue is to track the work we need to do before we can merge the `sample-props` branch into `main`:

- `sample-props`. This PR only touches `BaseEstimator` and hence consumers. It does NOT touch meta-estimators, scorers or cv splitters.
- `sample-props`: FEAT add metadata routing to splitters #22765
- `sample-props` (note that this involves an ongoing discussion on whether we'd like to mutate a scorer or not): FEAT SLEP6: scorers #22757
- `get_metadata_routing` in easy cases, and a whole lot more in cases where we'd like to keep backward compatibility in parsing input args such as `estimator__param` in `Pipeline`
- `sample-props`, do a few tests with third-party estimators to see if they'd work out of the box. Note that consumer estimators should work out of the box as long as they inherit from `BaseEstimator` and their `fit` accepts metadata as explicit arguments rather than `**kwargs`.
- `_metadata_requests.py` and work with scikit-learn meta-estimators w/o depending on the library's `BaseEstimator`.
- Merge `sample-props` into `main`: FEAT SLEP006: metadata routing infrastructure #24027

Enhancements:

- Cleanup `set_{method}_request` methods are all legit #26505

Open issues:

- `set_{method}_request` methods #23933
- `response_method` used by a scorer #27977

Our plan is to hopefully have this feature in 1.1, which we should be releasing in late April/early May.
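The note above, that consumers work out of the box as long as `fit` takes metadata as explicit arguments rather than `**kwargs`, comes down to signature introspection. A toy sketch of the idea (not scikit-learn's actual introspection code; all class and method names here are hypothetical):

```python
import inspect


class BaseEstimatorLike:
    # Toy stand-in for the kind of introspection routing relies on:
    # metadata is discovered from fit()'s explicit parameters.
    def get_fit_metadata(self):
        params = inspect.signature(self.fit).parameters
        return [
            name
            for name, p in params.items()
            if name not in ("X", "y")
            and p.kind is not inspect.Parameter.VAR_KEYWORD
        ]


class ExplicitConsumer(BaseEstimatorLike):
    # Explicit metadata argument: discoverable by introspection.
    def fit(self, X, y, sample_weight=None):
        return self


class OpaqueConsumer(BaseEstimatorLike):
    # **kwargs hides which metadata fit() actually accepts.
    def fit(self, X, y, **kwargs):
        return self


assert ExplicitConsumer().get_fit_metadata() == ["sample_weight"]
assert OpaqueConsumer().get_fit_metadata() == []
```

This is why a third-party estimator with an explicit `sample_weight=None` parameter can participate in routing without any code changes, while one accepting only `**kwargs` cannot advertise what it consumes.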
cc @jnothman @thomasjpfan @lorentzenchr
Misc:
From #24027
Once we're finished with merging PRs to the `sample-props` branch, we should merge this. I've opened this for us to have an idea of what the whole project is touching. We probably should rebase and merge this instead of squash and merge.
This closes and fixes the following issues and PRs:
Routing issues:
Enables Add TimeSeriesCV and HomogeneousTimeSeriesCV #6322,
AdaBoost:
Fixes Make Pipeline compatible with AdaBoost #2630,
Fixes fit_params for Adaboost estimator #21706
ClassifierChain:
Fixes ClassifierChain does not support GroupKFold #11429,
FeatureUnion:
Fixes `fit_params` in conjunction with `FeatureUnion` #7136
LogisticRegressionCV:
Fixes LogisticRegressionCV not compatible with LeaveOneGroupOut #8950,
Multioutput*
Fixes MultiOutputRegressor: Support for more fit parameters #15953
OneVsRestClassifier
Fixes OneVsRestClassifier's fit method does not accept kwargs #10882
RFE/RFECV:
Fixes RFE/RFECV doesn't work with sample weights #7308
*SearchCV:
Fixes Nested CV of LeaveOneGroupOut fails in permutation_test_score #8127,
Fixes grid_search: feeding parameters to scorer functions #8158,
cross_val*:
Fixes Weighted scoring in cross validation (Closes #4632) #13432,
Fixes Should cross-validation scoring take sample-weights into account? #4632,
Fixes GroupKFold fails in nested cross-validation (similar to #2879) #7646,
Fixes `model_selection.cross_validate` doesn't pass `groups` argument to estimator #20349
Stacking*
Fixes Support fit_params in stacking #18028,
Voting*
Fixes Enable mixed ensembles with estimators that do & don't accept the sample_weight fit_param #20167