
SLEP006 - Metadata Routing task list #22893


Open · 20 of 28 tasks
adrinjalali opened this issue Mar 18, 2022 · 40 comments
Labels: Hard (Hard level of difficulty) · Meta-issue (General issue associated to an identified list of tasks)

Comments

@adrinjalali
Member

adrinjalali commented Mar 18, 2022

This issue is to track the work we need to do before we can merge sample-props branch into main:

Enhancements:

Cleanup

Open issues:

Our plan is to hopefully have this feature in 1.1, which we should be releasing in late April/early May.

cc @jnothman @thomasjpfan @lorentzenchr

Misc:


From #24027

Once we're finished with merging PRs to sample-props branch, we should merge this.

I've opened this for us to have an idea of what the whole project is touching.

We probably should rebase and merge this instead of squash and merge.

This closes and fixes the following issues and PRs:

@adrinjalali adrinjalali added the Meta-issue General issue associated to an identified list of tasks label Mar 18, 2022
@jnothman
Member

Do you see this as a collaborative effort? To what extent can we still use #20350? To what extent can we share test components? Which of these metaestimators have existing fit param routing that needs to be deprecated?

@jnothman
Member

Your list appears to be missing some *CV estimators (E.g. ElasticNetCV) that will have to route to splitters if not scorers.

@jnothman
Member

Also missing are functions in sklearn.model_selection._validation

@adrinjalali
Member Author

Do you see this as a collaborative effort? To what extent can we still use #20350? To what extent can we share test components? Which of these metaestimators have existing fit param routing that needs to be deprecated?

Most of what's in #20350 can be used. We can open this for collaboration once we figure out a nice way to deprecate existing routing strategies in our meta-estimators. I'll work on it this week.

@adrinjalali
Member Author

Also updated the list, it should be quite complete now I think.

@jnothman
Member

jnothman commented Mar 23, 2022

Re deprecation: one might have thought it makes sense to use the old logic when there is no request from any of a meta-estimator's descendant consumers... but a splitter requesting groups by default might mess that up...
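The heuristic under discussion, and why a default `groups` request defeats it, can be sketched in plain Python. This is a toy illustration only, not scikit-learn code; the function and dict shapes here are invented for the example:

```python
# Toy sketch of the deprecation heuristic discussed above: fall back to the
# legacy routing logic only when no descendant consumer has requested any
# metadata. A splitter such as GroupKFold that requests "groups" by default
# would always defeat this heuristic.

def should_use_legacy_routing(consumer_requests):
    """Return True only if no consumer requests any metadata."""
    return not any(requests for requests in consumer_requests.values())

# No requests anywhere: the old behaviour could safely apply.
print(should_use_legacy_routing({"estimator": {}}))  # True

# A splitter requesting `groups` by default breaks the heuristic,
# even though the user never set a request explicitly.
print(should_use_legacy_routing({"estimator": {}, "cv": {"groups": True}}))  # False
```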

@jnothman
Member

Hey @adrinjalali, how is the work on meta-estimators going? I'm wondering if we should do something crazy like pair programming on it if it's proving hard to get started?

@adrinjalali
Member Author

I spent a few days trying to write common tests for meta-estimators, but that didn't go anywhere. After being stuck for a while, @thomasjpfan and I spent some time together last week and decided to start with simple individual tests, beginning with one meta-estimator, and to refactor the tests later once we find recurring patterns.

Right now I'm working on multioutput meta-estimators and should have a PR coming today.

@jnothman
Member

Yes, I think starting with individual tests makes sense, even if I've been curious about reusable components... I think the base components are already well enough tested that tests beyond that need not be super extensive.

@adrinjalali adrinjalali added this to the 1.2 milestone Mar 29, 2022
BenjaminBossan added a commit to BenjaminBossan/scikit-learn that referenced this issue Aug 24, 2022
This PR adds metadata routing to BaggingClassifier and
BaggingRegressor (see scikit-learn#22893).

With this change, in addition to sample_weight, which was already
supported, it's now also possible to pass arbitrary fit_params to the
sub estimator.

Implementation

Most of the changes should be pretty straightforward with the existing
infrastructure for testing metadata routing. There was one aspect which
was not quite trivial though: The current implementation of bagging
works by inspecting the sub estimator's fit method. If the sub estimator
supports sample_weight, then subsampling is performed by making use of
sample weight. This will also happen if the user does not explicitly
pass sample weight.
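The signature inspection described above can be sketched in plain Python. This is a toy illustration, not the actual bagging code; scikit-learn ships a similar helper, `has_fit_parameter`, in `sklearn.utils.validation`:

```python
import inspect

def supports_sample_weight(estimator):
    # Sketch of the check bagging performs: does the sub-estimator's
    # fit signature accept a sample_weight parameter?
    return "sample_weight" in inspect.signature(estimator.fit).parameters

class WeightedEstimator:
    def fit(self, X, y, sample_weight=None):
        return self

class UnweightedEstimator:
    def fit(self, X, y):
        return self

# Bagging performs subsampling via sample_weight only for the first kind.
print(supports_sample_weight(WeightedEstimator()))    # True
print(supports_sample_weight(UnweightedEstimator()))  # False
```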

At first, I wanted to change the implementation such that if sample
weights are requested, subsampling should use the sample weight
approach, otherwise it shouldn't. However, that breaks a couple of
tests, so I rolled back the change and stuck very closely to the
existing implementation. I can't judge whether this prevents the user from doing certain things, or whether subsampling with vs. without sample_weight is equivalent.

Coincidental changes

The method _validate_estimator on the BaseEnsemble class used to
validate, and then set as attribute, the sub estimator. This was
inconvenient because for get_metadata_routing, we want to fetch the sub
estimator, which is not easily possible with this method. Therefore, a
change was introduced that the method now returns the sub estimator and
the caller is now responsible for setting it as an attribute. This has
the added advantages that the caller can now decide the attribute name
and that this method now more closely mirrors
_BaseHeterogeneousEnsemble._validate_estimators. Affected by this change are random forests, extra trees, and AdaBoost.

The function process_routing used to mutate the incoming param
dict (adding new items), now it creates a shallow copy first.
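The copy-before-update change can be illustrated with a toy sketch. The real `process_routing` does far more; only the shallow-copy behaviour is shown here, and the function name is illustrative:

```python
def process_routing_sketch(params, routed_extras):
    # Sketch of the fix described above: shallow-copy the incoming dict
    # instead of mutating it, so the caller's params stay untouched.
    params = dict(params)  # shallow copy
    params.update(routed_extras)
    return params

caller_params = {"sample_weight": [1.0, 2.0, 3.0]}
routed = process_routing_sketch(caller_params, {"groups": [0, 0, 1]})

print("groups" in routed)         # True
print("groups" in caller_params)  # False: caller's dict is left intact
```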

Extended docstring for check_input of BaseBagging._fit.

Testing

I noticed that the bagging tests didn't have a test case for sparse
input + using sample weights, so I extended an existing test to cover
it.

The test test_bagging_sample_weight_unsupported_but_passed now raises a
TypeError, not ValueError, when sample_weight is passed but not supported.
@OmarManzoor
Contributor

OmarManzoor commented Sep 12, 2022

Hi @adrinjalali, can I work on this issue? Would it be reasonable to start working on LogisticRegressionCV?

@adrinjalali adrinjalali added the Hard Hard level of difficulty label Sep 12, 2022
@adrinjalali
Member Author

@OmarManzoor you can give it a try, but beware: the work in this issue is very involved. I would probably recommend something less involved for you at this point, but giving it a try doesn't hurt :)

@OmarManzoor
Contributor

OmarManzoor commented Sep 13, 2022

@OmarManzoor you can give it a try, but beware: the work in this issue is very involved. I would probably recommend something less involved for you at this point, but giving it a try doesn't hurt :)

I tried checking out LogisticRegressionCV. Comparing it with the other meta-estimators whose PRs have been created, and going through the overall routing mechanism, this one seems a bit different: it does not contain any child estimators. Instead, it inherits from LogisticRegression and calls the functions _log_reg_scoring_path and _logistic_regression_path. Moreover, it seems the basic scoring might already be covered by an earlier PR of yours.

@adrinjalali
Member Author

Yes, I remember looking at that and it not being straightforward. In this case, instead of it being only a router, it's a consumer for whatever LogisticRegression accepts, and a router for the CV and scorer.
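That dual role can be sketched with a toy class in plain Python. This is not the real scikit-learn routing API; the class and all names here are invented for illustration:

```python
# Toy sketch of the dual role described above: an estimator that consumes
# some metadata itself (whatever LogisticRegression's fit accepts, e.g.
# sample_weight) while routing other metadata to its CV splitter and scorer.

class DualRoleRouter:
    def __init__(self, own_requests, splitter_requests, scorer_requests):
        self.own = set(own_requests)
        self.splitter = set(splitter_requests)
        self.scorer = set(scorer_requests)

    def route(self, **metadata):
        """Split incoming metadata between self and the routed objects."""
        return {
            "self": {k: v for k, v in metadata.items() if k in self.own},
            "splitter": {k: v for k, v in metadata.items() if k in self.splitter},
            "scorer": {k: v for k, v in metadata.items() if k in self.scorer},
        }

router = DualRoleRouter(
    own_requests={"sample_weight"},     # consumed like LogisticRegression would
    splitter_requests={"groups"},       # routed to the CV splitter
    scorer_requests={"sample_weight"},  # routed to the scorer
)
routed = router.route(sample_weight=[1, 1, 2], groups=[0, 0, 1])
print(routed["splitter"])  # {'groups': [0, 0, 1]}
```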

@thomasjpfan
Member

thomasjpfan commented Jun 2, 2023

Here are some direct follow up items from #24027

@StefanieSenger
Contributor

Working on FeatureUnion. :)

@StefanieSenger
Contributor

Working on RANSACRegressor now.

@adam2392
Member

I can work on SelfTrainingClassifier.

I also noticed that SelfTrainingClassifier still has the base_estimator keyword argument. Should I simultaneously deprecate that in favor of estimator like it is done in the rest of the repo?

@StefanieSenger
Contributor

I'm now working on StackingClassifier and StackingRegressor.

@StefanieSenger
Contributor

I'll be working on learning_curve next.

@OmarManzoor
Contributor

Working on TransformedTargetRegressor

@OmarManzoor
Contributor

Working on SequentialFeatureSelector

@adam2392
Member

I can start work on permutation_test_score

@OmarManzoor
Contributor

Working on RFE and RFECV

@StefanieSenger
Contributor

I'm working on validation_curve.

@adam2392
Member

Do we still need to implement this for AdaBoostClassifier (#24026)
and AdaBoostRegressor (#24026)?

I saw the corresponding PRs were closed, but wasn't sure whether that was just because they were stalled, or because they are unnecessary in this context.

If we still need to implement in those cases, I can work on AdaBoostClassifier/Regressor.

@adrinjalali
Member Author

Yes, we need to do them, @adam2392. But those PRs were stalled and closed; same for Bagging, I think. There were some challenges there which you can probably see from those PRs.

@adam2392
Member

I previously addressed Bagging in #28432.

So I'll take a look at Adaboost!
