Use _check_sample_weight to consistently validate sample_weight #15358

Closed
13 tasks done
NicolasHug opened this issue Oct 24, 2019 · 29 comments · Fixed by #15478, #15495, #15505, #15530 or #16322

@NicolasHug
Member

NicolasHug commented Oct 24, 2019

We recently introduced utils.validation._check_sample_weight which returns a validated sample_weight array.

We should use it consistently throughout the code base, instead of relying on custom, ad hoc checks like check_consistent_length or check_array (which are now handled by _check_sample_weight).

Here's a list of the estimators/functions that could make use of it (mostly in fit or partial_fit):

  • CalibratedClassifierCV
  • DBSCAN
  • DummyClassifier
  • DummyRegressor
  • BaseBagging
  • BaseForest
  • BaseGradientBoosting
  • IsotonicRegression
  • KernelRidge
  • GaussianNB
  • BaseDiscreteNB
  • KernelDensity
  • BaseDecisionTree

(I left out the linear_model module because it seems more involved there)

Could be a decent sprint issue, @amueller?

To know where a given class is defined, use e.g. git grep -n "class DBSCAN"
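
For anyone picking one of these up, a minimal sketch of the intended change could look like the following (MyEstimator and its fit body are purely illustrative; only the sample_weight handling matters):

```python
from sklearn.base import BaseEstimator
from sklearn.utils.validation import check_X_y, _check_sample_weight


class MyEstimator(BaseEstimator):
    def fit(self, X, y, sample_weight=None):
        X, y = check_X_y(X, y)
        # Before: ad hoc validation such as
        #   if sample_weight is not None:
        #       sample_weight = check_array(sample_weight, ensure_2d=False)
        #       check_consistent_length(y, sample_weight)
        # After: a single call that checks dtype, shape and length, and
        # returns an array of ones when sample_weight is None.
        sample_weight = _check_sample_weight(sample_weight, X)
        ...  # fit using the validated 1-D float array
        return self
```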

@lorentzenchr
Member

@NicolasHug I could give it a try. Furthermore, should _check_sample_weight also guarantee non-negativity and sum(sw) > 0?

@rth
Member

rth commented Nov 2, 2019

I think for the above-mentioned estimators @NicolasHug intended this as an easier refactoring issue for new contributors, but if you want to look into it, feel free to open PRs.

(I left out the linear_model module because it seems more involved there)

@lorentzenchr Your expertise would certainly be appreciated there. As you mention in #15438, there is definitely work to be done on improving sample_weight handling consistency in linear models.

Furthermore, should _check_sample_weight also guarantee non-negativity and sum(sw) > 0?

There are some use cases where such weights are useful (see #12464 (comment)), but in most cases it would indeed make sense to error on them. In the linked issue it was suggested to enable this check but allow it to be disabled with a global flag in sklearn.set_config. Adding that global config flag and the corresponding check in _check_sample_weight could be a separate PR.
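
To make that concrete, a rough sketch of what such a check might look like, assuming a hypothetical allow_negative_sample_weight config flag (the flag name, default and exact behaviour are not decided; this is not existing API):

```python
import numpy as np
from sklearn import get_config


def _check_weight_sign_and_total(sample_weight):
    # Hypothetical helper; could be called from _check_sample_weight.
    if get_config().get("allow_negative_sample_weight", False):
        return  # the user explicitly opted out of the check
    if np.any(sample_weight < 0):
        raise ValueError("Negative values in sample_weight are not supported.")
    if sample_weight.sum() <= 0:
        raise ValueError("sample_weight must sum to a strictly positive value.")
```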

@lorentzenchr
Member

intended this as an easier refactoring issue for new contributors

In that case, I will focus on _check_sample_weight and on linear models. So new contributors are still welcome 😃

@NicolasHug
Member Author

Thanks @lorentzenchr .

BTW @rth maybe we should add a return_ones parameter to _check_sample_weight. It would be more convenient than protecting the call with if sample_weight is not None: ...

@amueller
Member

amueller commented Nov 2, 2019

Should sample_weight be made an array in all cases? I feel like we often have shortcuts when it's None, and I don't see why we would introduce a multiplication by a constant array.

Also: negative sample weights used to be allowed in tree-based models, not sure if they still are.
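
For reference, the shortcut pattern in question looks roughly like this (illustrative only):

```python
from sklearn.utils.validation import _check_sample_weight


def _fit_impl(X, y, sample_weight=None):
    if sample_weight is not None:
        sample_weight = _check_sample_weight(sample_weight, X)
        # ... weighted code path ...
    else:
        # ... unweighted fast path, no ones array is ever allocated ...
        pass
```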

@NicolasHug
Member Author

Should sample_weight be made an array in all cases?

No, hence my comment above.

@amueller
Member

amueller commented Nov 2, 2019

Oh sorry didn't see that ;)

@salliewalecka
Contributor

@fbchow and I will pick up the fix for BaseDecisionTree for the scikit-learn sprint

@kesshijordan
Contributor

@fbchow, we will do DBSCAN for the wimlds scikit-learn sprint (pair programming with @akeshavan)

@lakrish
Contributor

lakrish commented Nov 2, 2019

Working on BaseBagging for the wimlds sprint with Honglu Zhang (@ritalulu)

@theoptips
Contributor

@MDouriez and I will work on the GaussianNB one

@kesshijordan
Contributor

kesshijordan commented Nov 2, 2019

Working on BaseGradientBoosting for the wimlds sprint (pair programming with @akeshavan)

@ritalulu
Contributor

ritalulu commented Nov 2, 2019

Working on BaseForest for the wimlds sprint (pair programming with @lakrish)

@amueller
Member

amueller commented Nov 2, 2019

Often the check is within an if sample_weight is not None: block, so we wouldn't need to add an argument.

@salliewalecka
Contributor

@fbchow and I will move on to DummyClassifier.

@theoptips
Contributor

theoptips commented Nov 2, 2019

[MRG] Fixed DummyRegressor
test_dummy.py passed
PR here

@ritalulu
Contributor

ritalulu commented Nov 2, 2019

Working on the IsotonicRegression one (notice that the isotonic_regression function doesn't have a validating step for sample_weight, so no need to work on that one)

@salliewalecka
Contributor

Picking up KernelDensity with @fbchow

@salliewalecka
Contributor

salliewalecka commented Nov 2, 2019

In case people are wondering, the ones left up for grabs are:

  • CalibratedClassifierCV
  • KernelRidge
  • BaseDiscreteNB

@rth
Member

rth commented Nov 15, 2019

We now need a GitHub feature to prevent an issue from being closed by PRs...

@cmarmo
Contributor

cmarmo commented Jan 13, 2020

@NicolasHug I've updated the list of the estimators that have already been addressed; I think the issue will be more readable if that list is at the top of the page. Do you think you could find a minute to update it? Thanks a lot!

@NicolasHug
Member Author

Done, thanks @cmarmo

@lithomas1
Contributor

lithomas1 commented Jan 18, 2020

I can take KernelRidge.

@marijavlajic
Contributor

We'd like to try to tackle this one for BaseDiscreteNB. @gelavizh1 @lschwetlick #ScikitLearnSprint

@marijavlajic
Contributor

Now working with @gelavizh1 and @lschwetlick on IsotonicRegression.

@NicolasHug
Member Author

@jeremiedbb it looks like everything has been addressed? We can open a specific issue for the linear models. BTW looks like there was already an open PR for IsotonicRegression...

@jeremiedbb
Member

jeremiedbb commented Jan 30, 2020

BaseDiscreteNB has not been addressed yet

It's done actually (I reviewed it :/)

BTW looks like there was already an open PR for IsotonicRegression...

Yes, I guess it was overtaken a bit quickly. I'm closing it.
