Implement SLEP009: keyword-only arguments #15005

Closed · 39 of 40 tasks
jnothman opened this issue Sep 17, 2019 · 26 comments

@jnothman (Member) commented Sep 17, 2019

SLEP009 is all but accepted.

It proposes to make most parameters keyword-only.

We should do this by:

  • Merging ENH Add Deprecating Position Arguments Helper #13311 (a minimal sketch of how such a helper could work follows this list)
  • Perhaps getting some stats on usage of positional arguments as per SLEP009: keyword only arguments enhancement_proposals#19 (comment)
  • Applying the deprecation to each subpackage (checked means a PR has at least been opened).
    • base
    • calibration
    • cluster
    • compose
    • covariance
    • cross_decomposition
    • datasets
    • decomposition
    • discriminant_analysis
    • dummy
    • ensemble
    • feature_extraction
    • feature_selection
    • gaussian_process
    • impute
    • inspection
    • isotonic
    • kernel_approximation
    • kernel_ridge
    • linear_model
    • manifold
    • metrics
    • metrics.pairwise
    • mixture
    • model_selection
    • multiclass
    • multioutput
    • naive_bayes
    • neighbors
    • neural_network
    • pipeline
    • preprocessing
    • random_projection
    • semi_supervised
    • svm
    • tree
    • utils
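
A minimal sketch of how such a deprecation helper could work, assuming the target signatures already use the bare * marker to declare keyword-only parameters. This is only an illustration of the approach, not the exact helper from #13311; the name deprecate_positional_args and the trimmed PCA signature below are made up for the example.

import warnings
from functools import wraps
from inspect import Parameter, signature

def deprecate_positional_args(func):
    """Warn when arguments declared keyword-only are still passed positionally."""
    sig = signature(func)
    # Parameters that may legitimately be passed positionally: everything
    # declared before the bare * in the signature (including self).
    allowed = [name for name, p in sig.parameters.items()
               if p.kind in (Parameter.POSITIONAL_ONLY,
                             Parameter.POSITIONAL_OR_KEYWORD)]
    kwonly = [name for name, p in sig.parameters.items()
              if p.kind == Parameter.KEYWORD_ONLY]

    @wraps(func)
    def wrapper(*args, **kwargs):
        n_extra = len(args) - len(allowed)
        if n_extra > 0:
            names = kwonly[:n_extra]
            warnings.warn("Pass {} as keyword args; passing them positionally "
                          "is deprecated and will become an error in a future "
                          "release.".format(", ".join(names)), FutureWarning)
            # Remap the surplus positional values onto their keyword names so
            # the call keeps working during the deprecation period.
            kwargs.update(zip(names, args[len(allowed):]))
            args = args[:len(allowed)]
        return func(*args, **kwargs)

    return wrapper

class PCA:
    @deprecate_positional_args
    def __init__(self, n_components=None, *, copy=True, whiten=False):
        self.n_components, self.copy, self.whiten = n_components, copy, whiten

PCA(2)        # fine: n_components stays positional
PCA(2, True)  # FutureWarning: pass copy as a keyword argument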

We might along the way establish rules of thumb and principles like "are the semantics reasonably clear when the argument is passed positionally?" As I noted on the mailing list, I think they are clear for PCA's components, for Pipeline's steps, and for GridSearchCV's estimator and parameter grid. Other parameters of those estimators seem more suitable for keyword-only. Trickier is whether n_components in TSNE should follow PCA in being positional... It's not as commonly set by users.

@jnothman jnothman added this to the 0.22 milestone Sep 17, 2019
@adrinjalali (Member):

Realistically, this should be in 0.23. We already have too much on our plate, I think.

@jnothman (Member, Author) commented Sep 18, 2019 via email

@qinhanmin2014 (Member):

I think they are clear for PCA's components, for Pipeline's steps, and for GridSearchCV's estimator and parameter grid.

I'm wondering whether it's easy to decide which parameters should be positional, e.g., do we think PCA(2) is reasonable? At least I don't like it.

And how shall we make the decision? I guess we don't want to open a vote for every class/function, right? So is +2 enough, or do we need +3?

@rth (Member) commented Sep 18, 2019

e.g., do we think PCA(2) is reasonable? At least I don't like it.

It's about leaving some of this to the user's discretion versus forcing them to use what we think is best. Yes, PCA(2) is bad, while PCA(n_components) is reasonable. At least I wouldn't object to a PR using it, would you? Users can resent a limitation of their freedom (when to use positional args or not) in cases where there is no overwhelming reason for it.

So is +2 enough, or do we need +3?

I would say +2 since the SLEP was accepted, but wait a bit before merging to give a chance for feedback?

@qinhanmin2014 (Member):

At least I wouldn't object to a PR using it, would you?

I don't like it but I'm not opposed to it.

Users can resent a limitation of their freedom (when to use positional args or not) in cases where there is no overwhelming reason for it.

Hmm, not sure, but the SLEP has passed?

@rth (Member) commented Sep 18, 2019

At least I wouldn't object to a PR using it, would you?

I don't like it but I'm not opposed to it.

Let's put it another way. As a user, imagine I currently have a few thousand lines of perfectly fine code that uses PCA(n_components) (and other comparable use cases). If tomorrow it starts raising warnings telling me to change it to PCA(n_components=n_components) and requires me to do maintenance work without good reason, I would personally be unhappy and would complain about it to whatever project did that.

@jnothman (Member, Author) commented Sep 18, 2019 via email

@amueller (Member):

@agramfort explicitly argued for accepting PCA(n_components).

I really think the point made on the mailing list about allowing users to have clear expectations is important. If we can't write down a simple rule it's hard for users to have clear expectations.

Doing some quick stats:

from collections import Counter
from inspect import signature
from sklearn.utils.testing import all_estimators

# Count how often each parameter name appears as the first constructor
# argument across all estimators.
counts = Counter()

for name, est in all_estimators():
    sig = signature(est)
    if len(sig.parameters):
        first = list(sig.parameters.keys())[0]
        counts[first] += 1

print(len(counts), dict(counts))

There are 61 different first arguments in our estimators:

{'n_components': 25, 'alpha': 12, 'estimator': 11, 'base_estimator': 8, 'kernel': 8, 'n_clusters': 7, 'store_precision': 6, 'n_estimators': 6, 'n_neighbors': 6, 'loss': 6, 'score_func': 6, 'alphas': 5, 'fit_intercept': 5, 'criterion': 5, 'C': 3, 'eps': 3, 'input': 3, 'copy': 3, 'threshold': 3, 'penalty': 3, 'missing_values': 3, 'bandwidth': 2, 'priors': 2, 'l1_ratio': 2, 'strategy': 2, 'epsilon': 2, 'estimators': 2, 'n_bins': 2, 'n_iter': 2, 'hidden_layer_sizes': 2, 'categories': 2, 'nu': 2, 'radius': 2, 'norm': 2, 'skewedness': 1, 'with_centering': 1, 'damping': 1, 'dtype': 1, 'sample_steps': 1, 'gamma': 1, 'check_y': 1, 'transformers': 1, 'n_quantiles': 1, 'dictionary': 1, 'method': 1, 'degree': 1, 'neg_label': 1, 'steps': 1, 'patch_size': 1, 'solver': 1, 'n_features': 1, 'transformer_list': 1, 'func': 1, 'min_samples': 1, 'metric': 1, 'classes': 1, 'feature_range': 1, 'Cs': 1, 'y_min': 1, 'regressor': 1, 'n_nonzero_coefs': 1}

Maybe having a white-list of those that we allow would be useful?
Say, 'n_components', 'alpha', 'estimator', 'base_estimator', 'kernel' (this is not for SVC), 'n_clusters', 'n_estimators', 'n_neighbors', 'C', 'steps', 'regressor', 'transformers'?

Though C is a bit of an outlier, and I think having 'store_precision' be positional would not be very useful, so I didn't list it. Generally I think for all meta-estimators the first argument should be positional.

@thomasjpfan (Member):

Generally I think for all meta-estimators the first argument should be positional.

Agreed.

I prefer narrowing the list down to just clustering, decomposition, and meta estimator parameters: 'n_components', 'estimator', 'base_estimator', 'n_clusters', 'n_neighbors', 'steps', 'regressor', 'transformers'

@amueller (Member):

Hm, that's deprecating LogisticRegression(0.01) and RandomForestClassifier(100)... I'm not super opposed, but I could also see some resistance?

@jnothman (Member, Author):

LogisticRegression(0.01) isn't a thing: the first parameter is penalty

@jnothman (Member, Author) commented Sep 18, 2019 via email

@amueller (Member):

A major company recently did a giant GitHub scrape and analyzed sklearn usage. I asked them whether they could share their results.

@srggrs commented Oct 21, 2019

Hi guys,
I did some preliminary analysis for @jnothman using AST; these are the results for the aggregated data and the analysis related to the repo/file.

Let me know if you have any questions. I intend to upload the code I used to run this analysis soon (it needs a bit of cleaning!), so you can have a look.

Briefly: one script downloads the repos from a list (grabbed from the dependents tree, or by getting a list of repos via Google BigQuery on the GitHub public dataset - for the latter I ran out of free usage credits) and converts the *.ipynb files to *.py using nbconvert; then another script runs through all the repo files in parallel and searches, using AST, for calls to objects imported from all_estimators.
The code does not handle the case where the import is of the form: from sklearn.ensemble import RandomForestClassifier as RFC.
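
A rough sketch of the AST step, under the assumption that estimators are called via their imported name (the helper name and the snippet below are made up for illustration; like the original analysis, it does not resolve "import ... as" aliases):

import ast

def count_positional_args(source, estimator_names):
    """For each estimator name, record the number of positional args per call."""
    counts = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            # Handles both PCA(...) and sklearn.decomposition.PCA(...).
            name = getattr(node.func, "id", None) or getattr(node.func, "attr", None)
            if name in estimator_names:
                counts.setdefault(name, []).append(len(node.args))
    return counts

snippet = "from sklearn.decomposition import PCA\npca = PCA(2, whiten=True)"
print(count_positional_args(snippet, {"PCA"}))  # {'PCA': [1]}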

@thomasjpfan (Member):

@srggrs Thank you for the analysis!

It looks like nr_pos_args is bounded below by 1. Does this mean that all estimators use at least one positional argument?

@adrinjalali (Member):

Also, nr_pos_args max seems to be 1 for many estimators, which looks pretty odd to me.

@jnothman (Member, Author):

It looks like nr_pos_args is bounded below by 1. Does this mean that all estimators use at least one positional argument?

Because the analysis here is limited to class constructors (next version should not, I think, have this limitation), the first arg is always self, and so the lower bound of 1 makes sense.

@adrinjalali adrinjalali modified the milestones: 0.22, 0.23 Oct 29, 2019
@adrinjalali (Member):

I'm happy to have it in 0.22 if you still think we can get it into 0.22, @jnothman.

@jnothman (Member, Author) commented Oct 29, 2019 via email

@jnothman (Member, Author):

@adrinjalali I've added a list of the subpackages we've resolved this for, and those still to go, in the PR description.

@rth (Member) commented Apr 21, 2020

Once this is done, and included in the RC, we should heavily advertise the RC to make sure people discover potential issues before the final release. If there are many complaints, we might need to relax some of the most common positional arguments.

@thomasjpfan (Member):

Should we leave utils alone?

@NicolasHug (Member):

Should we leave utils alone?

I'll try to open a PR but I would agree there's no strong need to get it in for the release

@thomasjpfan (Member):

I'll try to open a PR but I would agree there's no strong need to get it in for the release

Okay, let's see what it looks like. In general, it would be "nice to have" since it would make the library more consistent and promote the usage of * in future util functions.
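
For illustration, a minimal sketch of what that looks like in a utility signature (the function below is a toy for the example, not a real scikit-learn util):

def normalize(values, *, scale=1.0, offset=0.0):
    # Everything after the bare * must be passed by keyword.
    return [scale * v + offset for v in values]

normalize([1, 2, 3], scale=2.0)  # OK -> [2.0, 4.0, 6.0]
# normalize([1, 2, 3], 2.0)      # TypeError: takes 1 positional argument but 2 were given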

@adrinjalali (Member):

I think this one is now complete. There may be missing ones, which we can deal with later with delayed deprecations.

@jnothman (Member, Author):

Thanks to everyone involved for all your effort. This is a great thing for making the parameters more findable in a year's time.
