Add assert_docstring_consistency checks #30854
Comments
Hi, I'd love to work on this issue as my first contribution! Could you assign it to me? Or can I contribute to the issue?
We need to wait until #30853 is merged before starting work on this issue. After that, the issue will still not be assigned, because it is a meta-issue (several PRs will target it).
Hi @glemaitre. I would like to work on this after 30853 is merged. Can I work on
Hi @glemaitre: Thank you!
Hi @glemaitre `BaseForest: ['ExtraTreesClassifier', 'ExtraTreesRegressor', 'RandomForestClassifier', 'RandomForestRegressor', 'RandomTreesEmbedding']` `ForestRegressor: ['ExtraTreesRegressor', 'RandomForestRegressor']` Thank you for the earlier clarification.
Dear new contributors: We're happy you're interested in working on this issue. As for the process: please select one consistency check at a time and comment on this issue. Each consistency check should be a separate PR, and you claim them one by one. After you have opened a PR, you can claim a new one. Thank you!
I'm working on
OK, @StefanieSenger found out that we are going to have trouble when implementing those changes for all estimators. Let's close the issue for the moment. We will test a bit more to make sure that implementing those consistency checks does not amount to copy-pasting the scikit-learn documentation into the test file via regexes. Sorry for the noise, but we are not ready yet ;) @StefanieSenger, feel free to add details that can help us think about how to improve the current assert function.
(I was testing this under time pressure, because we have a first-time contributor sprint starting in a few hours and I needed to test this issue by then. It feels safer to close the issue for now and re-open it when we know what we want. Surely @lucyleeow knows more about it, but I cannot access her knowledge right now.) In general, I think we need to work on defining the test file, or define a testing policy, to make clear what we expect from the issue. I will describe the problem that I encountered adding a test for the
Testing this way runs into a problem: "n_estimators", "max_samples" and "bootstrap" should have the same documentation overall, but they have different defaults, so the added test fails (as it should). The intended workaround is to add a separate test for each of these parameters using the "descr_regex_pattern" param of the test. This results in three additional testing dicts like the one above, each including a regex pattern, on top of the one that tests all the other params and attributes. As far as I understand, there is currently no way to pass a regex AND test several params/attributes at the same time. A possible solution would be to allow a regex to be passed per param/attribute (maybe as a dict). We could also put a testing policy in place saying that we strictly exclude params/attributes that are too wordy to add tests for, and only use "descr_regex_pattern" in selected cases that we define beforehand. @glemaitre was also concerned that, by using "descr_regex_pattern" too much, we would basically re-write our whole documentation and the file would become not very long, but very, very, very long.
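To make the "regex per param/attribute (maybe as a dict)" idea concrete, here is a minimal standard-library sketch. It is not the scikit-learn implementation: `check_param_consistency`, `regex_by_param`, and the toy `ForestA`/`ForestB` descriptions are all hypothetical names invented for illustration. It only shows how one call could compare several parameters while relaxing the comparison for the ones whose defaults differ.

```python
import re


def check_param_consistency(param_docs, regex_by_param=None):
    """Return the shared parameter names whose descriptions are inconsistent.

    param_docs: dict mapping object name -> {param name -> description}.
    regex_by_param: optional dict mapping param name -> regex. When a regex is
    given for a parameter, only the text it matches is compared.
    (Hypothetical helper, not part of scikit-learn.)
    """
    regex_by_param = regex_by_param or {}
    # Only parameters documented by every object can be compared.
    shared = set.intersection(*(set(docs) for docs in param_docs.values()))
    inconsistent = []
    for param in sorted(shared):
        pattern = regex_by_param.get(param)
        compared = set()
        for docs in param_docs.values():
            text = " ".join(docs[param].split())  # normalize whitespace
            if pattern is not None:
                match = re.search(pattern, text)
                text = match.group(0) if match else None
            compared.add(text)
        # Inconsistent if texts differ or the regex failed to match somewhere.
        if len(compared) > 1 or None in compared:
            inconsistent.append(param)
    return inconsistent


# Toy descriptions: same wording, different defaults for "n_estimators".
docs = {
    "ForestA": {
        "n_estimators": "The number of trees in the forest. Default is 100.",
        "bootstrap": "Whether bootstrap samples are used when building trees.",
    },
    "ForestB": {
        "n_estimators": "The number of trees in the forest. Default is 10.",
        "bootstrap": "Whether bootstrap samples are used when building trees.",
    },
}

strict = check_param_consistency(docs)
relaxed = check_param_consistency(
    docs, regex_by_param={"n_estimators": r"The number of trees in the forest\."}
)
print(strict, relaxed)
```

With a per-parameter mapping, parameters without an entry are still compared in full, so a single test can cover the strict and the relaxed cases together. The cost raised in the comment above remains: every regex duplicates a piece of the documentation in the test file.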
The `assert_docstring_consistency` function allows you to check the consistency between docstring parameters/attributes/returns of objects.

In scikit-learn there are often classes that share a parent (e.g., `AdaBoostClassifier`, `AdaBoostRegressor`) or related functions (e.g., `f1_score`, `fbeta_score`). In these cases, some parameters are often shared/common and we would like to check that the docstring type and description match.

The `assert_docstring_consistency` function allows you to include/exclude specific parameters/attributes/returns. In some cases only part of the description should match between objects. In this case you can use `descr_regex_pattern` to pass a regular expression to be matched to all descriptions. Please read the docstring of this function carefully.

Guide on how to contribute to this issue:
- [...] `descr_regex_pattern`.
- [...] `sklearn/tests/test_docstring_parameters_consistency.py` (cf. "TST move test for parameters consistency checks" #30853)
- [...] `@skip_if_no_numpydoc` to the top of the test (these tests can only be run if numpydoc is installed)

See #29831 for an example. This PR adds a test for the stacking estimators `StackingClassifier` and `StackingRegressor`.

Classes that share a common parent:
- `BaseWeightBoosting`: ['AdaBoostClassifier', 'AdaBoostRegressor']
- `BaseBagging`: ['BaggingClassifier', 'BaggingRegressor', 'IsolationForest']
- `BaseMixture`: ['BayesianGaussianMixture', 'GaussianMixture']
- `_BaseDiscreteNB`: ['BernoulliNB', 'CategoricalNB', 'ComplementNB', 'MultinomialNB']
- `_BaseKMeans`: ['BisectingKMeans', 'KMeans', 'MiniBatchKMeans']
- `_PLS`: ['CCA', 'PLSCanonical', 'PLSRegression']
- `_BaseChain`: ['ClassifierChain', 'RegressorChain']
- `_VectorizerMixin`: ['CountVectorizer', 'HashingVectorizer', 'TfidfVectorizer']
- `BaseDecisionTree`: ['DecisionTreeClassifier', 'DecisionTreeRegressor', 'ExtraTreeClassifier', 'ExtraTreeRegressor']
- `_BaseSparseCoding`: ['DictionaryLearning', 'MiniBatchDictionaryLearning', 'SparseCoder']
- `LinearModelCV`: ['ElasticNetCV', 'LassoCV', 'MultiTaskElasticNetCV', 'MultiTaskLassoCV']
- `OutlierMixin`: ['EllipticEnvelope', 'IsolationForest', 'LocalOutlierFactor', 'OneClassSVM', 'SGDOneClassSVM']
- `EmpiricalCovariance`: ['EllipticEnvelope', 'GraphicalLasso', 'GraphicalLassoCV', 'LedoitWolf', 'MinCovDet', 'OAS', 'ShrunkCovariance']
- `ForestClassifier`: ['ExtraTreesClassifier', 'RandomForestClassifier']
- `ForestRegressor`: ['ExtraTreesRegressor', 'RandomForestRegressor']
- `BaseThresholdClassifier`: ['FixedThresholdClassifier', 'TunedThresholdClassifierCV']
- `_GeneralizedLinearRegressor`: ['GammaRegressor', 'PoissonRegressor', 'TweedieRegressor']
- `BaseRandomProjection`: ['GaussianRandomProjection', 'SparseRandomProjection']
- `_BaseFilter`: ['GenericUnivariateSelect', 'SelectFdr', 'SelectFpr', 'SelectFwe', 'SelectKBest', 'SelectPercentile']
- `BaseGradientBoosting`: ['GradientBoostingClassifier', 'GradientBoostingRegressor']
- `BaseGraphicalLasso`: ['GraphicalLasso', 'GraphicalLassoCV']
- `BaseSearchCV`: ['GridSearchCV', 'RandomizedSearchCV']
- `BaseHistGradientBoosting`: ['HistGradientBoostingClassifier', 'HistGradientBoostingRegressor']
- `_BasePCA`: ['IncrementalPCA', 'PCA']
- `_BaseImputer`: ['KNNImputer', 'SimpleImputer']
- `KNeighborsMixin`: ['KNeighborsClassifier', 'KNeighborsRegressor', 'KNeighborsTransformer', 'LocalOutlierFactor', 'NearestNeighbors']
- `NeighborsBase`: ['KNeighborsClassifier', 'KNeighborsRegressor', 'KNeighborsTransformer', 'LocalOutlierFactor', 'NearestNeighbors', 'RadiusNeighborsClassifier', 'RadiusNeighborsRegressor', 'RadiusNeighborsTransformer']
- `BaseLabelPropagation`: ['LabelPropagation', 'LabelSpreading']
- `Lars`: ['LarsCV', 'LassoLars', 'LassoLarsCV', 'LassoLarsIC']
- `ElasticNet`: ['Lasso', 'MultiTaskElasticNet', 'MultiTaskLasso']
- `BaseMultilayerPerceptron`: ['MLPClassifier', 'MLPRegressor']
- `_BaseNMF`: ['MiniBatchNMF', 'NMF']
- `_BaseSparsePCA`: ['MiniBatchSparsePCA', 'SparsePCA']
- `_MultiOutputEstimator`: ['MultiOutputClassifier', 'MultiOutputRegressor']
- `Lasso`: ['MultiTaskElasticNet', 'MultiTaskLasso']
- `RadiusNeighborsMixin`: ['NearestNeighbors', 'RadiusNeighborsClassifier', 'RadiusNeighborsRegressor', 'RadiusNeighborsTransformer']
- `BaseSVC`: ['NuSVC', 'SVC']
- `BaseLibSVM`: ['NuSVC', 'NuSVR', 'OneClassSVM', 'SVC', 'SVR']
- `_BaseEncoder`: ['OneHotEncoder', 'OrdinalEncoder', 'TargetEncoder']
- `BaseSGDClassifier`: ['PassiveAggressiveClassifier', 'Perceptron', 'SGDClassifier']
- `BaseSGD`: ['PassiveAggressiveClassifier', 'PassiveAggressiveRegressor', 'Perceptron', 'SGDClassifier', 'SGDOneClassSVM', 'SGDRegressor']
- `BaseSGDRegressor`: ['PassiveAggressiveRegressor', 'SGDRegressor']
- `_BaseRidge`: ['Ridge', 'RidgeClassifier']
- `_BaseRidgeCV`: ['RidgeCV', 'RidgeClassifierCV']
- `_RidgeClassifierMixin`: ['RidgeClassifier', 'RidgeClassifierCV']
- `BaseSpectral`: ['SpectralBiclustering', 'SpectralCoclustering']
- `BiclusterMixin`: ['SpectralBiclustering', 'SpectralCoclustering']
- `_BaseStacking`: ['StackingClassifier', 'StackingRegressor']
- `_BaseHeterogeneousEnsemble`: ['StackingClassifier', 'StackingRegressor', 'VotingClassifier', 'VotingRegressor']
- `_BaseVoting`: ['VotingClassifier', 'VotingRegressor']

Functions, from the same module, that share parameters.
I did a lot of manual culling, as many functions shared only 1 or 2 parameters and were not actually relevant.
The functions are grouped by the parameters they share, so the list of shared parameters is not exhaustive for any subset of functions within a group. The grouping of functions below is not necessarily ideal for the consistency check.
Module: sklearn.utils
Module: sklearn.utils.class_weight
Module: sklearn.utils.extmath
Module: sklearn.utils.validation
Module: sklearn.metrics
Module: sklearn.metrics.pairwise
Module: sklearn.cluster
Module: sklearn.datasets
Module: sklearn.decomposition
Module: sklearn.feature_extraction
Module: sklearn.linear_model
Module: sklearn.neighbors
Module: sklearn.tree
Module: sklearn.feature_selection
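For readers wondering how a "classes that share a common parent" listing like the one above could be derived, here is a small sketch that walks each class's method resolution order. It is an assumption about the approach, not the script used for this issue: `group_by_parent` is a hypothetical helper, and the classes are empty toy stand-ins that only borrow scikit-learn names for readability.

```python
from collections import defaultdict


# Toy stand-ins for a small class hierarchy (not the real scikit-learn classes).
class BaseForest: ...
class ForestRegressor(BaseForest): ...
class ExtraTreesRegressor(ForestRegressor): ...
class RandomForestRegressor(ForestRegressor): ...
class RandomTreesEmbedding(BaseForest): ...


def group_by_parent(classes):
    """Map each ancestor name to the sorted names of the given classes under it.

    Only ancestors shared by at least two of the given classes are kept.
    (Hypothetical helper for illustration.)
    """
    groups = defaultdict(list)
    for cls in classes:
        # __mro__[0] is the class itself; skip it and the universal `object`.
        for parent in cls.__mro__[1:]:
            if parent is object:
                continue
            groups[parent.__name__].append(cls.__name__)
    return {
        parent: sorted(children)
        for parent, children in sorted(groups.items())
        if len(children) > 1
    }


groups = group_by_parent(
    [ExtraTreesRegressor, RandomForestRegressor, RandomTreesEmbedding]
)
for parent, children in groups.items():
    print(f"{parent}: {children}")
```

Because every ancestor in the MRO is considered, a class can appear under several parents, which is consistent with the overlapping groups in the listing (e.g. `BaseSGD` and `BaseSGDRegressor`).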