Add assert_docstring_consistency checks #30854

Closed
85 tasks
glemaitre opened this issue Feb 18, 2025 · 9 comments
Labels
Documentation · good first issue (Easy with clear instructions to resolve) · Meta-issue (General issue associated to an identified list of tasks) · Sprint

Comments

@glemaitre (Member)

The assert_docstring_consistency function allows you to check the consistency of docstring parameters/attributes/returns across objects.

In scikit-learn there are often classes that share a parent (e.g., AdaBoostClassifier, AdaBoostRegressor) or related functions (e.g., f1_score, fbeta_score). In these cases, some parameters are often shared/common, and we would like to check that their docstring types and descriptions match.

The assert_docstring_consistency function allows you to include/exclude specific parameters/attributes/returns. In some cases only part of the description should match between objects; in that case you can use descr_regex_pattern to pass a regular expression to be matched against all descriptions. Please read the docstring of this function carefully.
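
For illustration, a consistency check for the two related metric functions mentioned above could look roughly like the sketch below. This is only a sketch, not code taken from scikit-learn: the import path (sklearn.utils._testing) and the keyword names are assumed from the usage discussed in this issue, and whether this exact parameter selection passes would need to be verified against the actual docstrings.

    # Sketch only: checks that f1_score and fbeta_score document their shared
    # parameters consistently. Import path and keyword names are assumptions.
    from sklearn.metrics import f1_score, fbeta_score
    from sklearn.utils._testing import assert_docstring_consistency

    def test_f1_fbeta_docstring_consistency():
        assert_docstring_consistency(
            [f1_score, fbeta_score],
            include_params=["y_true", "y_pred", "labels", "pos_label", "sample_weight"],
            # If only part of a description is expected to match, a regular
            # expression can be supplied instead of requiring exact equality:
            # descr_regex_pattern=r"...",
        )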

Guide on how to contribute to this issue:

  1. Pick an item below and comment which item you are working on, so others know it has been taken.
    • NOT all items listed require a test to be added. If you find that the item you selected does not require a test, this is still a valuable contribution; please comment the reason why and we can tick it off the list.
  2. Determine common parameters/attributes/returns between the objects.
    • If the description does not match but should, decide on the best wording and amend all objects to match. If only part of the description should match, consider using descr_regex_pattern.
  3. Write a new test.

See #29831 for an example: that PR adds a test for the stacking estimators StackingClassifier and StackingRegressor.
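
As a rough illustration of step 3 (not the actual code from #29831), a test for one of the class groups below might look like the following, again assuming the import path and keyword names used elsewhere in this issue. The parameter/attribute selection here is illustrative and may differ from the merged PR.

    # Hedged sketch of a class-level consistency test for the stacking estimators.
    from sklearn.ensemble import StackingClassifier, StackingRegressor
    from sklearn.utils._testing import assert_docstring_consistency

    def test_stacking_docstring_consistency():
        assert_docstring_consistency(
            [StackingClassifier, StackingRegressor],
            include_params=["cv", "n_jobs", "passthrough", "verbose"],
            include_attrs=["estimators_", "named_estimators_", "n_features_in_"],
        )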

Classes that share a common parent:

  • BaseWeightBoosting: ['AdaBoostClassifier', 'AdaBoostRegressor']
  • BaseBagging: ['BaggingClassifier', 'BaggingRegressor', 'IsolationForest']
  • BaseMixture: ['BayesianGaussianMixture', 'GaussianMixture']
  • _BaseDiscreteNB: ['BernoulliNB', 'CategoricalNB', 'ComplementNB', 'MultinomialNB']
  • _BaseKMeans: ['BisectingKMeans', 'KMeans', 'MiniBatchKMeans']
  • _PLS: ['CCA', 'PLSCanonical', 'PLSRegression']
  • _BaseChain: ['ClassifierChain', 'RegressorChain']
  • _VectorizerMixin: ['CountVectorizer', 'HashingVectorizer', 'TfidfVectorizer']
  • BaseDecisionTree: ['DecisionTreeClassifier', 'DecisionTreeRegressor', 'ExtraTreeClassifier', 'ExtraTreeRegressor']
  • _BaseSparseCoding: ['DictionaryLearning', 'MiniBatchDictionaryLearning', 'SparseCoder']
  • LinearModelCV: ['ElasticNetCV', 'LassoCV', 'MultiTaskElasticNetCV', 'MultiTaskLassoCV']
  • OutlierMixin: ['EllipticEnvelope', 'IsolationForest', 'LocalOutlierFactor', 'OneClassSVM', 'SGDOneClassSVM']
  • EmpiricalCovariance: ['EllipticEnvelope', 'GraphicalLasso', 'GraphicalLassoCV', 'LedoitWolf', 'MinCovDet', 'OAS', 'ShrunkCovariance']
  • ForestClassifier: ['ExtraTreesClassifier', 'RandomForestClassifier']
  • BaseForest: ['ExtraTreesClassifier', 'ExtraTreesRegressor', 'RandomForestClassifier', 'RandomForestRegressor', 'RandomTreesEmbedding']
  • ForestRegressor: ['ExtraTreesRegressor', 'RandomForestRegressor']
  • BaseThresholdClassifier: ['FixedThresholdClassifier', 'TunedThresholdClassifierCV']
  • _GeneralizedLinearRegressor: ['GammaRegressor', 'PoissonRegressor', 'TweedieRegressor']
  • BaseRandomProjection: ['GaussianRandomProjection', 'SparseRandomProjection']
  • _BaseFilter: ['GenericUnivariateSelect', 'SelectFdr', 'SelectFpr', 'SelectFwe', 'SelectKBest', 'SelectPercentile']
  • BaseGradientBoosting: ['GradientBoostingClassifier', 'GradientBoostingRegressor']
  • BaseGraphicalLasso: ['GraphicalLasso', 'GraphicalLassoCV']
  • BaseSearchCV: ['GridSearchCV', 'RandomizedSearchCV']
  • BaseHistGradientBoosting: ['HistGradientBoostingClassifier', 'HistGradientBoostingRegressor']
  • _BasePCA: ['IncrementalPCA', 'PCA']
  • _BaseImputer: ['KNNImputer', 'SimpleImputer']
  • KNeighborsMixin: ['KNeighborsClassifier', 'KNeighborsRegressor', 'KNeighborsTransformer', 'LocalOutlierFactor', 'NearestNeighbors']
  • NeighborsBase: ['KNeighborsClassifier', 'KNeighborsRegressor', 'KNeighborsTransformer', 'LocalOutlierFactor', 'NearestNeighbors', 'RadiusNeighborsClassifier', 'RadiusNeighborsRegressor', 'RadiusNeighborsTransformer']
  • BaseLabelPropagation: ['LabelPropagation', 'LabelSpreading']
  • Lars: ['LarsCV', 'LassoLars', 'LassoLarsCV', 'LassoLarsIC']
  • ElasticNet: ['Lasso', 'MultiTaskElasticNet', 'MultiTaskLasso']
  • BaseMultilayerPerceptron: ['MLPClassifier', 'MLPRegressor']
  • _BaseNMF: ['MiniBatchNMF', 'NMF']
  • _BaseSparsePCA: ['MiniBatchSparsePCA', 'SparsePCA']
  • _MultiOutputEstimator: ['MultiOutputClassifier', 'MultiOutputRegressor']
  • Lasso: ['MultiTaskElasticNet', 'MultiTaskLasso']
  • RadiusNeighborsMixin: ['NearestNeighbors', 'RadiusNeighborsClassifier', 'RadiusNeighborsRegressor', 'RadiusNeighborsTransformer']
  • BaseSVC: ['NuSVC', 'SVC']
  • BaseLibSVM: ['NuSVC', 'NuSVR', 'OneClassSVM', 'SVC', 'SVR']
  • _BaseEncoder: ['OneHotEncoder', 'OrdinalEncoder', 'TargetEncoder']
  • BaseSGDClassifier: ['PassiveAggressiveClassifier', 'Perceptron', 'SGDClassifier']
  • BaseSGD: ['PassiveAggressiveClassifier', 'PassiveAggressiveRegressor', 'Perceptron', 'SGDClassifier', 'SGDOneClassSVM', 'SGDRegressor']
  • BaseSGDRegressor: ['PassiveAggressiveRegressor', 'SGDRegressor']
  • _BaseRidge: ['Ridge', 'RidgeClassifier']
  • _BaseRidgeCV: ['RidgeCV', 'RidgeClassifierCV']
  • _RidgeClassifierMixin: ['RidgeClassifier', 'RidgeClassifierCV']
  • BaseSpectral: ['SpectralBiclustering', 'SpectralCoclustering']
  • BiclusterMixin: ['SpectralBiclustering', 'SpectralCoclustering']
  • _BaseStacking: ['StackingClassifier', 'StackingRegressor']
  • _BaseHeterogeneousEnsemble: ['StackingClassifier', 'StackingRegressor', 'VotingClassifier', 'VotingRegressor']
  • _BaseVoting: ['VotingClassifier', 'VotingRegressor']

Functions from the same module that share parameters:


I did a lot of manual culling as many functions shared only 1 or 2 parameters and were not actually relevant.

The functions are grouped by the parameters they share, so the list of shared parameters is not exhaustive for any subset of functions within a group. The grouping of functions below is not necessarily the most suitable one for the consistency check.

Module: sklearn.utils

  • Functions: compute_class_weight, compute_sample_weight / Shared parameters: class_weight
  • Functions: resample, shuffle / Shared parameters: random_state

Module: sklearn.utils.class_weight

  • Functions: compute_class_weight, compute_sample_weight / Shared parameters: class_weight, y

Module: sklearn.utils.extmath

  • Functions: randomized_range_finder, randomized_svd / Shared parameters: n_iter, power_iteration_normalizer, random_state

Module: sklearn.utils.validation

  • Functions: as_float_array, check_X_y, check_array / Shared parameters: copy, force_all_finite, ensure_all_finite
  • Functions: check_X_y, check_array / Shared parameters: accept_sparse, accept_large_sparse, order, force_writeable, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features

Module: sklearn.metrics

  • Functions: adjusted_mutual_info_score, adjusted_rand_score, completeness_score, fowlkes_mallows_score, homogeneity_completeness_v_measure, homogeneity_score, mutual_info_score, normalized_mutual_info_score, pair_confusion_matrix, rand_score, v_measure_score / Shared parameters: labels_true, labels_pred

Module: sklearn.metrics.pairwise

  • Functions: pairwise_distances_argmin, pairwise_distances_argmin_min / Shared parameters: axis, metric_kwargs
  • Functions: pairwise_distances, pairwise_distances_chunked, pairwise_kernels / Shared parameters: n_jobs
  • Functions: check_pairwise_arrays, pairwise_distances / Shared parameters: force_all_finite, ensure_all_finite
  • Functions: chi2_kernel, laplacian_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel / Shared parameters: gamma

Module: sklearn.cluster

  • Functions: affinity_propagation, estimate_bandwidth, k_means, kmeans_plusplus, spectral_clustering / Shared parameters: random_state
  • Functions: cluster_optics_dbscan, cluster_optics_xi / Shared parameters: reachability, ordering
  • Functions: compute_optics_graph, dbscan / Shared parameters: metric, p, metric_params, leaf_size
  • Functions: linkage_tree, ward_tree / Shared parameters: connectivity, return_distance

Module: sklearn.datasets

  • Functions: dump_svmlight_file, load_svmlight_file, load_svmlight_files / Shared parameters: zero_based, query_id, multilabel
  • Functions: fetch_20newsgroups, fetch_20newsgroups_vectorized, fetch_california_housing, fetch_covtype, fetch_file, fetch_kddcup99, fetch_lfw_pairs, fetch_lfw_people, fetch_olivetti_faces, fetch_openml, fetch_rcv1, fetch_species_distributions / Shared parameters: n_retries, delay
  • Functions: fetch_20newsgroups_vectorized, fetch_california_housing, fetch_covtype, fetch_kddcup99, fetch_openml, load_breast_cancer, load_diabetes, load_digits, load_iris, load_linnerud, load_wine / Shared parameters: as_frame
  • Functions: make_biclusters, make_checkerboard / Shared parameters: shape, n_clusters, minval, maxval
  • Functions: make_low_rank_matrix, make_regression / Shared parameters: effective_rank, tail_strength

Module: sklearn.decomposition

  • Functions: dict_learning, dict_learning_online, fastica, non_negative_factorization / Shared parameters: X, max_iter, n_components, random_state
  • Functions: dict_learning, dict_learning_online, sparse_encode / Shared parameters: alpha, n_jobs
  • Functions: dict_learning, dict_learning_online / Shared parameters: method, dict_init, callback, positive_dict, positive_code, method_max_iter

Module: sklearn.feature_extraction

  • Functions: grid_to_graph, img_to_graph / Shared parameters: mask, return_as, dtype

Module: sklearn.linear_model

  • Functions: lars_path, lars_path_gram, orthogonal_mp_gram / Shared parameters: Gram, copy_Gram
  • Functions: lars_path, lars_path_gram / Shared parameters: alpha_min, method
  • Functions: lars_path, lars_path_gram, ridge_regression / Shared parameters: max_iter
  • Functions: lars_path, lars_path_gram, orthogonal_mp, orthogonal_mp_gram / Shared parameters: return_path
  • Functions: orthogonal_mp, orthogonal_mp_gram / Shared parameters: n_nonzero_coefs, tol

Module: sklearn.neighbors

  • Functions: kneighbors_graph, radius_neighbors_graph / Shared parameters: X, mode, metric, p, metric_params, include_self, n_jobs

Module: sklearn.tree

  • Functions: export_graphviz, export_text, plot_tree / Shared parameters: decision_tree, max_depth, feature_names, class_names
  • Functions: export_graphviz, plot_tree / Shared parameters: label, filled, impurity, node_ids, proportion, rounded, precision

Module: sklearn.feature_selection

  • Functions: f_regression, r_regression / Shared parameters: center, force_finite
  • Functions: mutual_info_classif, mutual_info_regression / Shared parameters: discrete_features, n_neighbors, copy, random_state, n_jobs
@glemaitre added the Documentation, good first issue, Meta-issue, and Sprint labels on Feb 18, 2025
@SuhasSridhar2

Hi, I'd love to work on this issue as my first contribution!! Could you assign it to me? Or how can I contribute to the issue?

@glemaitre (Member, Author)

We need to wait for #30853 to be merged before starting to work on this issue.

Also, the issue will not be assigned to anyone because it is a meta-issue (several PRs will target it).

@raviteja-ganta commented Feb 18, 2025

Hi @glemaitre. I would like to work on this after #30853 is merged. Can I work on:

  1. BaseWeightBoosting: ['AdaBoostClassifier', 'AdaBoostRegressor']

  2. BaseBagging: ['BaggingClassifier', 'BaggingRegressor', 'IsolationForest']

  3. BaseDecisionTree: ['DecisionTreeClassifier', 'DecisionTreeRegressor', 'ExtraTreeClassifier', 'ExtraTreeRegressor'].
    Also, I guess they have to go into different PRs, right?

@lc542 commented Feb 18, 2025

Hi @glemaitre:
I would like to contribute to these after #30853 is merged:
KNeighborsMixin: ['KNeighborsClassifier', 'KNeighborsRegressor', 'KNeighborsTransformer', 'LocalOutlierFactor', 'NearestNeighbors']
and
NeighborsBase: ['KNeighborsClassifier', 'KNeighborsRegressor', 'KNeighborsTransformer', 'LocalOutlierFactor', 'NearestNeighbors', 'RadiusNeighborsClassifier', 'RadiusNeighborsRegressor', 'RadiusNeighborsTransformer']

Thank you!

@SuhasSridhar2

Hi @glemaitre
I would like to contribute after #30853 is merged:
ForestClassifier: ['ExtraTreesClassifier', 'RandomForestClassifier']

BaseForest: ['ExtraTreesClassifier', 'ExtraTreesRegressor', 'RandomForestClassifier', 'RandomForestRegressor', 'RandomTreesEmbedding']

ForestRegressor: ['ExtraTreesRegressor', 'RandomForestRegressor']

Thank you for the earlier clarification.

@StefanieSenger (Contributor) commented Feb 19, 2025

Dear new contributors:

We're happy you're interested in working on this issue. As for the process:

Please select one consistency check at a time and comment on this issue to claim it. Each consistency check should be a separate PR, and you claim them one by one: after you have opened a PR, you can claim a new one.

Thank you!

@StefanieSenger (Contributor)

I'm working on BaseBagging.

@glemaitre (Member, Author)

OK, @StefanieSenger found out that we are going to have trouble implementing those changes for all estimators. Let's close the issue for the moment. We will test a bit more to be sure that implementing those consistency checks will not lead to just copy-pasting the scikit-learn documentation into the test file via some regexes.

Sorry for the noise, but we are not ready yet ;) @StefanieSenger feel free to bring more details that can help us think about how to improve the current assert function.

@StefanieSenger (Contributor) commented Feb 19, 2025

(I was testing this under time pressure, because we have a first-time contributor sprint starting in a few hours and I needed to test this issue by then. It feels safer to put this issue on hold for now and re-open it when we know what we want. Surely @lucyleeow knows more about it, but I cannot reach her right now.)

In general, I think we need to work on defining the test file, or on defining a testing policy, to make clear what we expect from this issue.

I will describe the problem that I encountered when adding a test for the BaseBagging estimators like this:

    {
        "objects": [BaggingClassifier, BaggingRegressor, IsolationForest],
        "include_params": ["n_estimators" "max_samples", "max_features", "bootstrap", "n_jobs", "warm_start", "random_state", "verbose"],
        "exclude_params": None,
        "include_attrs": ["estimator_", "n_features_in_", "feature_names_in_", "estimators_", "estimators_samples_", "estimators_features_"],
        "exclude_attrs": None,
        "include_returns": False,
        "exclude_returns": None,
        "descr_regex_pattern": None,
    },

This way of testing leads to the problem that "n_estimators", "max_samples" and "bootstrap" should overall have the same documentation, but they have different defaults, so the added test fails (as it should). The desired workaround would be to add a new test for each of these separately using the "descr_regex_pattern" param. This would result in three additional testing dicts like the one above, each including a regex pattern, on top of the one testing all the other params and attributes, because for the moment, if I understand correctly, there is no way to pass a regex AND test several params/attributes at the same time.

A possible solution would be to allow a regex to be passed per param/attribute (maybe as a dict).
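
Purely as a hypothetical illustration of that idea (no such option exists in assert_docstring_consistency today), the per-parameter mapping might look like this:

    # Hypothetical API sketch only -- descr_regex_pattern accepting a dict that
    # maps each parameter/attribute name to its own regex, so several params can
    # be checked in one call while a few tolerate wording differences (e.g.
    # different defaults). This does NOT exist in scikit-learn; the patterns
    # below are placeholders, not the real docstring text.
    from sklearn.ensemble import BaggingClassifier, BaggingRegressor, IsolationForest
    from sklearn.utils._testing import assert_docstring_consistency  # path assumed

    assert_docstring_consistency(
        [BaggingClassifier, BaggingRegressor, IsolationForest],
        include_params=["n_estimators", "max_samples", "bootstrap", "n_jobs"],
        descr_regex_pattern={
            "n_estimators": r"The number of base estimators.*",
            "max_samples": r"The number of samples to draw.*",
            # names without an entry would fall back to exact matching
        },
    )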

We could also have a testing policy in place saying that we radically exclude params/attributes whose descriptions are too wordy to add tests for, and only use "descr_regex_pattern" in selected cases (that we pre-define beforehand).

@glemaitre was also concerned that by using "descr_regex_pattern" too much, we would basically re-write our whole documentation in the test file, and the file would become not very long, but very, very, very long.
