Add assert_docstring_consistency checks #30854

Closed
85 tasks
glemaitre opened this issue Feb 18, 2025 · 9 comments
Labels
Documentation · good first issue (Easy with clear instructions to resolve) · Meta-issue (General issue associated to an identified list of tasks) · Sprint

Comments

@glemaitre (Member)

The assert_docstring_consistency function allows you to check the consistency of docstring parameters/attributes/returns across objects.

In scikit-learn there are often classes that share a parent (e.g., AdaBoostClassifier, AdaBoostRegressor) or related functions (e.g., f1_score, fbeta_score). In these cases, some parameters are often shared/common, and we would like to check that their docstring types and descriptions match.

The assert_docstring_consistency function allows you to include/exclude specific parameters/attributes/returns. In some cases only part of the description should match between objects; in that case you can use descr_regex_pattern to pass a regular expression to be matched against all descriptions. Please read the docstring of this function carefully.
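
For illustration, a consistency check for the two related metric functions mentioned above could look roughly like the sketch below. This is only a sketch, not code taken from scikit-learn: the import path (sklearn.utils._testing) and the keyword names are assumed from the usage discussed in this issue, and whether this exact parameter selection passes would need to be verified against the actual docstrings.

    # Sketch only: checks that f1_score and fbeta_score document their shared
    # parameters consistently. Import path and keyword names are assumptions.
    from sklearn.metrics import f1_score, fbeta_score
    from sklearn.utils._testing import assert_docstring_consistency

    def test_f1_fbeta_docstring_consistency():
        assert_docstring_consistency(
            [f1_score, fbeta_score],
            include_params=["y_true", "y_pred", "labels", "pos_label", "sample_weight"],
            # If only part of a description is expected to match, a regular
            # expression can be supplied instead of requiring exact equality:
            # descr_regex_pattern=r"...",
        )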

Guide on how to contribute to this issue:

  1. Pick an item below and comment which item you are working on, so others know it has been taken.
    • NOT all items listed require a test to be added. If you find that the item you selected does not require a test, this is still a valuable contribution; please comment the reason why and we can tick it off the list.
  2. Determine common parameters/attributes/returns between the objects.
    • If the description does not match but should, decide on the best wording and amend all objects to match. If only part of the description should match, consider using descr_regex_pattern.
  3. Write a new test.

See #29831 for an example: that PR adds a test for the stacking estimators StackingClassifier and StackingRegressor.
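
As a rough illustration of step 3 (not the actual code from #29831), a test for one of the class groups below might look like the following, again assuming the import path and keyword names used elsewhere in this issue. The parameter/attribute selection here is illustrative and may differ from the merged PR.

    # Hedged sketch of a class-level consistency test for the stacking estimators.
    from sklearn.ensemble import StackingClassifier, StackingRegressor
    from sklearn.utils._testing import assert_docstring_consistency

    def test_stacking_docstring_consistency():
        assert_docstring_consistency(
            [StackingClassifier, StackingRegressor],
            include_params=["cv", "n_jobs", "passthrough", "verbose"],
            include_attrs=["estimators_", "named_estimators_", "n_features_in_"],
        )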

Classes that share a common parent:

  • BaseWeightBoosting: ['AdaBoostClassifier', 'AdaBoostRegressor']
  • BaseBagging: ['BaggingClassifier', 'BaggingRegressor', 'IsolationForest']
  • BaseMixture: ['BayesianGaussianMixture', 'GaussianMixture']
  • _BaseDiscreteNB: ['BernoulliNB', 'CategoricalNB', 'ComplementNB', 'MultinomialNB']
  • _BaseKMeans: ['BisectingKMeans', 'KMeans', 'MiniBatchKMeans']
  • _PLS: ['CCA', 'PLSCanonical', 'PLSRegression']
  • _BaseChain: ['ClassifierChain', 'RegressorChain']
  • _VectorizerMixin: ['CountVectorizer', 'HashingVectorizer', 'TfidfVectorizer']
  • BaseDecisionTree: ['DecisionTreeClassifier', 'DecisionTreeRegressor', 'ExtraTreeClassifier', 'ExtraTreeRegressor']
  • _BaseSparseCoding: ['DictionaryLearning', 'MiniBatchDictionaryLearning', 'SparseCoder']
  • LinearModelCV: ['ElasticNetCV', 'LassoCV', 'MultiTaskElasticNetCV', 'MultiTaskLassoCV']
  • OutlierMixin: ['EllipticEnvelope', 'IsolationForest', 'LocalOutlierFactor', 'OneClassSVM', 'SGDOneClassSVM']
  • EmpiricalCovariance: ['EllipticEnvelope', 'GraphicalLasso', 'GraphicalLassoCV', 'LedoitWolf', 'MinCovDet', 'OAS', 'ShrunkCovariance']
  • ForestClassifier: ['ExtraTreesClassifier', 'RandomForestClassifier']
  • BaseForest: ['ExtraTreesClassifier', 'ExtraTreesRegressor', 'RandomForestClassifier', 'RandomForestRegressor', 'RandomTreesEmbedding']
  • ForestRegressor: ['ExtraTreesRegressor', 'RandomForestRegressor']
  • BaseThresholdClassifier: ['FixedThresholdClassifier', 'TunedThresholdClassifierCV']
  • _GeneralizedLinearRegressor: ['GammaRegressor', 'PoissonRegressor', 'TweedieRegressor']
  • BaseRandomProjection: ['GaussianRandomProjection', 'SparseRandomProjection']
  • _BaseFilter: ['GenericUnivariateSelect', 'SelectFdr', 'SelectFpr', 'SelectFwe', 'SelectKBest', 'SelectPercentile']
  • BaseGradientBoosting: ['GradientBoostingClassifier', 'GradientBoostingRegressor']
  • BaseGraphicalLasso: ['GraphicalLasso', 'GraphicalLassoCV']
  • BaseSearchCV: ['GridSearchCV', 'RandomizedSearchCV']
  • BaseHistGradientBoosting: ['HistGradientBoostingClassifier', 'HistGradientBoostingRegressor']
  • _BasePCA: ['IncrementalPCA', 'PCA']
  • _BaseImputer: ['KNNImputer', 'SimpleImputer']
  • KNeighborsMixin: ['KNeighborsClassifier', 'KNeighborsRegressor', 'KNeighborsTransformer', 'LocalOutlierFactor', 'NearestNeighbors']
  • NeighborsBase: ['KNeighborsClassifier', 'KNeighborsRegressor', 'KNeighborsTransformer', 'LocalOutlierFactor', 'NearestNeighbors', 'RadiusNeighborsClassifier', 'RadiusNeighborsRegressor', 'RadiusNeighborsTransformer']
  • BaseLabelPropagation: ['LabelPropagation', 'LabelSpreading']
  • Lars: ['LarsCV', 'LassoLars', 'LassoLarsCV', 'LassoLarsIC']
  • ElasticNet: ['Lasso', 'MultiTaskElasticNet', 'MultiTaskLasso']
  • BaseMultilayerPerceptron: ['MLPClassifier', 'MLPRegressor']
  • _BaseNMF: ['MiniBatchNMF', 'NMF']
  • _BaseSparsePCA: ['MiniBatchSparsePCA', 'SparsePCA']
  • _MultiOutputEstimator: ['MultiOutputClassifier', 'MultiOutputRegressor']
  • Lasso: ['MultiTaskElasticNet', 'MultiTaskLasso']
  • RadiusNeighborsMixin: ['NearestNeighbors', 'RadiusNeighborsClassifier', 'RadiusNeighborsRegressor', 'RadiusNeighborsTransformer']
  • BaseSVC: ['NuSVC', 'SVC']
  • BaseLibSVM: ['NuSVC', 'NuSVR', 'OneClassSVM', 'SVC', 'SVR']
  • _BaseEncoder: ['OneHotEncoder', 'OrdinalEncoder', 'TargetEncoder']
  • BaseSGDClassifier: ['PassiveAggressiveClassifier', 'Perceptron', 'SGDClassifier']
  • BaseSGD: ['PassiveAggressiveClassifier', 'PassiveAggressiveRegressor', 'Perceptron', 'SGDClassifier', 'SGDOneClassSVM', 'SGDRegressor']
  • BaseSGDRegressor: ['PassiveAggressiveRegressor', 'SGDRegressor']
  • _BaseRidge: ['Ridge', 'RidgeClassifier']
  • _BaseRidgeCV: ['RidgeCV', 'RidgeClassifierCV']
  • _RidgeClassifierMixin: ['RidgeClassifier', 'RidgeClassifierCV']
  • BaseSpectral: ['SpectralBiclustering', 'SpectralCoclustering']
  • BiclusterMixin: ['SpectralBiclustering', 'SpectralCoclustering']
  • _BaseStacking: ['StackingClassifier', 'StackingRegressor']
  • _BaseHeterogeneousEnsemble: ['StackingClassifier', 'StackingRegressor', 'VotingClassifier', 'VotingRegressor']
  • _BaseVoting: ['VotingClassifier', 'VotingRegressor']

Functions from the same module that share parameters:


I did a lot of manual culling as many functions shared only 1 or 2 parameters and were not actually relevant.

The functions are grouped by the parameters they share, so the list of shared parameters is not exhaustive for any subset of functions within a group. The grouping of functions below is not necessarily the most suitable one for the consistency check.

Module: sklearn.utils

  • Functions: compute_class_weight, compute_sample_weight / Shared parameters: class_weight
  • Functions: resample, shuffle / Shared parameters: random_state

Module: sklearn.utils.class_weight

  • Functions: compute_class_weight, compute_sample_weight / Shared parameters: class_weight, y

Module: sklearn.utils.extmath

  • Functions: randomized_range_finder, randomized_svd / Shared parameters: n_iter, power_iteration_normalizer, random_state

Module: sklearn.utils.validation

  • Functions: as_float_array, check_X_y, check_array / Shared parameters: copy, force_all_finite, ensure_all_finite
  • Functions: check_X_y, check_array / Shared parameters: accept_sparse, accept_large_sparse, order, force_writeable, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features

Module: sklearn.metrics

  • Functions: adjusted_mutual_info_score, adjusted_rand_score, completeness_score, fowlkes_mallows_score, homogeneity_completeness_v_measure, homogeneity_score, mutual_info_score, normalized_mutual_info_score, pair_confusion_matrix, rand_score, v_measure_score / Shared parameters: labels_true, labels_pred

Module: sklearn.metrics.pairwise

  • Functions: pairwise_distances_argmin, pairwise_distances_argmin_min / Shared parameters: axis, metric_kwargs
  • Functions: pairwise_distances, pairwise_distances_chunked, pairwise_kernels / Shared parameters: n_jobs
  • Functions: check_pairwise_arrays, pairwise_distances / Shared parameters: force_all_finite, ensure_all_finite
  • Functions: chi2_kernel, laplacian_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel / Shared parameters: gamma

Module: sklearn.cluster

  • Functions: affinity_propagation, estimate_bandwidth, k_means, kmeans_plusplus, spectral_clustering / Shared parameters: random_state
  • Functions: cluster_optics_dbscan, cluster_optics_xi / Shared parameters: reachability, ordering
  • Functions: compute_optics_graph, dbscan / Shared parameters: metric, p, metric_params, leaf_size
  • Functions: linkage_tree, ward_tree / Shared parameters: connectivity, return_distance

Module: sklearn.datasets

  • Functions: dump_svmlight_file, load_svmlight_file, load_svmlight_files / Shared parameters: zero_based, query_id, multilabel
  • Functions: fetch_20newsgroups, fetch_20newsgroups_vectorized, fetch_california_housing, fetch_covtype, fetch_file, fetch_kddcup99, fetch_lfw_pairs, fetch_lfw_people, fetch_olivetti_faces, fetch_openml, fetch_rcv1, fetch_species_distributions / Shared parameters: n_retries, delay
  • Functions: fetch_20newsgroups_vectorized, fetch_california_housing, fetch_covtype, fetch_kddcup99, fetch_openml, load_breast_cancer, load_diabetes, load_digits, load_iris, load_linnerud, load_wine / Shared parameters: as_frame
  • Functions: make_biclusters, make_checkerboard / Shared parameters: shape, n_clusters, minval, maxval
  • Functions: make_low_rank_matrix, make_regression / Shared parameters: effective_rank, tail_strength

Module: sklearn.decomposition

  • Functions: dict_learning, dict_learning_online, fastica, non_negative_factorization / Shared parameters: X, max_iter, n_components, random_state
  • Functions: dict_learning, dict_learning_online, sparse_encode / Shared parameters: alpha, n_jobs
  • Functions: dict_learning, dict_learning_online / Shared parameters: method, dict_init, callback, positive_dict, positive_code, method_max_iter

Module: sklearn.feature_extraction

  • Functions: grid_to_graph, img_to_graph / Shared parameters: mask, return_as, dtype

Module: sklearn.linear_model

  • Functions: lars_path, lars_path_gram, orthogonal_mp_gram / Shared parameters: Gram, copy_Gram
  • Functions: lars_path, lars_path_gram / Shared parameters: alpha_min, method
  • Functions: lars_path, lars_path_gram, ridge_regression / Shared parameters: max_iter
  • Functions: lars_path, lars_path_gram, orthogonal_mp, orthogonal_mp_gram / Shared parameters: return_path
  • Functions: orthogonal_mp, orthogonal_mp_gram / Shared parameters: n_nonzero_coefs, tol

Module: sklearn.neighbors

  • Functions: kneighbors_graph, radius_neighbors_graph / Shared parameters: X, mode, metric, p, metric_params, include_self, n_jobs

Module: sklearn.tree

  • Functions: export_graphviz, export_text, plot_tree / Shared parameters: decision_tree, max_depth, feature_names, class_names
  • Functions: export_graphviz, plot_tree / Shared parameters: label, filled, impurity, node_ids, proportion, rounded, precision

Module: sklearn.feature_selection

  • Functions: f_regression, r_regression / Shared parameters: center, force_finite
  • Functions: mutual_info_classif, mutual_info_regression / Shared parameters: discrete_features, n_neighbors, copy, random_state, n_jobs
@glemaitre added the Documentation, good first issue, Meta-issue, and Sprint labels on Feb 18, 2025
@SuhasSridhar2

Hi, I'd love to work on this issue as my first contribution!! Could you assign it to me? Or how can I contribute to the issue?

@glemaitre (Member, Author)

We need to wait for #30853 to be merged before starting to work on this issue.

Also, the issue will not be assigned to anyone because it is a meta-issue (several PRs will target it).

@raviteja-ganta commented Feb 18, 2025

Hi @glemaitre. I would like to work on this after #30853 is merged. Can I work on:

  1. BaseWeightBoosting: ['AdaBoostClassifier', 'AdaBoostRegressor']

  2. BaseBagging: ['BaggingClassifier', 'BaggingRegressor', 'IsolationForest']

  3. BaseDecisionTree: ['DecisionTreeClassifier', 'DecisionTreeRegressor', 'ExtraTreeClassifier', 'ExtraTreeRegressor'].
    Also, I guess they have to go into different PRs, right?

@lc542 commented Feb 18, 2025

Hi @glemaitre:
I would like to contribute to these after #30853 is merged:
KNeighborsMixin: ['KNeighborsClassifier', 'KNeighborsRegressor', 'KNeighborsTransformer', 'LocalOutlierFactor', 'NearestNeighbors']
and
NeighborsBase: ['KNeighborsClassifier', 'KNeighborsRegressor', 'KNeighborsTransformer', 'LocalOutlierFactor', 'NearestNeighbors', 'RadiusNeighborsClassifier', 'RadiusNeighborsRegressor', 'RadiusNeighborsTransformer']

Thank you!

@SuhasSridhar2

Hi @glemaitre
I would like to contribute after #30853 is merged:
ForestClassifier: ['ExtraTreesClassifier', 'RandomForestClassifier']

BaseForest: ['ExtraTreesClassifier', 'ExtraTreesRegressor', 'RandomForestClassifier', 'RandomForestRegressor', 'RandomTreesEmbedding']

ForestRegressor: ['ExtraTreesRegressor', 'RandomForestRegressor']

Thank you for the earlier clarification.

@StefanieSenger (Contributor) commented Feb 19, 2025

Dear new contributors:

We're happy you're interested in working on this issue. As for the process:

Please select one consistency check at a time and comment on this issue to claim it. Each consistency check should be a separate PR, and you claim them one by one: after you have opened a PR, you can claim a new one.

Thank you!

@StefanieSenger (Contributor)

I'm working on BaseBagging.

@glemaitre (Member, Author)

OK, @StefanieSenger found out that we are going to have trouble implementing those changes for all estimators. Let's close the issue for the moment. We will test a bit more to be sure that implementing those consistency checks will not lead to just copy-pasting the scikit-learn documentation into the test file via some regexes.

Sorry for the noise, but we are not ready yet ;) @StefanieSenger feel free to bring more details that can help us think about how to improve the current assert function.

@StefanieSenger (Contributor) commented Feb 19, 2025

(I was testing this under time pressure, because we have a first-time contributor sprint starting in a few hours and I needed to test this issue by then. It feels safer to put this issue on hold for now and re-open it when we know what we want. Surely @lucyleeow knows more about it, but I cannot reach her right now.)

In general, I think we need to work on defining the test file, or on defining a testing policy, to make clear what we expect from this issue.

I will describe the problem that I encountered when adding a test for the BaseBagging estimators like this:

    {
        "objects": [BaggingClassifier, BaggingRegressor, IsolationForest],
        "include_params": ["n_estimators" "max_samples", "max_features", "bootstrap", "n_jobs", "warm_start", "random_state", "verbose"],
        "exclude_params": None,
        "include_attrs": ["estimator_", "n_features_in_", "feature_names_in_", "estimators_", "estimators_samples_", "estimators_features_"],
        "exclude_attrs": None,
        "include_returns": False,
        "exclude_returns": None,
        "descr_regex_pattern": None,
    },

This way of testing leads to the problem that "n_estimators", "max_samples" and "bootstrap" should overall have the same documentation, but they have different defaults, so the added test fails (as it should). The desired workaround would be to add a new test for each of these separately using the "descr_regex_pattern" param. This would result in three additional testing dicts like the one above, each including a regex pattern, on top of the one testing all the other params and attributes, because for the moment, if I understand correctly, there is no way to pass a regex AND test several params/attributes at the same time.

A possible solution would be to allow a regex to be passed per param/attribute (maybe as a dict).
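
Purely as a hypothetical illustration of that idea (no such option exists in assert_docstring_consistency today), the per-parameter mapping might look like this:

    # Hypothetical API sketch only -- descr_regex_pattern accepting a dict that
    # maps each parameter/attribute name to its own regex, so several params can
    # be checked in one call while a few tolerate wording differences (e.g.
    # different defaults). This does NOT exist in scikit-learn; the patterns
    # below are placeholders, not the real docstring text.
    from sklearn.ensemble import BaggingClassifier, BaggingRegressor, IsolationForest
    from sklearn.utils._testing import assert_docstring_consistency  # path assumed

    assert_docstring_consistency(
        [BaggingClassifier, BaggingRegressor, IsolationForest],
        include_params=["n_estimators", "max_samples", "bootstrap", "n_jobs"],
        descr_regex_pattern={
            "n_estimators": r"The number of base estimators.*",
            "max_samples": r"The number of samples to draw.*",
            # names without an entry would fall back to exact matching
        },
    )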

We could also have a testing policy in place saying that we radically exclude params/attributes whose descriptions are too wordy to add tests for, and only use "descr_regex_pattern" in selected cases (that we pre-define beforehand).

@glemaitre was also concerned that by using "descr_regex_pattern" too much, we would basically re-write our whole documentation in the test file, and the file would become not very long, but very, very, very long.
