
Use the function check_scalar for parameters validation #21927

Closed
TheBicPen/scikit-learn
#4
@reshamas

Description

Background / Objective

Use the function check_scalar for parameter validation. For a given parameter, the function checks that it has an acceptable data type and that its value lies within the allowed range (interval).

A helper function exists in scikit-learn which validates a scalar value: sklearn.utils.check_scalar (see its documentation).
It is used to validate parameters of classes (and possibly functions). Most of the current classes in scikit-learn do not use this helper function. We want to refactor the code so that it does. Using this standard helper gives consistent error types and messages.

If there is a scalar argument that isn't being checked, we want to validate it using the check_scalar function. In some cases the argument is already being checked, but without check_scalar; in that case, we refactor the code to use it. (Refactoring means making changes to the code that result in the same output as before.)

The function check_scalar is defined in scikit-learn/sklearn/utils/validation.py.

Prerequisites

This is an Intermediate-level issue for second-time contributors. It requires the following experience:

  • You have already set up your working virtual environment.
  • You have submitted at least one other pull request to this library. (You are familiar with using git and submitting pull requests.)
  • You are familiar with the scikit-learn code base.
  • You have experience using pytest.
  • To find the range of possible values for an estimator's parameters, check whether validation code has already been written in the scikit-learn library; that information might be available there.
  • Sometimes validation code is not available in the scikit-learn library. It is helpful to be familiar with the acceptable range of values (minimum and maximum) for the arguments of the estimator you are working on. If you are not familiar with an estimator, you can reference sources outside of the scikit-learn documentation to get that information.

Steps

  • Make sure you have activated your virtual environment.
  • Make sure you have created a separate branch from main before editing files for your new contribution. Refer to our contributing guidelines for more information.
  • Find a class whose constructor has scalar numeric parameters. There are some listed below in the "Classes to Update" section.
  • Work on one estimator at a time and submit each in a separate pull request.
  • Identify the scalar numeric parameters (those of type int, float) for that class.
    • Examples of scalar parameters are: alpha, damping, max_iter, convergence_iter, tol, and verbose.
    • You can infer whether a parameter is a scalar by looking at the documentation.
    • Example PR: AffinityPropagation scalar parameters
  • For each of the scalar numeric parameters, determine the acceptable range of values. Look at minimum and maximum values. Sometimes that information is included in the parameter definition in the documentation. Sometimes you may need to reference other sources. If minimum and maximum values are missing, we should add them.
  • Add tests. Note: the tests must fail before adding validation. An example PR by @glemaitre added a parametrised test for parameters.
  • If any of the associated class attributes are scalar numeric but are not being checked with check_scalar, those can be validated as well.
  • Validation should happen inside the fit method: add check_scalar calls where needed. Generally, this is not done in the constructor but rather just before calling the core of the method. For instance, in the PR "MNT use check_scalar to validate scalar in AffinityPropagation" (#20723), @glemaitre added check_scalar calls just before the call to affinity_propagation, which is the core of the method.
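The placement described in the last step can be sketched with a toy class. `ToyEstimator`, its parameters, and their bounds are hypothetical and exist only to show that the constructor stores values while fit validates them; this is not code from any scikit-learn estimator.

```python
import numbers

from sklearn.base import BaseEstimator
from sklearn.utils import check_scalar


class ToyEstimator(BaseEstimator):
    """Hypothetical estimator used only to illustrate where validation goes."""

    def __init__(self, max_iter=100, tol=1e-4):
        # The constructor only stores the parameters; no validation here.
        self.max_iter = max_iter
        self.tol = tol

    def fit(self, X, y=None):
        # Validate scalar parameters just before the core of the method.
        check_scalar(
            self.max_iter, name="max_iter", target_type=numbers.Integral, min_val=1
        )
        check_scalar(self.tol, name="tol", target_type=numbers.Real, min_val=0.0)
        # ... the actual fitting logic would go here ...
        self.n_iter_ = 0
        return self


est = ToyEstimator(max_iter=0)  # constructing with a bad value does not raise
message = None
try:
    est.fit([[0.0], [1.0]])
except ValueError as exc:
    message = str(exc)  # the error only appears once fit is called
```

Keeping the constructor free of checks means an invalid value is reported when fit runs, which is where the steps above ask for the validation to live.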

Notes

  • The pull request can be named: "MAINT Use check_scalar to validate scalar in: [EstimatorName]"
  • Work on one estimator at a time and submit each in a separate pull request.
  • Within an estimator there may be multiple scalar arguments. (For one estimator, validation for all of its arguments should be submitted in one pull request.)
  • Include explicit parameter names (even if they are not required), as a best practice. In check_scalar, name can be passed positionally, so the keyword is not required. You should still include it in the function call for readability.
import numbers

from sklearn.utils import check_scalar

# Inside the estimator's fit method:
check_scalar(
    self.learning_rate,
    name="learning_rate",
    target_type=numbers.Real,
    min_val=0,
    max_val=None,  # default
    include_boundaries="both",  # default
)

Tests

Suggestion: You may want to write the test before writing the validation code. Writing the test first gives you an idea of where the existing validation is. If validation already exists, it will tell you the range of possible values, and the test lets you check for that.

Generally speaking, this is how to connect the .py file with its associated test. Check to see if the test exists in the test_*.py file. If it does not, we will need to create a test.

  • Where the class is: sklearn/cluster/_affinity_propagation.py
  • Where the related class test file is: sklearn/cluster/tests/test_affinity_propagation.py
  • The name of the test: def test_affinity_propagation_params_validation(....)

The point of a test is that if an incorrect parameter value is given, the program gives an error message. We want to test for values that are outside of the acceptable range. We want to make sure the program is catching that.
To run an individual validation test, here are examples of the code to run at the terminal:

  • pytest sklearn/cluster/tests/test_affinity_propagation.py::test_affinity_propagation_params_validation
  • pytest sklearn/linear_model/_glm/tests/test_glm.py::test_glm_max_iter_argument

Consistency Checks for Reviewers

  1. PR prefix should be MAINT (not MNT)
  2. check_scalar call should explicitly include name (Ex: name="n_estimators", not just "n_estimators")
  3. Interval ranges should use the text "must be" (not "should be")
  4. Ensure error messages in tests are present

Examples for Reference

Classes Updated

Classes to Update

  • sklearn/linear_model/_coordinate_descent.py (Lasso) (@ArturoAmorQ)
  • sklearn/linear_model/_stochastic_gradient.py (SGDClassifier) (@reshamas)
    - add valid intervals: #22115
  • sklearn/linear_model/_bayes (BayesianRidge) (@matiasrvazquez)
  • sklearn/linear_model/_bayes (ARDRegression) (@matiasrvazquez)
  • sklearn/ensemble/_stacking.py (StackingClassifier) (@genvalen)
  • sklearn/ensemble/_stacking.py (StackingRegressor) (@genvalen)

Metadata

    Labels

    Moderate (anything that requires some knowledge of conventions and best practices), help wanted
