
MAINT Use check_scalar to validate scalar in: GeneralizedLinearRegressor #21946


Merged
31 commits merged into scikit-learn:main on Jan 24, 2022

Conversation

reshamas (Member)

Reference Issues/PRs

References #21927

What does this implement/fix? Explain your changes.

  • File: sklearn/linear_model/_glm/glm.py
  • Class: GeneralizedLinearRegressor
  • Identify the parameters that are scalars.
  • Add tests to ensure appropriate behavior when invalid arguments are passed in.
  • Use the helper function check_scalar from sklearn.utils to validate the scalar parameters (a usage sketch follows the parameter list below).

Parameters:

  • alpha
    • Add tests
    • Use check_scalar
  • max_iter
    • Add tests
    • Use check_scalar
  • tol
    • Add tests
    • Use check_scalar
  • verbose
    • Add tests
    • Use check_scalar
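
For reference, a minimal sketch of how check_scalar validates one of these parameters; this is a standalone illustration, not this PR's exact diff:

import numbers

from sklearn.utils import check_scalar

# Accepts any real number >= 0; check_scalar raises a TypeError for a
# non-numeric value and a ValueError for a negative one.
check_scalar(0.5, name="alpha", target_type=numbers.Real, min_val=0.0)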

Any other comments?

#DataUmbrella #postsprint

@reshamas reshamas changed the title MNT Use check_scalar to validate scalar in: GeneralizedLinearRegressor MAINT Use check_scalar to validate scalar in: GeneralizedLinearRegressor Dec 10, 2021
glemaitre (Member)

You should remove:

  • test_glm_tol_argument
  • test_glm_alpha_argument

These tests will be covered by the more generic test functions that you created and parametrized.
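
For context, a sketch of what such a generic parametrized test can look like; the parameter values, error-message fragments, and import path below are illustrative assumptions, not the PR's exact code:

import numpy as np
import pytest

from sklearn.linear_model._glm import GeneralizedLinearRegressor

@pytest.mark.parametrize(
    "params, err_msg",
    [
        ({"alpha": -1.0}, "alpha == -1.0, must be >= 0"),
        ({"tol": -1.0}, "tol == -1.0, must be >= 0"),
        ({"max_iter": 0}, "max_iter == 0, must be >= 1"),
    ],
)
def test_glm_scalar_arguments(params, err_msg):
    # Each invalid scalar should be rejected by check_scalar at fit time.
    X = np.array([[1], [2]])
    y = np.array([1, 2])
    with pytest.raises(ValueError, match=err_msg):
        GeneralizedLinearRegressor(**params).fit(X, y)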

@glemaitre glemaitre removed their request for review December 15, 2021 14:58
reshamas (Member Author)

@glemaitre

You should remove:

  • test_glm_tol_argument
  • test_glm_alpha_argument

These tests will be covered by the more generic test functions that you created and parametrized.

The tol parametrization I added only tests for negative values. Should I also add a test for a string?
Same for alpha: should I add a test that checks for a string?

glemaitre (Member) commented Dec 15, 2021

The tol parametrization I added only tests for negative values. Should I also add a test for a string?
Same for alpha: should I add a test that checks for a string?

Yes, we should test with a type of data that is not supported to ensure that we raise a TypeError.

reshamas (Member Author)

The tol parametrization I added only tests for negative values. Should I also add a test for a string?
Same for alpha: should I add a test that checks for a string?

Yes, we should test with a type of data that is not supported to ensure that we raise a TypeError.

OK, I think I will add it for verbose as well.
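
A sketch of such a string case; the error-message fragment and import path are assumptions about check_scalar's TypeError message, not quoted from the PR:

import numpy as np
import pytest

from sklearn.linear_model._glm import GeneralizedLinearRegressor

@pytest.mark.parametrize("param", ["alpha", "tol", "verbose"])
def test_glm_string_argument(param):
    # A string where a number is expected should raise a TypeError.
    X = np.array([[1], [2]])
    y = np.array([1, 2])
    glm = GeneralizedLinearRegressor(**{param: "not a number"})
    with pytest.raises(TypeError, match=f"{param} must be an instance of"):
        glm.fit(X, y)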

reshamas (Member Author)

@glemaitre
I see this error, but I am not sure how to fix it.

Should I change my code so that the error message reads 'stopping criteria must be positive'?

tol = -1.0

    @pytest.mark.parametrize("tol", ["not a number", 0, -1.0, [1e-3]])
    def test_glm_tol_argument(tol):
        """Test GLM for invalid tol argument."""
        y = np.array([1, 2])
        X = np.array([[1], [2]])
        glm = GeneralizedLinearRegressor(tol=tol)
        with pytest.raises(ValueError, match="stopping criteria must be positive"):
>           glm.fit(X, y)
E           AssertionError: Regex pattern 'stopping criteria must be positive' does not match 'tol == -1.0, must be >= 0.'.

/h
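
One possible fix, sketched under the assumption that the test should expect check_scalar's own message rather than the old one, is to parametrize the expected exception type and pattern per input:

import numpy as np
import pytest

from sklearn.linear_model._glm import GeneralizedLinearRegressor

@pytest.mark.parametrize(
    "tol, err_type, err_msg",
    [
        # check_scalar raises TypeError for unsupported types ...
        ("not a number", TypeError, "tol must be an instance of"),
        ([1e-3], TypeError, "tol must be an instance of"),
        # ... and ValueError for out-of-range values (message shown above).
        (-1.0, ValueError, "tol == -1.0, must be >= 0"),
    ],
)
def test_glm_tol_argument(tol, err_type, err_msg):
    """Test GLM for an invalid tol argument."""
    X = np.array([[1], [2]])
    y = np.array([1, 2])
    glm = GeneralizedLinearRegressor(tol=tol)
    with pytest.raises(err_type, match=err_msg):
        glm.fit(X, y)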

thomasjpfan (Member) left a comment

Thank you for the PR, @reshamas!

@@ -72,6 +73,7 @@ class GeneralizedLinearRegressor(RegressorMixin, BaseEstimator):
regularization strength. ``alpha = 0`` is equivalent to unpenalized
GLMs. In this case, the design matrix `X` must have full column rank
(no collinearities).
Values should be >=0.
Review comment (Member):

If we are going to show ranges, I think we should be consistent with #21955 (when that PR gets merged).

reshamas (Member Author) commented Dec 20, 2021

Reminders:

  • tol: test for 0 (boundary) and test for a list item
  • remove blank space before internal range values

reshamas (Member Author)

@thomasjpfan @glemaitre Is the following correct?

  • I don't need to add any tests for the other 3 classes in this file (PoissonRegressor, GammaRegressor, TweedieRegressor).
  • I do need to add the valid intervals for the parameters.
  • I can probably add the valid intervals for the scalar parameters for the other 3 classes to this PR.

glemaitre (Member)

I don't need to add any tests for the other 3 classes in this file (PoissonRegressor, GammaRegressor, TweedieRegressor).

Actually, we should. But it should be easy because I would expect all these models to have the same API as GeneralizedLinearRegressor, meaning that adding

@pytest.mark.parametrize("Estimator", [GeneralizedLinearRegressor, TweedieRegressor, ...])

should be enough.
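
Concretely, a runnable sketch of that parametrization over the four GLM classes discussed in this thread; the test body and expected message are assumptions for illustration:

import numpy as np
import pytest

from sklearn.linear_model import GammaRegressor, PoissonRegressor, TweedieRegressor
from sklearn.linear_model._glm import GeneralizedLinearRegressor

@pytest.mark.parametrize(
    "Estimator",
    [GeneralizedLinearRegressor, PoissonRegressor, GammaRegressor, TweedieRegressor],
)
def test_glm_alpha_argument_all_estimators(Estimator):
    # Every GLM shares the alpha parameter, so one test covers all of them.
    X = np.array([[1], [2]])
    y = np.array([1, 2])
    with pytest.raises(ValueError, match="alpha == -1.0, must be >= 0"):
        Estimator(alpha=-1.0).fit(X, y)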

I do need to add the valid intervals for the parameters.

Yes, it would be better.

I can probably add the valid intervals for the scalar parameters for the other 3 classes to this PR.

Let's first merge this PR. It has a reasonable size and it will be easier to review.

reshamas (Member Author) commented Jan 5, 2022

@glemaitre I think this PR is in a good place for review.
I added the intervals in #22076.

jjerphan (Member) left a comment

Thank you, @reshamas, for this contribution.

Here are a few comments.

glemaitre (Member) left a comment

Otherwise LGTM


warm_start : bool, default=False
If set to ``True``, reuse the solution of the previous call to ``fit``
as initialization for ``coef_`` and ``intercept_``.

verbose : int, default=0
For the lbfgs solver set verbose to any positive number for verbosity.
Values must be in the range `[1, inf)`.
Review comment (Member):

Suggested change:
- Values must be in the range `[1, inf)`.
+ Values must be in the range `[0, inf)`.

name="alpha",
target_type=numbers.Real,
min_val=0.0,
include_boundaries="left",
thomasjpfan (Member):

In general, we are not consistent here, but since the default is "both", I think we can leave out "left":

Suggested change:
- include_boundaries="left",

Also, this is consistent with the rest of this PR, which leaves out include_boundaries.

reshamas (Member Author):

@thomasjpfan I am not sure I understand why we are leaving it as "both" when it is "left"?

thomasjpfan (Member) commented Jan 13, 2022

With max_val set to None, "left" and "both" mean the same thing in terms of the upper bound because the upper bound is not checked.

For alpha, I think it is technically "both" because np.inf is a valid value for alpha in the GLMs:

from sklearn import linear_model
import numpy as np

clf = linear_model.PoissonRegressor(alpha=np.inf)
X = [[1, 2], [2, 3], [3, 4], [4, 3]]
y = [12, 17, 22, 21]

# works but with warnings
clf.fit(X, y)

Setting alpha=np.inf is strange, but it can be educational?

Edit: In other words, should we distinguish between [0.0, inf] and [0.0, inf)? Currently:

  • check_scalar(..., min_val=0.0, include_boundaries="both")
  • check_scalar(..., min_val=0.0, include_boundaries="left")

both mean [0.0, inf], where inf is included. Here is a snippet to show that np.inf passes both checks:

from sklearn.utils.validation import check_scalar
import numpy as np

check_scalar(np.inf, name="value", target_type=float, min_val=0.0, include_boundaries="both")
check_scalar(np.inf, name="value", target_type=float, min_val=0.0, include_boundaries="left")
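
If inf should be rejected, one option (a sketch, assuming the boundary semantics described above) is to pass an explicit max_val and exclude the right boundary:

import numpy as np

from sklearn.utils.validation import check_scalar

try:
    # With include_boundaries="left", the upper bound check is exclusive:
    # np.inf >= max_val fails, so inf is rejected while any finite value
    # >= 0.0 still passes.
    check_scalar(
        np.inf,
        name="value",
        target_type=float,
        min_val=0.0,
        max_val=np.inf,
        include_boundaries="left",
    )
except ValueError as e:
    print(e)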

ogrisel (Member) commented Jan 14, 2022

I would say that most of the time we should note [0.0, inf) in the docstring to exclude inf as invalid.

But there might be places where np.inf is a valid value for the parameter with a special meaning, and in that case the docstring should document what it does.

ogrisel (Member):

BTW: for the Poisson regressor example with infinite alpha I get:

/Users/ogrisel/code/scikit-learn/sklearn/linear_model/_glm/glm.py:302: RuntimeWarning: invalid value encountered in multiply
  coef_scaled = alpha * coef[offset:]
/Users/ogrisel/code/scikit-learn/sklearn/linear_model/_glm/glm.py:323: ConvergenceWarning: lbfgs failed to converge (status=2):
ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res)

so rejecting it seems a good idea :)

jjerphan (Member) left a comment

LGTM.

@glemaitre glemaitre merged commit b361f37 into scikit-learn:main Jan 24, 2022