MAINT Use check_scalar to validate scalar in: GeneralizedLinearRegressor #21946
Conversation
You should remove:
These tests will be covered by the more generic test functions that you created and parametrized.
The tol parametrization I added only tests for negative values. Should I also add in a test for string?
Yes, we should test for a type of data that should not be supported to ensure that we raise a TypeError.
OK, I think I will add it for tol.
@glemaitre Should I change my code so that it reads 'stopping criteria must be positive'? The existing test currently fails with:

tol = -1.0

    @pytest.mark.parametrize("tol", ["not a number", 0, -1.0, [1e-3]])
    def test_glm_tol_argument(tol):
        """Test GLM for invalid tol argument."""
        y = np.array([1, 2])
        X = np.array([[1], [2]])
        glm = GeneralizedLinearRegressor(tol=tol)
        with pytest.raises(ValueError, match="stopping criteria must be positive"):
>           glm.fit(X, y)
E           AssertionError: Regex pattern 'stopping criteria must be positive' does not match 'tol == -1.0, must be >= 0.'.
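For context, here is a minimal standalone sketch (mine, not part of the PR diff) of the message check_scalar produces for a negative tol, which is why the old 'stopping criteria must be positive' pattern no longer matches:

import numbers

from sklearn.utils import check_scalar

# With an inclusive lower bound of 0, check_scalar reports the offending
# value and the bound rather than a hand-written message.
try:
    check_scalar(-1.0, name="tol", target_type=numbers.Real, min_val=0.0)
except ValueError as exc:
    print(exc)  # tol == -1.0, must be >= 0.

One option is to update the test's match pattern to the check_scalar wording (or simply to the parameter name) instead of the old message.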
Thank you for the PR, @reshamas!
sklearn/linear_model/_glm/glm.py (Outdated)
@@ -72,6 +73,7 @@ class GeneralizedLinearRegressor(RegressorMixin, BaseEstimator):
        regularization strength. ``alpha = 0`` is equivalent to unpenalized
        GLMs. In this case, the design matrix `X` must have full column rank
        (no collinearities).
        Values should be >=0.
If we are going to show ranges, I think we should be consistent with #21955 (when that PR gets merged).
Co-authored-by: Thomas J. Fan <[email protected]>
@thomasjpfan @glemaitre Is the below correct?
Actually, we should. But it could be easier than that: I would expect all these models to have the same API, so @pytest.mark.parametrize("Estimator", [GeneralizedLinearModel, TweedieRegressor, ...]) should be enough.
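A rough sketch of what such a shared parametrized test could look like; the estimator list, test name, and parameter values are illustrative assumptions, not code from this PR:

import numpy as np
import pytest

from sklearn.linear_model import GammaRegressor, PoissonRegressor, TweedieRegressor


@pytest.mark.parametrize(
    "Estimator", [PoissonRegressor, GammaRegressor, TweedieRegressor]
)
@pytest.mark.parametrize("params", [{"alpha": -1.0}, {"max_iter": 0}, {"tol": -1.0}])
def test_glm_invalid_scalar_params(Estimator, params):
    # The GLM regressors share the same scalar hyperparameters, so a single
    # parametrized test can exercise the validation for all of them.
    X = np.array([[1.0], [2.0]])
    y = np.array([1.0, 2.0])
    with pytest.raises(ValueError):
        Estimator(**params).fit(X, y)

Grouping one invalid value per parameter keeps the test short while still covering every estimator in the list.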
Yes it would be better.
Let's first merge this PR. It has a reasonable size and it will be easier to review.
@glemaitre I think this PR is in a good place for review.
Thank you, @reshamas for this contribution.
Here are a few comments.
Co-authored-by: Julien Jerphanion <[email protected]>
Otherwise LGTM
sklearn/linear_model/_glm/glm.py (Outdated)

    warm_start : bool, default=False
        If set to ``True``, reuse the solution of the previous call to ``fit``
        as initialization for ``coef_`` and ``intercept_``.

    verbose : int, default=0
        For the lbfgs solver set verbose to any positive number for verbosity.
        Values must be in the range `[1, inf)`.
Suggested change:
-        Values must be in the range `[1, inf)`.
+        Values must be in the range `[0, inf)`.
name="alpha", | ||
target_type=numbers.Real, | ||
min_val=0.0, | ||
include_boundaries="left", |
In general, we are not consistent here, but since the default is "both", I think we can leave out "left":

Suggested change:
-    include_boundaries="left",

Also, this is consistent with the rest of this PR, which leaves out include_boundaries.
@thomasjpfan Not sure I understand why we are leaving it "both" when it's "left"?
With max_val set to None, "left" and "both" mean the same thing in terms of the upper bound, because the upper bound is not checked. For alpha, I think it is technically "both" because np.inf is a valid value for alpha in the GLMs:

from sklearn import linear_model
import numpy as np

clf = linear_model.PoissonRegressor(alpha=np.inf)
X = [[1, 2], [2, 3], [3, 4], [4, 3]]
y = [12, 17, 22, 21]

# works but with warnings
clf.fit(X, y)

Setting alpha=np.inf is strange, but it can be educational?

Edit: In other words, should we distinguish between [0.0, inf] and [0.0, inf)? Currently:

check_scalar(..., min_val=0.0, include_boundaries="both")
check_scalar(..., min_val=0.0, include_boundaries="left")

both mean [0.0, inf], where inf is included. Here is a snippet to showcase that np.inf passes both checks:

from sklearn.utils.validation import check_scalar
import numpy as np

check_scalar(np.inf, name="value", target_type=float, min_val=0.0, include_boundaries="both")
check_scalar(np.inf, name="value", target_type=float, min_val=0.0, include_boundaries="left")
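For what it's worth, a small follow-up sketch (my assumption about check_scalar's boundary semantics, not something proposed in this PR) showing that np.inf can be rejected by also passing max_val=np.inf with only the left boundary inclusive:

import numpy as np

from sklearn.utils.validation import check_scalar

# With max_val set and include_boundaries="left", the right boundary is
# excluded, so np.inf fails the upper-bound check.
try:
    check_scalar(
        np.inf,
        name="alpha",
        target_type=float,
        min_val=0.0,
        max_val=np.inf,
        include_boundaries="left",
    )
except ValueError as exc:
    print(exc)  # e.g. "alpha == inf, must be < inf."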
I would say that most of the time the docstring should note [0.0, inf) to mark inf as invalid. But there might be places where np.inf is a valid value for the parameter and has a special meaning, and then the docstring should document what it does.
BTW: for the Poisson regressor example with infinite alpha I get:
/Users/ogrisel/code/scikit-learn/sklearn/linear_model/_glm/glm.py:302: RuntimeWarning: invalid value encountered in multiply
coef_scaled = alpha * coef[offset:]
/Users/ogrisel/code/scikit-learn/sklearn/linear_model/_glm/glm.py:323: ConvergenceWarning: lbfgs failed to converge (status=2):
ABNORMAL_TERMINATION_IN_LNSRCH.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
self.n_iter_ = _check_optimize_result("lbfgs", opt_res)
so rejecting it seems a good idea :)
LGTM.
Reference Issues/PRs
References #21927
What does this implement/fix? Explain your changes.
In sklearn/linear_model/_glm/glm.py, GeneralizedLinearRegressor now uses check_scalar from sklearn.utils to validate the scalar parameters (sketched below):
- alpha: check_scalar
- max_iter: check_scalar
- tol: check_scalar
- verbose: check_scalar
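As a rough illustration of the shape of that validation, here is a hypothetical standalone helper (the exact bounds, boundary choices, and helper name are mine, not the PR's diff):

import numbers

from sklearn.utils import check_scalar


def validate_glm_scalar_params(alpha, max_iter, tol, verbose):
    # Hypothetical helper mirroring the kind of checks added inside fit().
    check_scalar(alpha, name="alpha", target_type=numbers.Real, min_val=0.0)
    check_scalar(max_iter, name="max_iter", target_type=numbers.Integral, min_val=1)
    check_scalar(tol, name="tol", target_type=numbers.Real, min_val=0.0)
    check_scalar(verbose, name="verbose", target_type=numbers.Integral, min_val=0)


# Example: a negative tol is reported in check_scalar's message format.
try:
    validate_glm_scalar_params(alpha=1.0, max_iter=100, tol=-1.0, verbose=0)
except ValueError as exc:
    print(exc)  # e.g. "tol == -1.0, must be >= 0."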
Any other comments?
#DataUmbrella #postsprint