
MAINT Use check_scalar to validate scalar in: GeneralizedLinearRegressor #21946


Merged: 31 commits, Jan 24, 2022

Commits:
fcc822c added check_scalar (reshamas, Dec 7, 2021)
f66cd3c Merge branch 'main' of github.com:scikit-learn/scikit-learn into cksc… (reshamas, Dec 10, 2021)
24391d1 added tests for max_iter (reshamas, Dec 10, 2021)
917b05b Merge branch 'main' of github.com:scikit-learn/scikit-learn into cksc… (reshamas, Dec 12, 2021)
e8318fc Merge branch 'main' of github.com:scikit-learn/scikit-learn into cksc… (reshamas, Dec 13, 2021)
4e9a84c added tests for alpha (reshamas, Dec 13, 2021)
cbafadf adding tests for tol (reshamas, Dec 13, 2021)
1e56683 added tests for verbose (reshamas, Dec 13, 2021)
a0c1055 remove extra checks for alpha and tol (reshamas, Dec 15, 2021)
d495edd Merge branch 'main' of github.com:scikit-learn/scikit-learn into cksc… (reshamas, Dec 15, 2021)
ca2b8ef remove default parameters in function call (reshamas, Dec 16, 2021)
0bafb3b remove default calls (reshamas, Dec 16, 2021)
9a58697 add range (reshamas, Dec 16, 2021)
9fb9ce8 Merge branch 'main' of github.com:scikit-learn/scikit-learn into cksc… (reshamas, Dec 21, 2021)
569f510 Merge branch 'main' of github.com:scikit-learn/scikit-learn into cksc… (reshamas, Dec 22, 2021)
4354567 fixing flake8 error (reshamas, Dec 22, 2021)
0c17768 comment out tol=1 check (reshamas, Dec 22, 2021)
db18608 Merge branch 'main' of github.com:scikit-learn/scikit-learn into cksc… (reshamas, Dec 23, 2021)
3ef2a20 add multiple estimators in parametrization (reshamas, Dec 23, 2021)
414d605 Merge branch 'main' of github.com:scikit-learn/scikit-learn into cksc… (reshamas, Jan 1, 2022)
0c62527 Update wording for interval range: "should be" to "must be" (reshamas, Jan 7, 2022)
97ac32e remove commented isinstance check (reshamas, Jan 7, 2022)
f9bc8a0 capitalize "estimator" to "Estimator" (reshamas, Jan 7, 2022)
22f4962 remove commented portion for checking tol (reshamas, Jan 7, 2022)
3f9b7a9 estimator should be "Estimator" (reshamas, Jan 7, 2022)
f94b36d estimator should be "Estimator" (reshamas, Jan 7, 2022)
2b765a5 for interval range, change from should to must (reshamas, Jan 7, 2022)
861b805 Merge branch 'ckscalar_glm' of https://github.com/reshamas/scikit-lea… (reshamas, Jan 7, 2022)
98b9484 Merge branch 'main' of github.com:scikit-learn/scikit-learn into cksc… (reshamas, Jan 7, 2022)
3478713 Merge branch 'main' of github.com:scikit-learn/scikit-learn into cksc… (reshamas, Jan 13, 2022)
9f04ec8 removing interval ranges; added to PR#22076 (reshamas, Jan 13, 2022)
44 changes: 27 additions & 17 deletions sklearn/linear_model/_glm/glm.py
@@ -13,6 +13,7 @@

from ...base import BaseEstimator, RegressorMixin
from ...utils.optimize import _check_optimize_result
from ...utils import check_scalar
from ...utils.validation import check_is_fitted, _check_sample_weight
from ..._loss.glm_distribution import (
ExponentialDispersionModel,
@@ -209,12 +210,13 @@ def fit(self, X, y, sample_weight=None):
"got (link={0})".format(self.link)
)

if not isinstance(self.alpha, numbers.Number) or self.alpha < 0:
raise ValueError(
"Penalty term must be a non-negative number; got (alpha={0})".format(
self.alpha
)
)
check_scalar(
self.alpha,
name="alpha",
target_type=numbers.Real,
min_val=0.0,
include_boundaries="left",
Member:

In general, we are not consistent here, but since the default is "both", I think we can leave out "left".

Suggested change: remove the line include_boundaries="left",

Also, this is consistent with the rest of this PR, which leaves out include_boundaries.

Member Author (@reshamas):

@thomasjpfan I'm not sure I understand why we are leaving it as "both" when it's "left"?

Member (@thomasjpfan, Jan 13, 2022):

With max_val set to None, "left" and "both" mean the same thing in terms of the upper bound, because the upper bound is not checked.

For alpha, I think it is technically "both", because np.inf is a valid value for alpha in the GLMs:

    from sklearn import linear_model
    import numpy as np

    clf = linear_model.PoissonRegressor(alpha=np.inf)
    X = [[1, 2], [2, 3], [3, 4], [4, 3]]
    y = [12, 17, 22, 21]

    # works, but with warnings
    clf.fit(X, y)

Setting alpha=np.inf is strange, but it can be educational?

Edit: In other words, should we distinguish between [0.0, inf] and [0.0, inf)? Currently:

  • check_scalar(..., min_val=0.0, include_boundaries="both")
  • check_scalar(..., min_val=0.0, include_boundaries="left")

both mean [0.0, inf], where inf is included. Here is a snippet showing that np.inf passes both checks:

    from sklearn.utils.validation import check_scalar
    import numpy as np

    check_scalar(np.inf, name="value", target_type=float, min_val=0.0, include_boundaries="both")
    check_scalar(np.inf, name="value", target_type=float, min_val=0.0, include_boundaries="left")

Member (@ogrisel, Jan 14, 2022):

I would say that most of the time the docstring should note [0.0, inf), excluding inf as invalid.

But there might be places where np.inf is a valid value for the parameter with a special meaning, and what it does should then be documented in the docstring.

Member:

BTW: for the Poisson regressor example with infinite alpha I get:

    /Users/ogrisel/code/scikit-learn/sklearn/linear_model/_glm/glm.py:302: RuntimeWarning: invalid value encountered in multiply
      coef_scaled = alpha * coef[offset:]
    /Users/ogrisel/code/scikit-learn/sklearn/linear_model/_glm/glm.py:323: ConvergenceWarning: lbfgs failed to converge (status=2):
    ABNORMAL_TERMINATION_IN_LNSRCH.

    Increase the number of iterations (max_iter) or scale the data as shown in:
        https://scikit-learn.org/stable/modules/preprocessing.html
      self.n_iter_ = _check_optimize_result("lbfgs", opt_res)

so rejecting it seems a good idea :)
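The boundary semantics debated in this thread can be illustrated without scikit-learn. The following is a minimal, hypothetical stdlib sketch of an interval check in the spirit of check_scalar (the helper name and exact messages are this sketch's own, not scikit-learn's implementation). It shows that with no upper bound, "left" and "both" are indistinguishable, and that rejecting inf requires an explicit, exclusive max_val:

```python
import math
import numbers

def check_scalar_sketch(x, name, target_type, min_val=None, max_val=None,
                        include_boundaries="both"):
    """Hypothetical sketch of an interval check; not scikit-learn's code."""
    if not isinstance(x, target_type):
        raise TypeError(
            f"{name} must be an instance of {target_type}, not {type(x)}."
        )
    lower_inclusive = include_boundaries in ("left", "both")
    upper_inclusive = include_boundaries in ("right", "both")
    if min_val is not None:
        too_low = x < min_val if lower_inclusive else x <= min_val
        if too_low:
            op = ">=" if lower_inclusive else ">"
            raise ValueError(f"{name} == {x}, must be {op} {min_val}.")
    if max_val is not None:
        too_high = x > max_val if upper_inclusive else x >= max_val
        if too_high:
            op = "<=" if upper_inclusive else "<"
            raise ValueError(f"{name} == {x}, must be {op} {max_val}.")
    return x

# With no max_val, "left" and "both" behave identically: inf passes both.
check_scalar_sketch(math.inf, "alpha", numbers.Real, min_val=0.0,
                    include_boundaries="both")
check_scalar_sketch(math.inf, "alpha", numbers.Real, min_val=0.0,
                    include_boundaries="left")

# Rejecting inf requires an explicit upper bound that excludes it.
try:
    check_scalar_sketch(math.inf, "alpha", numbers.Real, min_val=0.0,
                        max_val=math.inf, include_boundaries="left")
except ValueError as e:
    print(e)  # alpha == inf, must be < inf.
```

In other words, under this sketch's assumptions, distinguishing [0.0, inf) from [0.0, inf] is a question of the upper bound, not of include_boundaries on the left.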

)
if not isinstance(self.fit_intercept, bool):
raise ValueError(
"The argument fit_intercept must be bool; got {0}".format(
@@ -227,17 +229,25 @@ def fit(self, X, y, sample_weight=None):
"'lbfgs'; got {0}".format(self.solver)
)
solver = self.solver
if not isinstance(self.max_iter, numbers.Integral) or self.max_iter <= 0:
raise ValueError(
"Maximum number of iteration must be a positive "
"integer;"
" got (max_iter={0!r})".format(self.max_iter)
)
if not isinstance(self.tol, numbers.Number) or self.tol <= 0:
raise ValueError(
"Tolerance for stopping criteria must be "
"positive; got (tol={0!r})".format(self.tol)
)
check_scalar(
self.max_iter,
name="max_iter",
target_type=numbers.Integral,
min_val=1,
)
check_scalar(
self.tol,
name="tol",
target_type=numbers.Real,
min_val=0.0,
include_boundaries="neither",
)
check_scalar(
self.verbose,
name="verbose",
target_type=numbers.Integral,
min_val=0,
)
if not isinstance(self.warm_start, bool):
raise ValueError(
"The argument warm_start must be bool; got {0}".format(self.warm_start)
90 changes: 65 additions & 25 deletions sklearn/linear_model/_glm/tests/test_glm.py
@@ -110,16 +110,6 @@ def test_glm_link_auto(family, expected_link_class):
assert isinstance(glm._link_instance, expected_link_class)


@pytest.mark.parametrize("alpha", ["not a number", -4.2])
def test_glm_alpha_argument(alpha):
"""Test GLM for invalid alpha argument."""
y = np.array([1, 2])
X = np.array([[1], [2]])
glm = GeneralizedLinearRegressor(family="normal", alpha=alpha)
with pytest.raises(ValueError, match="Penalty term must be a non-negative"):
glm.fit(X, y)


@pytest.mark.parametrize("fit_intercept", ["not bool", 1, 0, [True]])
def test_glm_fit_intercept_argument(fit_intercept):
"""Test GLM for invalid fit_intercept argument."""
@@ -140,23 +130,73 @@ def test_glm_solver_argument(solver):
glm.fit(X, y)


@pytest.mark.parametrize("max_iter", ["not a number", 0, -1, 5.5, [1]])
def test_glm_max_iter_argument(max_iter):
"""Test GLM for invalid max_iter argument."""
y = np.array([1, 2])
X = np.array([[1], [2]])
glm = GeneralizedLinearRegressor(max_iter=max_iter)
with pytest.raises(ValueError, match="must be a positive integer"):
glm.fit(X, y)


@pytest.mark.parametrize("tol", ["not a number", 0, -1.0, [1e-3]])
def test_glm_tol_argument(tol):
"""Test GLM for invalid tol argument."""
@pytest.mark.parametrize(
"Estimator",
[GeneralizedLinearRegressor, PoissonRegressor, GammaRegressor, TweedieRegressor],
)
@pytest.mark.parametrize(
"params, err_type, err_msg",
[
({"max_iter": 0}, ValueError, "max_iter == 0, must be >= 1"),
({"max_iter": -1}, ValueError, "max_iter == -1, must be >= 1"),
(
{"max_iter": "not a number"},
TypeError,
"max_iter must be an instance of <class 'numbers.Integral'>, not <class"
" 'str'>",
),
(
{"max_iter": [1]},
TypeError,
"max_iter must be an instance of <class 'numbers.Integral'>,"
" not <class 'list'>",
),
(
{"max_iter": 5.5},
TypeError,
"max_iter must be an instance of <class 'numbers.Integral'>,"
" not <class 'float'>",
),
({"alpha": -1}, ValueError, "alpha == -1, must be >= 0.0"),
(
{"alpha": "1"},
TypeError,
"alpha must be an instance of <class 'numbers.Real'>, not <class 'str'>",
),
({"tol": -1.0}, ValueError, "tol == -1.0, must be > 0."),
({"tol": 0.0}, ValueError, "tol == 0.0, must be > 0.0"),
({"tol": 0}, ValueError, "tol == 0, must be > 0.0"),
(
{"tol": "1"},
TypeError,
"tol must be an instance of <class 'numbers.Real'>, not <class 'str'>",
),
(
{"tol": [1e-3]},
TypeError,
"tol must be an instance of <class 'numbers.Real'>, not <class 'list'>",
),
({"verbose": -1}, ValueError, "verbose == -1, must be >= 0."),
(
{"verbose": "1"},
TypeError,
"verbose must be an instance of <class 'numbers.Integral'>, not <class"
" 'str'>",
),
(
{"verbose": 1.0},
TypeError,
"verbose must be an instance of <class 'numbers.Integral'>, not <class"
" 'float'>",
),
],
)
def test_glm_scalar_argument(Estimator, params, err_type, err_msg):
"""Test GLM for invalid parameter arguments."""
y = np.array([1, 2])
X = np.array([[1], [2]])
glm = GeneralizedLinearRegressor(tol=tol)
with pytest.raises(ValueError, match="stopping criteria must be positive"):
glm = Estimator(**params)
with pytest.raises(err_type, match=err_msg):
glm.fit(X, y)
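A note on the stacked parametrize decorators in the test above: pytest expands them as a cross product, so every estimator is run against every invalid-parameter case. A stdlib sketch of that expansion, with the case list abbreviated to four of the fifteen cases in the actual test:

```python
from itertools import product

# Estimator names copied from the test's first parametrize decorator;
# the case list below is an abbreviated, illustrative subset.
estimators = [
    "GeneralizedLinearRegressor",
    "PoissonRegressor",
    "GammaRegressor",
    "TweedieRegressor",
]
cases = [{"max_iter": 0}, {"max_iter": -1}, {"alpha": -1}, {"tol": 0.0}]

# Stacked @pytest.mark.parametrize decorators yield one test item per
# (Estimator, case) pair, i.e. the Cartesian product of the two lists.
test_items = list(product(estimators, cases))
print(len(test_items))  # 16 (4 estimators x 4 abbreviated cases)
```

With the full list of fifteen cases, the real test therefore collects 60 items, one per estimator/case pair.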

