Minimal Generalized linear models implementation (L2 + lbfgs) #14300
orthogonal matching pursuit can approximate the optimum solution vector with a
fixed number of non-zero elements:

.. math::
   \underset{w}{\operatorname{arg\,min\,}} ||y - Xw||_2^2 \text{ subject to } ||w||_0 \leq n_{\text{nonzero\_coefs}}

Alternatively, orthogonal matching pursuit can target a specific error instead
of a specific number of non-zero coefficients. This can be expressed as:

.. math::
   \underset{w}{\operatorname{arg\,min\,}} ||w||_0 \text{ subject to } ||y-Xw||_2^2 \leq \text{tol}

OMP is based on a greedy algorithm that includes at each step the atom most
It is possible to obtain the p-values and confidence intervals for
coefficients in cases of regression without penalization. The `statsmodels
package <https://pypi.org/project/statsmodels/>`_ natively supports this.
Within sklearn, one could use bootstrapping instead as well.

:class:`LogisticRegressionCV` implements Logistic Regression with built-in
.. [9] `"Performance Evaluation of Lbfgs vs other solvers"
   <http://www.fuzihao.org/blog/2016/01/16/Comparison-of-Gradient-Descent-Stochastic-Gradient-Descent-and-L-BFGS/>`_

.. _Generalized_linear_regression:

Generalized Linear Regression
=============================

Generalized Linear Models (GLM) extend linear models in two ways
[10]_. First, the predicted values :math:`\hat{y}` are linked to a linear
combination of the input variables :math:`X` via an inverse link function
:math:`h` as

.. math:: \hat{y}(w, X) = h(Xw).

Here, :math:`h` is the inverse of the link function :math:`g` of the
classical GLM literature, i.e. :math:`h = g^{-1}`.

Secondly, the squared loss function is replaced by the unit deviance
:math:`d` of a distribution in the exponential family (or more precisely, a
reproductive exponential dispersion model (EDM) [11]_).
The minimization problem becomes:

.. math:: \min_{w} \frac{1}{2 n_{\text{samples}}} \sum_i d(y_i, \hat{y}_i) + \frac{\alpha}{2} ||w||_2^2,

where :math:`\alpha` is the L2 regularization penalty. When sample weights are
provided, the average becomes a weighted average.

The following table lists some specific EDMs and their unit deviance (all of
these are instances of the Tweedie family):
================= =============================== ====================================================
Distribution      Target Domain                   Unit Deviance :math:`d(y, \hat{y})`
================= =============================== ====================================================
Normal            :math:`y \in (-\infty, \infty)` :math:`(y-\hat{y})^2`
Poisson           :math:`y \in [0, \infty)`       :math:`2(y\log\frac{y}{\hat{y}}-y+\hat{y})`
Gamma             :math:`y \in (0, \infty)`       :math:`2(\log\frac{\hat{y}}{y}+\frac{y}{\hat{y}}-1)`
Inverse Gaussian  :math:`y \in (0, \infty)`       :math:`\frac{(y-\hat{y})^2}{y\hat{y}^2}`
================= =============================== ====================================================

The Probability Density Functions (PDF) of these distributions are illustrated
in the following figure,

.. figure:: ./glm_data/poisson_gamma_tweedie_distributions.png
   :align: center
   :scale: 100%

   PDF of a random variable Y following Poisson, Tweedie (power=1.5) and Gamma
   distributions with different mean values (:math:`\mu`). Observe the point
   mass at :math:`Y=0` for the Poisson distribution and the Tweedie (power=1.5)
   distribution, but not for the Gamma distribution which has a strictly
   positive target domain.
The choice of the distribution depends on the problem at hand:

* If the target values :math:`y` are counts (non-negative integer valued) or
  relative frequencies (non-negative), you might use a Poisson deviance
  with log-link.
* If the target values are positive valued and skewed, you might try a
  Gamma deviance with log-link.
* If the target values seem to be heavier tailed than a Gamma distribution,
  you might try an Inverse Gaussian deviance (or even higher variance powers
  of the Tweedie family).
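A quick way to compare such candidate distributions on held-out data is to
evaluate their mean unit deviances directly. The following is an illustrative
sketch, assuming that :mod:`sklearn.metrics` exposes the
``mean_poisson_deviance`` and ``mean_gamma_deviance`` helpers; the data values
are made up for the example::

```python
# Illustrative sketch: evaluating the mean unit deviance of predictions under
# two candidate distributions (data values are made up for this example).
import numpy as np
from sklearn.metrics import mean_poisson_deviance, mean_gamma_deviance

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.5])

# Each deviance is non-negative and vanishes for a perfect fit; a lower
# value indicates a better fit under that distributional assumption.
poisson_dev = mean_poisson_deviance(y_true, y_pred)
gamma_dev = mean_gamma_deviance(y_true, y_pred)
print(poisson_dev, gamma_dev)
```

The same comparison generalizes to other Tweedie powers via
``mean_tweedie_deviance`` with an explicit ``power`` argument.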

Examples of use cases include:

* Agriculture / weather modeling: number of rain events per year (Poisson),
  amount of rainfall per event (Gamma), total rainfall per year (Tweedie /
  Compound Poisson Gamma).
* Risk modeling / insurance policy pricing: number of claim events /
  policyholder per year (Poisson), cost per event (Gamma), total cost per
  policyholder per year (Tweedie / Compound Poisson Gamma).
* Predictive maintenance: number of production interruption events per year
  (Poisson), duration of interruption (Gamma), total interruption time per
  year (Tweedie / Compound Poisson Gamma).

.. topic:: References:

   .. [10] McCullagh, Peter; Nelder, John (1989). Generalized Linear Models,
      Second Edition. Boca Raton: Chapman and Hall/CRC. ISBN 0-412-31760-5.

   .. [11] Jørgensen, B. (1992). The theory of exponential dispersion models
      and analysis of deviance. Monografias de matemática, no. 51. See also
      `Exponential dispersion model.
      <https://en.wikipedia.org/wiki/Exponential_dispersion_model>`_

Usage
-----

:class:`TweedieRegressor` implements a generalized linear model for the
Tweedie distribution, which allows modeling any of the above mentioned
distributions using the appropriate ``power`` parameter. In particular:

- ``power = 0``: Normal distribution. Specific estimators such as
  :class:`Ridge` and :class:`ElasticNet` are generally more appropriate in
  this case.
- ``power = 1``: Poisson distribution. :class:`PoissonRegressor` is exposed
  for convenience. However, it is strictly equivalent to
  `TweedieRegressor(power=1, link='log')`.
- ``power = 2``: Gamma distribution. :class:`GammaRegressor` is exposed for
  convenience. However, it is strictly equivalent to
  `TweedieRegressor(power=2, link='log')`.
- ``power = 3``: Inverse Gaussian distribution.

The link function is determined by the `link` parameter.

Usage example::

    >>> from sklearn.linear_model import TweedieRegressor
    >>> reg = TweedieRegressor(power=1, alpha=0.5, link='log')
    >>> reg.fit([[0, 0], [0, 1], [2, 2]], [0, 1, 2])
    TweedieRegressor(alpha=0.5, link='log', power=1)
    >>> reg.coef_
    array([0.2463..., 0.4337...])
    >>> reg.intercept_
    -0.7638...

.. topic:: Examples:

  * :ref:`sphx_glr_auto_examples_linear_model_plot_poisson_regression_non_normal_loss.py`
  * :ref:`sphx_glr_auto_examples_linear_model_plot_tweedie_regression_insurance_claims.py`
Practical considerations
------------------------

The feature matrix `X` should be standardized before fitting. This ensures
that the penalty treats features equally.
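A minimal sketch of this, assuming the standard :class:`Pipeline` and
:class:`StandardScaler` APIs (the data here is synthetic and only meant to
show features on very different scales)::

```python
# Minimal sketch: chain a scaler with the GLM so the L2 penalty sees
# comparably scaled features (synthetic data, illustrative only).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import PoissonRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 2)) * np.array([1.0, 1000.0])  # wildly different scales
y = rng.poisson(lam=2.0, size=100).astype(float)

model = make_pipeline(StandardScaler(), PoissonRegressor(alpha=1.0))
model.fit(X, y)
pred = model.predict(X[:3])  # positive predictions, thanks to the log link
print(pred)
```

Bundling the scaler in the pipeline also guarantees that the same scaling is
applied at prediction time.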

Since the linear predictor :math:`Xw` can be negative and the Poisson,
Gamma and Inverse Gaussian distributions don't support negative values, it
is necessary to apply an inverse link function that guarantees
non-negativeness. For example, with `link='log'` the inverse link function
becomes :math:`h(Xw)=\exp(Xw)`.

If you want to model a relative frequency, i.e. counts per exposure (time,
volume, ...), you can do so by using a Poisson distribution and passing
:math:`y=\frac{\mathrm{counts}}{\mathrm{exposure}}` as target values
together with :math:`\mathrm{exposure}` as sample weights. For a concrete
example see e.g.
:ref:`sphx_glr_auto_examples_linear_model_plot_tweedie_regression_insurance_claims.py`.
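A small hypothetical sketch of this pattern (the counts, exposures and
feature values below are invented purely for illustration)::

```python
# Hypothetical frequency-modeling sketch: fit on counts/exposure, with the
# exposure passed as sample weights (all values below are made up).
import numpy as np
from sklearn.linear_model import PoissonRegressor

counts = np.array([0.0, 1.0, 3.0, 2.0])    # observed event counts
exposure = np.array([0.5, 1.0, 2.0, 1.0])  # e.g. observed policy-years
X = np.array([[0.0], [1.0], [2.0], [1.5]])

glm = PoissonRegressor(alpha=1e-4)
# The target is the frequency counts/exposure; exposure enters as weight.
glm.fit(X, counts / exposure, sample_weight=exposure)
freq = glm.predict(X)  # predicted counts per unit exposure
print(freq)
```

To recover predicted counts, multiply the predicted frequencies by the
corresponding exposures.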

When performing cross-validation for the `power` parameter of
`TweedieRegressor`, it is advisable to specify an explicit `scoring` function,
because the default scorer :meth:`TweedieRegressor.score` is a function of
`power` itself.
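For instance, one could search over `power` with a fixed, power-independent
metric. This is only a sketch under assumed synthetic data and an illustrative
grid of `power` values::

```python
# Sketch: grid search over `power` with an explicit scorer so that all
# candidates are compared with the same metric (synthetic data).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import TweedieRegressor

rng = np.random.RandomState(42)
X = rng.uniform(size=(60, 2))
y = rng.gamma(shape=2.0, scale=1.0, size=60)  # strictly positive targets

search = GridSearchCV(
    TweedieRegressor(link='log', alpha=0.1),
    param_grid={'power': [1.0, 1.5, 2.0]},
    scoring='neg_mean_absolute_error',  # same metric for every power
    cv=3,
)
search.fit(X, y)
print(search.best_params_['power'])
```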

Stochastic Gradient Descent - SGD
=================================