diff --git a/doc/modules/linear_model.rst b/doc/modules/linear_model.rst
index 18e1bf468dc62..7243990bb5ffe 100644
--- a/doc/modules/linear_model.rst
+++ b/doc/modules/linear_model.rst
@@ -298,6 +298,7 @@ features, it is often faster than :class:`LassoCV`.
 
 .. centered:: |lasso_cv_1| |lasso_cv_2|
 
+.. _lasso_lars_ic:
 
 Information-criteria based model selection
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -306,22 +307,92 @@ Alternatively, the estimator :class:`LassoLarsIC` proposes to use the
 Akaike information criterion (AIC) and the Bayes Information criterion (BIC).
 It is a computationally cheaper alternative to find the optimal value of alpha
 as the regularization path is computed only once instead of k+1 times
-when using k-fold cross-validation. However, such criteria needs a
-proper estimation of the degrees of freedom of the solution, are
-derived for large samples (asymptotic results) and assume the model
-is correct, i.e. that the data are generated by this model.
-They also tend to break when the problem is badly conditioned
-(more features than samples).
-
-.. figure:: ../auto_examples/linear_model/images/sphx_glr_plot_lasso_model_selection_001.png
-   :target: ../auto_examples/linear_model/plot_lasso_model_selection.html
+when using k-fold cross-validation.
+
+Indeed, these criteria are computed on the in-sample training set. In short,
+they penalize the over-optimistic scores of the different Lasso models by
+their flexibility (cf. the "Mathematical details" section below).
+
+However, such criteria need a proper estimation of the degrees of freedom of
+the solution, are derived for large samples (asymptotic results) and assume
+the correct model is among the candidates under investigation. They also tend
+to break when the problem is badly conditioned (e.g. more features than samples).
+
+.. figure:: ../auto_examples/linear_model/images/sphx_glr_plot_lasso_lars_ic_001.png
+   :target: ../auto_examples/linear_model/plot_lasso_lars_ic.html
    :align: center
    :scale: 50%
+
+.. _aic_bic:
+
+**Mathematical details**
+
+The definition of AIC (and thus BIC) might differ in the literature. In this
+section, we give more information regarding the criterion computed in
+scikit-learn. The AIC criterion is defined as:
+
+.. math::
+    AIC = -2 \log(\hat{L}) + 2 d
+
+where :math:`\hat{L}` is the maximum likelihood of the model and
+:math:`d` is the number of parameters (also referred to as degrees of
+freedom in the previous section).
+
+The definition of BIC replaces the constant :math:`2` by :math:`\log(N)`:
+
+.. math::
+    BIC = -2 \log(\hat{L}) + \log(N) d
+
+where :math:`N` is the number of samples.
+
+For a linear Gaussian model, the maximum log-likelihood is defined as:
+
+.. math::
+    \log(\hat{L}) = - \frac{n}{2} \log(2 \pi) - \frac{n}{2} \log(\sigma^2) - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{2\sigma^2}
+
+where :math:`\sigma^2` is an estimate of the noise variance,
+:math:`y_i` and :math:`\hat{y}_i` are respectively the true and predicted
+targets, and :math:`n` is the number of samples.
+
+Plugging the maximum log-likelihood in the AIC formula yields:
+
+.. math::
+    AIC = n \log(2 \pi \sigma^2) + \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sigma^2} + 2 d
+
+The first term of the above expression is sometimes discarded since it is a
+constant when :math:`\sigma^2` is provided. In addition,
+it is sometimes stated that the AIC is equivalent to the :math:`C_p` statistic
+[12]_. In a strict sense, however, it is equivalent only up to some constant
+and a multiplicative factor.
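+
+As an illustration only (this is not the code used internally by
+:class:`LassoLarsIC`, and the helper name below is purely illustrative), the
+two expressions above can be evaluated directly with NumPy, assuming that the
+residuals, an estimate of the noise variance and the number of non-zero
+coefficients (playing the role of :math:`d`) are already available::
+
+    import numpy as np
+
+    def aic_bic(y_true, y_pred, noise_variance, n_nonzero_coefs):
+        """Sketch of the AIC/BIC formulas above (illustrative only)."""
+        n = y_true.shape[0]
+        rss = np.sum((y_true - y_pred) ** 2)
+        # Term coming from the Gaussian maximum log-likelihood, i.e. -2 log(L).
+        log_lik = n * np.log(2 * np.pi * noise_variance)
+        log_lik += rss / noise_variance
+        aic = log_lik + 2 * n_nonzero_coefs
+        # BIC replaces the constant 2 by log(n).
+        bic = log_lik + np.log(n) * n_nonzero_coefs
+        return aic, bic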
+ +At last, we mentioned above that :math:`\sigma^2` is an estimate of the +noise variance. In :class:`LassoLarsIC` when the parameter `noise_variance` is +not provided (default), the noise variance is estimated via the unbiased +estimator [13]_ defined as: + +.. math:: + \sigma^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - p} + +where :math:`p` is the number of features and :math:`\hat{y}_i` is the +predicted target using an ordinary least squares regression. Note, that this +formula is valid only when `n_samples > n_features`. .. topic:: Examples: * :ref:`sphx_glr_auto_examples_linear_model_plot_lasso_model_selection.py` + * :ref:`sphx_glr_auto_examples_linear_model_plot_lasso_lars_ic.py` + +.. topic:: References + + .. [12] :arxiv:`Zou, Hui, Trevor Hastie, and Robert Tibshirani. + "On the degrees of freedom of the lasso." + The Annals of Statistics 35.5 (2007): 2173-2192. + <0712.0881.pdf>` + + .. [13] `Cherkassky, Vladimir, and Yunqian Ma. + "Comparison of model selection for regression." + Neural computation 15.7 (2003): 1691-1714. + `_ Comparison with the regularization parameter of SVM ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -934,8 +1005,8 @@ to warm-starting (see :term:`Glossary `). .. [6] Mark Schmidt, Nicolas Le Roux, and Francis Bach: `Minimizing Finite Sums with the Stochastic Average Gradient. `_ - .. [7] Aaron Defazio, Francis Bach, Simon Lacoste-Julien: - :arxiv:`SAGA: A Fast Incremental Gradient Method With Support for + .. [7] Aaron Defazio, Francis Bach, Simon Lacoste-Julien: + :arxiv:`SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. <1407.0202>` .. [8] https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm diff --git a/doc/whats_new/v1.1.rst b/doc/whats_new/v1.1.rst index cc607aedb3946..94fc0ca160396 100644 --- a/doc/whats_new/v1.1.rst +++ b/doc/whats_new/v1.1.rst @@ -141,6 +141,21 @@ Changelog multilabel classification. :pr:`19689` by :user:`Guillaume Lemaitre `. +:mod:`sklearn.linear_model` +........................... + +- |API| :class:`linear_model.LassoLarsIC` now exposes `noise_variance` as + a parameter in order to provide an estimate of the noise variance. + This is particularly relevant when `n_features > n_samples` and the + estimator of the noise variance cannot be computed. + :pr:`21481` by :user:`Guillaume Lemaitre ` + +- |Fix| :class:`linear_model.LassoLarsIC` now correctly computes AIC + and BIC. An error is now raised when `n_features > n_samples` and + when the noise variance is not provided. + :pr:`21481` by :user:`Guillaume Lemaitre ` and + :user:`Andrés Babino `. + :mod:`sklearn.metrics` ...................... diff --git a/examples/linear_model/plot_lasso_lars_ic.py b/examples/linear_model/plot_lasso_lars_ic.py new file mode 100644 index 0000000000000..2f5392696ecc9 --- /dev/null +++ b/examples/linear_model/plot_lasso_lars_ic.py @@ -0,0 +1,123 @@ +""" +============================================== +Lasso model selection via information criteria +============================================== + +This example reproduces the example of Fig. 2 of [ZHT2007]_. A +:class:`~sklearn.linear_model.LassoLarsIC` estimator is fit on a +diabetes dataset and the AIC and the BIC criteria are used to select +the best model. + +.. note:: + It is important to note that the optimization to find `alpha` with + :class:`~sklearn.linear_model.LassoLarsIC` relies on the AIC or BIC + criteria that are computed in-sample, thus on the training set directly. 
+    This approach differs from the cross-validation procedure. For a comparison
+    of the two approaches, you can refer to the following example:
+    :ref:`sphx_glr_auto_examples_linear_model_plot_lasso_model_selection.py`.
+
+.. topic:: References
+
+    .. [ZHT2007] :arxiv:`Zou, Hui, Trevor Hastie, and Robert Tibshirani.
+        "On the degrees of freedom of the lasso."
+        The Annals of Statistics 35.5 (2007): 2173-2192.
+        <0712.0881>`
+"""

+
+# Author: Alexandre Gramfort
+#         Guillaume Lemaitre
+# License: BSD 3 clause

+
+# %%
+import sklearn
+
+sklearn.set_config(display="diagram")
+
+# %%
+# We will use the diabetes dataset.
+from sklearn.datasets import load_diabetes
+
+X, y = load_diabetes(return_X_y=True, as_frame=True)
+n_samples = X.shape[0]
+X.head()
+
+# %%
+# Scikit-learn provides an estimator called
+# :class:`~sklearn.linear_model.LassoLarsIC` that uses either Akaike's
+# information criterion (AIC) or the Bayesian information criterion (BIC) to
+# select the best model. Before fitting
+# this model, we will scale the dataset.
+#
+# In the following, we are going to fit two models to compare the values
+# reported by AIC and BIC.
+from sklearn.preprocessing import StandardScaler
+from sklearn.linear_model import LassoLarsIC
+from sklearn.pipeline import make_pipeline
+
+lasso_lars_ic = make_pipeline(
+    StandardScaler(), LassoLarsIC(criterion="aic", normalize=False)
+).fit(X, y)
+
+
+# %%
+# To be in line with the definition in [ZHT2007]_, we need to rescale the
+# AIC and the BIC. Indeed, Zou et al. are ignoring some constant terms
+# compared to the original definition of AIC derived from the maximum
+# log-likelihood of a linear model. You can refer to the
+# :ref:`mathematical details section of the User Guide <aic_bic>`.
+def zou_et_al_criterion_rescaling(criterion, n_samples, noise_variance):
+    """Rescale the information criterion to follow the definition of Zou et al."""
+    return criterion - n_samples * np.log(2 * np.pi * noise_variance) - n_samples
+
+
+# %%
+import numpy as np
+
+aic_criterion = zou_et_al_criterion_rescaling(
+    lasso_lars_ic[-1].criterion_,
+    n_samples,
+    lasso_lars_ic[-1].noise_variance_,
+)
+
+index_alpha_path_aic = np.flatnonzero(
+    lasso_lars_ic[-1].alphas_ == lasso_lars_ic[-1].alpha_
+)[0]
+
+# %%
+lasso_lars_ic.set_params(lassolarsic__criterion="bic").fit(X, y)
+
+bic_criterion = zou_et_al_criterion_rescaling(
+    lasso_lars_ic[-1].criterion_,
+    n_samples,
+    lasso_lars_ic[-1].noise_variance_,
+)
+
+index_alpha_path_bic = np.flatnonzero(
+    lasso_lars_ic[-1].alphas_ == lasso_lars_ic[-1].alpha_
+)[0]
+
+# %%
+# Now that we collected the AIC and BIC, we can also check that the minima
+# of both criteria happen at the same alpha, which allows us to simplify the
+# following plot.
+index_alpha_path_aic == index_alpha_path_bic
+
+# %%
+# Finally, we can plot the AIC and BIC criterion and the subsequently selected
+# regularization parameter.
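+# In the plot below, lower values of the (rescaled) criterion are better, and
+# the vertical dashed line marks the index of the alpha selected along the
+# Lasso path.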
+import matplotlib.pyplot as plt + +plt.plot(aic_criterion, color="tab:blue", marker="o", label="AIC criterion") +plt.plot(bic_criterion, color="tab:orange", marker="o", label="BIC criterion") +plt.vlines( + index_alpha_path_bic, + aic_criterion.min(), + aic_criterion.max(), + color="black", + linestyle="--", + label="Selected alpha", +) +plt.legend() +plt.ylabel("Information criterion") +plt.xlabel("Lasso model sequence") +_ = plt.title("Lasso model selection via AIC and BIC") diff --git a/examples/linear_model/plot_lasso_model_selection.py b/examples/linear_model/plot_lasso_model_selection.py index b2792c92f15bd..7cc05055b22d9 100644 --- a/examples/linear_model/plot_lasso_model_selection.py +++ b/examples/linear_model/plot_lasso_model_selection.py @@ -1,171 +1,258 @@ """ -=================================================== -Lasso model selection: Cross-Validation / AIC / BIC -=================================================== - -Use the Akaike information criterion (AIC), the Bayes Information -criterion (BIC) and cross-validation to select an optimal value -of the regularization parameter alpha of the :ref:`lasso` estimator. - -Results obtained with LassoLarsIC are based on AIC/BIC criteria. - -Information-criterion based model selection is very fast, but it -relies on a proper estimation of degrees of freedom, are -derived for large samples (asymptotic results) and assume the model -is correct, i.e. that the data are actually generated by this model. -They also tend to break when the problem is badly conditioned -(more features than samples). - -For cross-validation, we use 20-fold with 2 algorithms to compute the -Lasso path: coordinate descent, as implemented by the LassoCV class, and -Lars (least angle regression) as implemented by the LassoLarsCV class. -Both algorithms give roughly the same results. They differ with regards -to their execution speed and sources of numerical errors. - -Lars computes a path solution only for each kink in the path. As a -result, it is very efficient when there are only of few kinks, which is -the case if there are few features or samples. Also, it is able to -compute the full path without setting any meta parameter. On the -opposite, coordinate descent compute the path points on a pre-specified -grid (here we use the default). Thus it is more efficient if the number -of grid points is smaller than the number of kinks in the path. Such a -strategy can be interesting if the number of features is really large -and there are enough samples to select a large amount. In terms of -numerical errors, for heavily correlated variables, Lars will accumulate -more errors, while the coordinate descent algorithm will only sample the -path on a grid. - -Note how the optimal value of alpha varies for each fold. This -illustrates why nested-cross validation is necessary when trying to -evaluate the performance of a method for which a parameter is chosen by -cross-validation: this choice of parameter may not be optimal for unseen -data. +================================================= +Lasso model selection: AIC-BIC / cross-validation +================================================= +This example focuses on model selection for Lasso models that are +linear models with an L1 penalty for regression problems. + +Indeed, several strategies can be used to select the value of the +regularization parameter: via cross-validation or using an information +criterion, namely AIC or BIC. + +In what follows, we will discuss in details the different strategies. 
""" -# Author: Olivier Grisel, Gael Varoquaux, Alexandre Gramfort +# Author: Olivier Grisel +# Gael Varoquaux +# Alexandre Gramfort +# Guillaume Lemaitre # License: BSD 3 clause -import time +# %% +import sklearn -import numpy as np -import matplotlib.pyplot as plt +sklearn.set_config(display="diagram") -from sklearn.linear_model import LassoCV, LassoLarsCV, LassoLarsIC -from sklearn import datasets +# %% +# Dataset +# ------- +# In this example, we will use the diabetes dataset. +from sklearn.datasets import load_diabetes -# This is to avoid division by zero while doing np.log10 -EPSILON = 1e-4 +X, y = load_diabetes(return_X_y=True, as_frame=True) +X.head() -X, y = datasets.load_diabetes(return_X_y=True) +# %% +# In addition, we add some random features to the original data to +# better illustrate the feature selection performed by the Lasso model. +import numpy as np +import pandas as pd rng = np.random.RandomState(42) -X = np.c_[X, rng.randn(X.shape[0], 14)] # add some bad features - -# normalize data as done by Lars to allow for comparison -X /= np.sqrt(np.sum(X ** 2, axis=0)) - -# ############################################################################# -# LassoLarsIC: least angle regression with BIC/AIC criterion - -model_bic = LassoLarsIC(criterion="bic", normalize=False) -t1 = time.time() -model_bic.fit(X, y) -t_bic = time.time() - t1 -alpha_bic_ = model_bic.alpha_ - -model_aic = LassoLarsIC(criterion="aic", normalize=False) -model_aic.fit(X, y) -alpha_aic_ = model_aic.alpha_ - - -def plot_ic_criterion(model, name, color): - criterion_ = model.criterion_ - plt.semilogx( - model.alphas_ + EPSILON, - criterion_, - "--", - color=color, - linewidth=3, - label="%s criterion" % name, - ) - plt.axvline( - model.alpha_ + EPSILON, - color=color, - linewidth=3, - label="alpha: %s estimate" % name, - ) - plt.xlabel(r"$\alpha$") - plt.ylabel("criterion") - - -plt.figure() -plot_ic_criterion(model_aic, "AIC", "b") -plot_ic_criterion(model_bic, "BIC", "r") -plt.legend() -plt.title("Information-criterion for model selection (training time %.3fs)" % t_bic) - -# ############################################################################# -# LassoCV: coordinate descent +n_random_features = 14 +X_random = pd.DataFrame( + rng.randn(X.shape[0], n_random_features), + columns=[f"random_{i:02d}" for i in range(n_random_features)], +) +X = pd.concat([X, X_random], axis=1) +# Show only a subset of the columns +X[X.columns[::3]].head() + +# %% +# Selecting Lasso via an information criterion +# -------------------------------------------- +# :class:`~sklearn.linear_model.LassoLarsIC` provides a Lasso estimator that +# uses the Akaike information criterion (AIC) or the Bayes information +# criterion (BIC) to select the optimal value of the regularization +# parameter alpha. +# +# Before fitting the model, we will standardize the data with a +# :class:`~sklearn.preprocessing.StandardScaler`. In addition, we will +# measure the time to fit and tune the hyperparameter alpha in order to +# compare with the cross-validation strategy. +# +# We will first fit a Lasso model with the AIC criterion. +import time +from sklearn.preprocessing import StandardScaler +from sklearn.linear_model import LassoLarsIC +from sklearn.pipeline import make_pipeline + +start_time = time.time() +lasso_lars_ic = make_pipeline( + StandardScaler(), LassoLarsIC(criterion="aic", normalize=False) +).fit(X, y) +fit_time = time.time() - start_time + +# %% +# We store the AIC metric for each value of alpha used during `fit`. 
+results = pd.DataFrame( + { + "alphas": lasso_lars_ic[-1].alphas_, + "AIC criterion": lasso_lars_ic[-1].criterion_, + } +).set_index("alphas") +alpha_aic = lasso_lars_ic[-1].alpha_ + +# %% +# Now, we perform the same analysis using the BIC criterion. +lasso_lars_ic.set_params(lassolarsic__criterion="bic").fit(X, y) +results["BIC criterion"] = lasso_lars_ic[-1].criterion_ +alpha_bic = lasso_lars_ic[-1].alpha_ + + +# %% +# We can check which value of `alpha` leads to the minimum AIC and BIC. +def highlight_min(x): + x_min = x.min() + return ["font-weight: bold" if v == x_min else "" for v in x] + + +results.style.apply(highlight_min) + +# %% +# Finally, we can plot the AIC and BIC values for the different alpha values. +# The vertical lines in the plot correspond to the alpha chosen for each +# criterion. The selected alpha corresponds to the minimum of the AIC or BIC +# criterion. +ax = results.plot() +ax.vlines( + alpha_aic, + results["AIC criterion"].min(), + results["AIC criterion"].max(), + label="alpha: AIC estimate", + linestyles="--", + color="tab:blue", +) +ax.vlines( + alpha_bic, + results["BIC criterion"].min(), + results["BIC criterion"].max(), + label="alpha: BIC estimate", + linestyle="--", + color="tab:orange", +) +ax.set_xlabel(r"$\alpha$") +ax.set_ylabel("criterion") +ax.set_xscale("log") +ax.legend() +_ = ax.set_title( + f"Information-criterion for model selection (training time {fit_time:.2f}s)" +) -# Compute paths -print("Computing regularization path using the coordinate descent lasso...") -t1 = time.time() -model = LassoCV(cv=20).fit(X, y) -t_lasso_cv = time.time() - t1 +# %% +# Model selection with an information-criterion is very fast. It relies on +# computing the criterion on the in-sample set provided to `fit`. Both criteria +# estimate the model generalization error based on the training set error and +# penalize this overly optimistic error. However, this penalty relies on a +# proper estimation of the degrees of freedom and the noise variance. Both are +# derived for large samples (asymptotic results) and assume the model is +# correct, i.e. that the data are actually generated by this model. +# +# These models also tend to break when the problem is badly conditioned (more +# features than samples). It is then required to provide an estimate of the +# noise variance. +# +# Selecting Lasso via cross-validation +# ------------------------------------ +# The Lasso estimator can be implemented with different solvers: coordinate +# descent and least angle regression. They differ with regards to their +# execution speed and sources of numerical errors. +# +# In scikit-learn, two different estimators are available with integrated +# cross-validation: :class:`~sklearn.linear_model.LassoCV` and +# :class:`~sklearn.linear_model.LassoLarsCV` that respectively solve the +# problem with coordinate descent and least angle regression. +# +# In the remainder of this section, we will present both approaches. For both +# algorithms, we will use a 20-fold cross-validation strategy. +# +# Lasso via coordinate descent +# ............................ +# Let's start by making the hyperparameter tuning using +# :class:`~sklearn.linear_model.LassoCV`. 
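+#
+# As a reminder of the documented behaviour of
+# :class:`~sklearn.linear_model.LassoCV`: it fits a coordinate-descent path on
+# a grid of `alphas` (100 values by default) for every cross-validation split,
+# stores the corresponding test errors in `mse_path_`, selects the `alpha_`
+# minimizing the mean error across folds, and finally refits the model on the
+# full dataset with this value.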
+from sklearn.linear_model import LassoCV + +start_time = time.time() +model = make_pipeline(StandardScaler(), LassoCV(cv=20)).fit(X, y) +fit_time = time.time() - start_time + +# %% +import matplotlib.pyplot as plt -# Display results -plt.figure() ymin, ymax = 2300, 3800 -plt.semilogx(model.alphas_ + EPSILON, model.mse_path_, ":") +lasso = model[-1] +plt.semilogx(lasso.alphas_, lasso.mse_path_, linestyle=":") plt.plot( - model.alphas_ + EPSILON, - model.mse_path_.mean(axis=-1), - "k", + lasso.alphas_, + lasso.mse_path_.mean(axis=-1), + color="black", label="Average across the folds", linewidth=2, ) -plt.axvline( - model.alpha_ + EPSILON, linestyle="--", color="k", label="alpha: CV estimate" -) - -plt.legend() +plt.axvline(lasso.alpha_, linestyle="--", color="black", label="alpha: CV estimate") +plt.ylim(ymin, ymax) plt.xlabel(r"$\alpha$") plt.ylabel("Mean square error") -plt.title( - "Mean square error on each fold: coordinate descent (train time: %.2fs)" - % t_lasso_cv +plt.legend() +_ = plt.title( + f"Mean square error on each fold: coordinate descent (train time: {fit_time:.2f}s)" ) -plt.axis("tight") -plt.ylim(ymin, ymax) -# ############################################################################# -# LassoLarsCV: least angle regression +# %% +# Lasso via least angle regression +# ................................ +# Let's start by making the hyperparameter tuning using +# :class:`~sklearn.linear_model.LassoLarsCV`. +from sklearn.linear_model import LassoLarsCV -# Compute paths -print("Computing regularization path using the Lars lasso...") -t1 = time.time() -model = LassoLarsCV(cv=20, normalize=False).fit(X, y) -t_lasso_lars_cv = time.time() - t1 +start_time = time.time() +model = make_pipeline(StandardScaler(), LassoLarsCV(cv=20, normalize=False)).fit(X, y) +fit_time = time.time() - start_time -# Display results -plt.figure() -plt.semilogx(model.cv_alphas_ + EPSILON, model.mse_path_, ":") +# %% +lasso = model[-1] +plt.semilogx(lasso.cv_alphas_, lasso.mse_path_, ":") plt.semilogx( - model.cv_alphas_ + EPSILON, - model.mse_path_.mean(axis=-1), - "k", + lasso.cv_alphas_, + lasso.mse_path_.mean(axis=-1), + color="black", label="Average across the folds", linewidth=2, ) -plt.axvline(model.alpha_, linestyle="--", color="k", label="alpha CV") -plt.legend() +plt.axvline(lasso.alpha_, linestyle="--", color="black", label="alpha CV") +plt.ylim(ymin, ymax) plt.xlabel(r"$\alpha$") plt.ylabel("Mean square error") -plt.title("Mean square error on each fold: Lars (train time: %.2fs)" % t_lasso_lars_cv) -plt.axis("tight") -plt.ylim(ymin, ymax) - -plt.show() +plt.legend() +_ = plt.title(f"Mean square error on each fold: Lars (train time: {fit_time:.2f}s)") + +# %% +# Summary of cross-validation approach +# .................................... +# Both algorithms give roughly the same results. +# +# Lars computes a solution path only for each kink in the path. As a result, it +# is very efficient when there are only of few kinks, which is the case if +# there are few features or samples. Also, it is able to compute the full path +# without setting any hyperparameter. On the opposite, coordinate descent +# computes the path points on a pre-specified grid (here we use the default). +# Thus it is more efficient if the number of grid points is smaller than the +# number of kinks in the path. Such a strategy can be interesting if the number +# of features is really large and there are enough samples to be selected in +# each of the cross-validation fold. 
In terms of numerical errors, for heavily +# correlated variables, Lars will accumulate more errors, while the coordinate +# descent algorithm will only sample the path on a grid. +# +# Note how the optimal value of alpha varies for each fold. This illustrates +# why nested-cross validation is a good strategy when trying to evaluate the +# performance of a method for which a parameter is chosen by cross-validation: +# this choice of parameter may not be optimal for a final evaluation on +# unseen test set only. +# +# Conclusion +# ---------- +# In this tutorial, we presented two approaches for selecting the best +# hyperparameter `alpha`: one strategy finds the optimal value of `alpha` +# by only using the training set and some information criterion, and another +# strategy is based on cross-validation. +# +# In this example, both approaches are working similarly. The in-sample +# hyperparameter selection even shows its efficacy in terms of computational +# performance. However, it can only be used when the number of samples is large +# enough compared to the number of features. +# +# That's why hyperparameter optimization via cross-validation is a safe +# strategy: it works in different settings. diff --git a/sklearn/linear_model/_least_angle.py b/sklearn/linear_model/_least_angle.py index 351aa20f549c2..1780cdf2ccf48 100644 --- a/sklearn/linear_model/_least_angle.py +++ b/sklearn/linear_model/_least_angle.py @@ -19,6 +19,7 @@ from ._base import LinearModel from ._base import _deprecate_normalize +from ._base import LinearRegression from ..base import RegressorMixin, MultiOutputMixin # mypy error: Module 'sklearn.utils' has no attribute 'arrayfuncs' @@ -1961,17 +1962,17 @@ class LassoLarsIC(LassoLars): (1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1 - AIC is the Akaike information criterion and BIC is the Bayes - Information criterion. Such criteria are useful to select the value + AIC is the Akaike information criterion [2]_ and BIC is the Bayes + Information criterion [3]_. Such criteria are useful to select the value of the regularization parameter by making a trade-off between the goodness of fit and the complexity of the model. A good model should explain well the data while being simple. - Read more in the :ref:`User Guide `. + Read more in the :ref:`User Guide `. Parameters ---------- - criterion : {'bic' , 'aic'}, default='aic' + criterion : {'aic', 'bic'}, default='aic' The type of criterion to use. fit_intercept : bool, default=True @@ -2025,6 +2026,13 @@ class LassoLarsIC(LassoLars): As a consequence using LassoLarsIC only makes sense for problems where a sparse solution is expected and/or reached. + noise_variance : float, default=None + The estimated noise variance of the data. If `None`, an unbiased + estimate is computed by an OLS model. However, it is only possible + in the case where `n_samples > n_features + fit_intercept`. + + .. versionadded:: 1.1 + Attributes ---------- coef_ : array-like of shape (n_features,) @@ -2049,8 +2057,13 @@ class LassoLarsIC(LassoLars): criterion_ : array-like of shape (n_alphas,) The value of the information criteria ('aic', 'bic') across all alphas. The alpha which has the smallest information criterion is - chosen. This value is larger by a factor of ``n_samples`` compared to - Eqns. 2.15 and 2.16 in (Zou et al, 2007). + chosen, as specified in [1]_. + + noise_variance_ : float + The estimated noise variance from the data used to compute the + criterion. + + .. 
versionadded:: 1.1 n_features_in_ : int Number of features seen during :term:`fit`. @@ -2078,20 +2091,31 @@ class LassoLarsIC(LassoLars): Notes ----- - The estimation of the number of degrees of freedom is given by: + The number of degrees of freedom is computed as in [1]_. + + To have more details regarding the mathematical formulation of the + AIC and BIC criteria, please refer to :ref:`User Guide `. - "On the degrees of freedom of the lasso" - Hui Zou, Trevor Hastie, and Robert Tibshirani - Ann. Statist. Volume 35, Number 5 (2007), 2173-2192. + References + ---------- + .. [1] :arxiv:`Zou, Hui, Trevor Hastie, and Robert Tibshirani. + "On the degrees of freedom of the lasso." + The Annals of Statistics 35.5 (2007): 2173-2192. + <0712.0881>` - https://en.wikipedia.org/wiki/Akaike_information_criterion - https://en.wikipedia.org/wiki/Bayesian_information_criterion + .. [2] `Wikipedia entry on the Akaike information criterion + `_ + + .. [3] `Wikipedia entry on the Bayesian information criterion + `_ Examples -------- >>> from sklearn import linear_model >>> reg = linear_model.LassoLarsIC(criterion='bic', normalize=False) - >>> reg.fit([[-1, 1], [0, 0], [1, 1]], [-1.1111, 0, -1.1111]) + >>> X = [[-2, 2], [-1, 1], [0, 0], [1, 1], [2, 2]] + >>> y = [-2.2222, -1.1111, 0, -1.1111, -2.2222] + >>> reg.fit(X, y) LassoLarsIC(criterion='bic', normalize=False) >>> print(reg.coef_) [ 0. -1.11...] @@ -2109,6 +2133,7 @@ def __init__( eps=np.finfo(float).eps, copy_X=True, positive=False, + noise_variance=None, ): self.criterion = criterion self.fit_intercept = fit_intercept @@ -2120,6 +2145,7 @@ def __init__( self.precompute = precompute self.eps = eps self.fit_path = True + self.noise_variance = noise_variance def _more_tags(self): return {"multioutput": False} @@ -2177,17 +2203,17 @@ def fit(self, X, y, copy_X=None): n_samples = X.shape[0] if self.criterion == "aic": - K = 2 # AIC + criterion_factor = 2 elif self.criterion == "bic": - K = log(n_samples) # BIC + criterion_factor = log(n_samples) else: - raise ValueError("criterion should be either bic or aic") - - R = y[:, np.newaxis] - np.dot(X, coef_path_) # residuals - mean_squared_error = np.mean(R ** 2, axis=0) - sigma2 = np.var(y) + raise ValueError( + f"criterion should be either bic or aic, got {self.criterion!r}" + ) - df = np.zeros(coef_path_.shape[1], dtype=int) # Degrees of freedom + residuals = y[:, np.newaxis] - np.dot(X, coef_path_) + residuals_sum_squares = np.sum(residuals ** 2, axis=0) + degrees_of_freedom = np.zeros(coef_path_.shape[1], dtype=int) for k, coef in enumerate(coef_path_.T): mask = np.abs(coef) > np.finfo(coef.dtype).eps if not np.any(mask): @@ -2195,16 +2221,61 @@ def fit(self, X, y, copy_X=None): # get the number of degrees of freedom equal to: # Xc = X[:, mask] # Trace(Xc * inv(Xc.T, Xc) * Xc.T) ie the number of non-zero coefs - df[k] = np.sum(mask) + degrees_of_freedom[k] = np.sum(mask) self.alphas_ = alphas_ - eps64 = np.finfo("float64").eps + + if self.noise_variance is None: + self.noise_variance_ = self._estimate_noise_variance( + X, y, positive=self.positive + ) + else: + self.noise_variance_ = self.noise_variance + self.criterion_ = ( - n_samples * mean_squared_error / (sigma2 + eps64) + K * df - ) # Eqns. 
2.15--16 in (Zou et al, 2007) + n_samples * np.log(2 * np.pi * self.noise_variance_) + + residuals_sum_squares / self.noise_variance_ + + criterion_factor * degrees_of_freedom + ) n_best = np.argmin(self.criterion_) self.alpha_ = alphas_[n_best] self.coef_ = coef_path_[:, n_best] self._set_intercept(Xmean, ymean, Xstd) return self + + def _estimate_noise_variance(self, X, y, positive): + """Compute an estimate of the variance with an OLS model. + + Parameters + ---------- + X : ndarray of shape (n_samples, n_features) + Data to be fitted by the OLS model. We expect the data to be + centered. + + y : ndarray of shape (n_samples,) + Associated target. + + positive : bool, default=False + Restrict coefficients to be >= 0. This should be inline with + the `positive` parameter from `LassoLarsIC`. + + Returns + ------- + noise_variance : float + An estimator of the noise variance of an OLS model. + """ + if X.shape[0] <= X.shape[1] + self.fit_intercept: + raise ValueError( + f"You are using {self.__class__.__name__} in the case where the number " + "of samples is smaller than the number of features. In this setting, " + "getting a good estimate for the variance of the noise is not " + "possible. Provide an estimate of the noise variance in the " + "constructor." + ) + # X and y are already centered and we don't need to fit with an intercept + ols_model = LinearRegression(positive=positive, fit_intercept=False) + y_pred = ols_model.fit(X, y).predict(X) + return np.sum((y - y_pred) ** 2) / ( + X.shape[0] - X.shape[1] - self.fit_intercept + ) diff --git a/sklearn/linear_model/tests/test_least_angle.py b/sklearn/linear_model/tests/test_least_angle.py index 1e4c39cfe254d..0db0a2fbb29ff 100644 --- a/sklearn/linear_model/tests/test_least_angle.py +++ b/sklearn/linear_model/tests/test_least_angle.py @@ -5,6 +5,8 @@ from scipy import linalg from sklearn.base import clone from sklearn.model_selection import train_test_split +from sklearn.pipeline import make_pipeline +from sklearn.preprocessing import StandardScaler from sklearn.utils._testing import assert_allclose from sklearn.utils._testing import assert_array_almost_equal from sklearn.utils._testing import ignore_warnings @@ -898,8 +900,8 @@ def test_copy_X_with_auto_gram(): def test_lars_dtype_match(LARS, has_coef_path, args, dtype): # The test ensures that the fit method preserves input dtype rng = np.random.RandomState(0) - X = rng.rand(6, 6).astype(dtype) - y = rng.rand(6).astype(dtype) + X = rng.rand(20, 6).astype(dtype) + y = rng.rand(20).astype(dtype) model = LARS(**args) model.fit(X, y) @@ -928,8 +930,8 @@ def test_lars_numeric_consistency(LARS, has_coef_path, args): atol = 1e-5 rng = np.random.RandomState(0) - X_64 = rng.rand(6, 6) - y_64 = rng.rand(6) + X_64 = rng.rand(10, 6) + y_64 = rng.rand(10) model_64 = LARS(**args).fit(X_64, y_64) model_32 = LARS(**args).fit(X_64.astype(np.float32), y_64.astype(np.float32)) @@ -938,3 +940,44 @@ def test_lars_numeric_consistency(LARS, has_coef_path, args): if has_coef_path: assert_allclose(model_64.coef_path_, model_32.coef_path_, rtol=rtol, atol=atol) assert_allclose(model_64.intercept_, model_32.intercept_, rtol=rtol, atol=atol) + + +@pytest.mark.parametrize("criterion", ["aic", "bic"]) +def test_lassolarsic_alpha_selection(criterion): + """Check that we properly compute the AIC and BIC score. + + In this test, we reproduce the example of the Fig. 2 of Zou et al. + (reference [1] in LassoLarsIC) In this example, only 7 features should be + selected. 
+ """ + model = make_pipeline( + StandardScaler(), LassoLarsIC(criterion=criterion, normalize=False) + ) + model.fit(X, y) + + best_alpha_selected = np.argmin(model[-1].criterion_) + assert best_alpha_selected == 7 + + +@pytest.mark.parametrize("fit_intercept", [True, False]) +def test_lassolarsic_noise_variance(fit_intercept): + """Check the behaviour when `n_samples` < `n_features` and that one needs + to provide the noise variance.""" + rng = np.random.RandomState(0) + X, y = datasets.make_regression( + n_samples=10, n_features=11 - fit_intercept, random_state=rng + ) + + model = make_pipeline( + StandardScaler(), LassoLarsIC(fit_intercept=fit_intercept, normalize=False) + ) + + err_msg = ( + "You are using LassoLarsIC in the case where the number of samples is smaller" + " than the number of features" + ) + with pytest.raises(ValueError, match=err_msg): + model.fit(X, y) + + model.set_params(lassolarsic__noise_variance=1.0) + model.fit(X, y).predict(X) diff --git a/sklearn/mixture/_gaussian_mixture.py b/sklearn/mixture/_gaussian_mixture.py index 995366b247778..42b76e05de6ae 100644 --- a/sklearn/mixture/_gaussian_mixture.py +++ b/sklearn/mixture/_gaussian_mixture.py @@ -813,6 +813,9 @@ def _n_parameters(self): def bic(self, X): """Bayesian information criterion for the current model on the input X. + You can refer to this :ref:`mathematical section ` for more + details regarding the formulation of the BIC used. + Parameters ---------- X : array of shape (n_samples, n_dimensions) @@ -830,6 +833,9 @@ def bic(self, X): def aic(self, X): """Akaike information criterion for the current model on the input X. + You can refer to this :ref:`mathematical section ` for more + details regarding the formulation of the AIC used. + Parameters ---------- X : array of shape (n_samples, n_dimensions) diff --git a/sklearn/utils/estimator_checks.py b/sklearn/utils/estimator_checks.py index 074c9e38b77a2..37ab3dc86d642 100644 --- a/sklearn/utils/estimator_checks.py +++ b/sklearn/utils/estimator_checks.py @@ -659,6 +659,11 @@ def _set_checking_parameters(estimator): # This is ugly :-/ estimator.n_components = 1 + if name == "LassoLarsIC": + # Noise variance estimation does not work when `n_samples < n_features`. + # We need to provide the noise variance explicitly. + estimator.set_params(noise_variance=1.0) + if hasattr(estimator, "n_clusters"): estimator.n_clusters = min(estimator.n_clusters, 2)
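
As a closing illustration (not part of the patch itself), here is a minimal
sketch of how the new `noise_variance` parameter is meant to be used once this
change lands, assuming the error-raising behaviour described in the changelog
and in `_estimate_noise_variance` above::

    import numpy as np
    from sklearn.linear_model import LassoLarsIC

    rng = np.random.RandomState(0)
    # More features than samples: the OLS-based noise estimate is unavailable.
    X = rng.randn(10, 20)
    y = rng.randn(10)

    try:
        # Expected to raise a ValueError since `noise_variance` is not given.
        LassoLarsIC(normalize=False).fit(X, y)
    except ValueError as exc:
        print(exc)

    # Passing an explicit noise variance makes the criterion computable.
    model = LassoLarsIC(noise_variance=1.0, normalize=False).fit(X, y)
    print(model.alpha_, model.criterion_.shape)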