diff --git a/doc/modules/linear_model.rst b/doc/modules/linear_model.rst
index 18e1bf468dc62..7243990bb5ffe 100644
--- a/doc/modules/linear_model.rst
+++ b/doc/modules/linear_model.rst
@@ -298,6 +298,7 @@ features, it is often faster than :class:`LassoCV`.
 
 .. centered:: |lasso_cv_1| |lasso_cv_2|
 
+.. _lasso_lars_ic:
 
 Information-criteria based model selection
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -306,22 +307,92 @@ Alternatively, the estimator :class:`LassoLarsIC` proposes to use the
 Akaike information criterion (AIC) and the Bayes Information criterion (BIC).
 It is a computationally cheaper alternative to find the optimal value of alpha
 as the regularization path is computed only once instead of k+1 times
-when using k-fold cross-validation. However, such criteria needs a
-proper estimation of the degrees of freedom of the solution, are
-derived for large samples (asymptotic results) and assume the model
-is correct, i.e. that the data are generated by this model.
-They also tend to break when the problem is badly conditioned
-(more features than samples).
-
-.. figure:: ../auto_examples/linear_model/images/sphx_glr_plot_lasso_model_selection_001.png
-   :target: ../auto_examples/linear_model/plot_lasso_model_selection.html
+when using k-fold cross-validation.
+
+Indeed, these criteria are computed on the in-sample training set. In short,
+they penalize the over-optimistic scores of the different Lasso models by
+their flexibility (cf. the "Mathematical details" section below).
+
+However, such criteria need a proper estimation of the degrees of freedom of
+the solution, are derived for large samples (asymptotic results) and assume
+the correct model is among the candidates under investigation. They also tend
+to break when the problem is badly conditioned (e.g. more features than samples).
+
+.. figure:: ../auto_examples/linear_model/images/sphx_glr_plot_lasso_lars_ic_001.png
+   :target: ../auto_examples/linear_model/plot_lasso_lars_ic.html
    :align: center
    :scale: 50%
+
+.. _aic_bic:
+
+**Mathematical details**
+
+The definition of AIC (and thus BIC) might differ in the literature. In this
+section, we give more information regarding the criterion computed in
+scikit-learn. The AIC criterion is defined as:
+
+.. math::
+    AIC = -2 \log(\hat{L}) + 2 d
+
+where :math:`\hat{L}` is the maximum likelihood of the model and
+:math:`d` is the number of parameters (also referred to as degrees of
+freedom in the previous section).
+
+The definition of BIC replaces the constant :math:`2` by :math:`\log(N)`:
+
+.. math::
+    BIC = -2 \log(\hat{L}) + \log(N) d
+
+where :math:`N` is the number of samples.
+
+For a linear Gaussian model, the maximum log-likelihood is defined as:
+
+.. math::
+    \log(\hat{L}) = - \frac{n}{2} \log(2 \pi) - \frac{n}{2} \log(\sigma^2) - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{2\sigma^2}
+
+where :math:`\sigma^2` is an estimate of the noise variance,
+:math:`y_i` and :math:`\hat{y}_i` are respectively the true and predicted
+targets, and :math:`n` is the number of samples.
+
+Plugging the maximum log-likelihood in the AIC formula yields:
+
+.. math::
+    AIC = n \log(2 \pi \sigma^2) + \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sigma^2} + 2 d
+
+The first term of the above expression is sometimes discarded since it is a
+constant when :math:`\sigma^2` is provided. In addition,
+it is sometimes stated that the AIC is equivalent to the :math:`C_p` statistic
+[12]_. In a strict sense, however, it is equivalent only up to some constant
+and a multiplicative factor.
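+
+As an illustration only (this is not the code used internally by
+:class:`LassoLarsIC`, and the helper name below is purely illustrative), the
+two expressions above can be evaluated directly with NumPy, assuming that the
+residuals, an estimate of the noise variance and the number of non-zero
+coefficients (playing the role of :math:`d`) are already available::
+
+    import numpy as np
+
+    def aic_bic(y_true, y_pred, noise_variance, n_nonzero_coefs):
+        """Sketch of the AIC/BIC formulas above (illustrative only)."""
+        n = y_true.shape[0]
+        rss = np.sum((y_true - y_pred) ** 2)
+        # Term coming from the Gaussian maximum log-likelihood, i.e. -2 log(L).
+        log_lik = n * np.log(2 * np.pi * noise_variance)
+        log_lik += rss / noise_variance
+        aic = log_lik + 2 * n_nonzero_coefs
+        # BIC replaces the constant 2 by log(n).
+        bic = log_lik + np.log(n) * n_nonzero_coefs
+        return aic, bic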
+ +At last, we mentioned above that :math:`\sigma^2` is an estimate of the +noise variance. In :class:`LassoLarsIC` when the parameter `noise_variance` is +not provided (default), the noise variance is estimated via the unbiased +estimator [13]_ defined as: + +.. math:: + \sigma^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - p} + +where :math:`p` is the number of features and :math:`\hat{y}_i` is the +predicted target using an ordinary least squares regression. Note, that this +formula is valid only when `n_samples > n_features`. .. topic:: Examples: * :ref:`sphx_glr_auto_examples_linear_model_plot_lasso_model_selection.py` + * :ref:`sphx_glr_auto_examples_linear_model_plot_lasso_lars_ic.py` + +.. topic:: References + + .. [12] :arxiv:`Zou, Hui, Trevor Hastie, and Robert Tibshirani. + "On the degrees of freedom of the lasso." + The Annals of Statistics 35.5 (2007): 2173-2192. + <0712.0881.pdf>` + + .. [13] `Cherkassky, Vladimir, and Yunqian Ma. + "Comparison of model selection for regression." + Neural computation 15.7 (2003): 1691-1714. + `_ Comparison with the regularization parameter of SVM ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -934,8 +1005,8 @@ to warm-starting (see :term:`Glossary `). .. [6] Mark Schmidt, Nicolas Le Roux, and Francis Bach: `Minimizing Finite Sums with the Stochastic Average Gradient. `_ - .. [7] Aaron Defazio, Francis Bach, Simon Lacoste-Julien: - :arxiv:`SAGA: A Fast Incremental Gradient Method With Support for + .. [7] Aaron Defazio, Francis Bach, Simon Lacoste-Julien: + :arxiv:`SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. <1407.0202>` .. [8] https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm diff --git a/doc/whats_new/v1.1.rst b/doc/whats_new/v1.1.rst index cc607aedb3946..94fc0ca160396 100644 --- a/doc/whats_new/v1.1.rst +++ b/doc/whats_new/v1.1.rst @@ -141,6 +141,21 @@ Changelog multilabel classification. :pr:`19689` by :user:`Guillaume Lemaitre `. +:mod:`sklearn.linear_model` +........................... + +- |API| :class:`linear_model.LassoLarsIC` now exposes `noise_variance` as + a parameter in order to provide an estimate of the noise variance. + This is particularly relevant when `n_features > n_samples` and the + estimator of the noise variance cannot be computed. + :pr:`21481` by :user:`Guillaume Lemaitre ` + +- |Fix| :class:`linear_model.LassoLarsIC` now correctly computes AIC + and BIC. An error is now raised when `n_features > n_samples` and + when the noise variance is not provided. + :pr:`21481` by :user:`Guillaume Lemaitre ` and + :user:`Andrés Babino `. + :mod:`sklearn.metrics` ...................... diff --git a/examples/linear_model/plot_lasso_lars_ic.py b/examples/linear_model/plot_lasso_lars_ic.py new file mode 100644 index 0000000000000..2f5392696ecc9 --- /dev/null +++ b/examples/linear_model/plot_lasso_lars_ic.py @@ -0,0 +1,123 @@ +""" +============================================== +Lasso model selection via information criteria +============================================== + +This example reproduces the example of Fig. 2 of [ZHT2007]_. A +:class:`~sklearn.linear_model.LassoLarsIC` estimator is fit on a +diabetes dataset and the AIC and the BIC criteria are used to select +the best model. + +.. note:: + It is important to note that the optimization to find `alpha` with + :class:`~sklearn.linear_model.LassoLarsIC` relies on the AIC or BIC + criteria that are computed in-sample, thus on the training set directly. 
+    This approach differs from the cross-validation procedure. For a comparison
+    of the two approaches, you can refer to the following example:
+    :ref:`sphx_glr_auto_examples_linear_model_plot_lasso_model_selection.py`.
+
+.. topic:: References
+
+    .. [ZHT2007] :arxiv:`Zou, Hui, Trevor Hastie, and Robert Tibshirani.
+        "On the degrees of freedom of the lasso."
+        The Annals of Statistics 35.5 (2007): 2173-2192.
+        <0712.0881>`
+"""

+
+# Author: Alexandre Gramfort
+#         Guillaume Lemaitre
+# License: BSD 3 clause

+
+# %%
+import sklearn
+
+sklearn.set_config(display="diagram")
+
+# %%
+# We will use the diabetes dataset.
+from sklearn.datasets import load_diabetes
+
+X, y = load_diabetes(return_X_y=True, as_frame=True)
+n_samples = X.shape[0]
+X.head()
+
+# %%
+# Scikit-learn provides an estimator called
+# :class:`~sklearn.linear_model.LassoLarsIC` that uses either Akaike's
+# information criterion (AIC) or the Bayesian information criterion (BIC) to
+# select the best model. Before fitting
+# this model, we will scale the dataset.
+#
+# In the following, we are going to fit two models to compare the values
+# reported by AIC and BIC.
+from sklearn.preprocessing import StandardScaler
+from sklearn.linear_model import LassoLarsIC
+from sklearn.pipeline import make_pipeline
+
+lasso_lars_ic = make_pipeline(
+    StandardScaler(), LassoLarsIC(criterion="aic", normalize=False)
+).fit(X, y)
+
+
+# %%
+# To be in line with the definition in [ZHT2007]_, we need to rescale the
+# AIC and the BIC. Indeed, Zou et al. are ignoring some constant terms
+# compared to the original definition of AIC derived from the maximum
+# log-likelihood of a linear model. You can refer to the
+# :ref:`mathematical details section of the User Guide <aic_bic>`.
+def zou_et_al_criterion_rescaling(criterion, n_samples, noise_variance):
+    """Rescale the information criterion to follow the definition of Zou et al."""
+    return criterion - n_samples * np.log(2 * np.pi * noise_variance) - n_samples
+
+
+# %%
+import numpy as np
+
+aic_criterion = zou_et_al_criterion_rescaling(
+    lasso_lars_ic[-1].criterion_,
+    n_samples,
+    lasso_lars_ic[-1].noise_variance_,
+)
+
+index_alpha_path_aic = np.flatnonzero(
+    lasso_lars_ic[-1].alphas_ == lasso_lars_ic[-1].alpha_
+)[0]
+
+# %%
+lasso_lars_ic.set_params(lassolarsic__criterion="bic").fit(X, y)
+
+bic_criterion = zou_et_al_criterion_rescaling(
+    lasso_lars_ic[-1].criterion_,
+    n_samples,
+    lasso_lars_ic[-1].noise_variance_,
+)
+
+index_alpha_path_bic = np.flatnonzero(
+    lasso_lars_ic[-1].alphas_ == lasso_lars_ic[-1].alpha_
+)[0]
+
+# %%
+# Now that we collected the AIC and BIC, we can also check that the minima
+# of both criteria happen at the same alpha, which allows us to simplify the
+# following plot.
+index_alpha_path_aic == index_alpha_path_bic
+
+# %%
+# Finally, we can plot the AIC and BIC criterion and the subsequently selected
+# regularization parameter.
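+# In the plot below, lower values of the (rescaled) criterion are better, and
+# the vertical dashed line marks the index of the alpha selected along the
+# Lasso path.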
+import matplotlib.pyplot as plt + +plt.plot(aic_criterion, color="tab:blue", marker="o", label="AIC criterion") +plt.plot(bic_criterion, color="tab:orange", marker="o", label="BIC criterion") +plt.vlines( + index_alpha_path_bic, + aic_criterion.min(), + aic_criterion.max(), + color="black", + linestyle="--", + label="Selected alpha", +) +plt.legend() +plt.ylabel("Information criterion") +plt.xlabel("Lasso model sequence") +_ = plt.title("Lasso model selection via AIC and BIC") diff --git a/examples/linear_model/plot_lasso_model_selection.py b/examples/linear_model/plot_lasso_model_selection.py index b2792c92f15bd..7cc05055b22d9 100644 --- a/examples/linear_model/plot_lasso_model_selection.py +++ b/examples/linear_model/plot_lasso_model_selection.py @@ -1,171 +1,258 @@ """ -=================================================== -Lasso model selection: Cross-Validation / AIC / BIC -=================================================== - -Use the Akaike information criterion (AIC), the Bayes Information -criterion (BIC) and cross-validation to select an optimal value -of the regularization parameter alpha of the :ref:`lasso` estimator. - -Results obtained with LassoLarsIC are based on AIC/BIC criteria. - -Information-criterion based model selection is very fast, but it -relies on a proper estimation of degrees of freedom, are -derived for large samples (asymptotic results) and assume the model -is correct, i.e. that the data are actually generated by this model. -They also tend to break when the problem is badly conditioned -(more features than samples). - -For cross-validation, we use 20-fold with 2 algorithms to compute the -Lasso path: coordinate descent, as implemented by the LassoCV class, and -Lars (least angle regression) as implemented by the LassoLarsCV class. -Both algorithms give roughly the same results. They differ with regards -to their execution speed and sources of numerical errors. - -Lars computes a path solution only for each kink in the path. As a -result, it is very efficient when there are only of few kinks, which is -the case if there are few features or samples. Also, it is able to -compute the full path without setting any meta parameter. On the -opposite, coordinate descent compute the path points on a pre-specified -grid (here we use the default). Thus it is more efficient if the number -of grid points is smaller than the number of kinks in the path. Such a -strategy can be interesting if the number of features is really large -and there are enough samples to select a large amount. In terms of -numerical errors, for heavily correlated variables, Lars will accumulate -more errors, while the coordinate descent algorithm will only sample the -path on a grid. - -Note how the optimal value of alpha varies for each fold. This -illustrates why nested-cross validation is necessary when trying to -evaluate the performance of a method for which a parameter is chosen by -cross-validation: this choice of parameter may not be optimal for unseen -data. +================================================= +Lasso model selection: AIC-BIC / cross-validation +================================================= +This example focuses on model selection for Lasso models that are +linear models with an L1 penalty for regression problems. + +Indeed, several strategies can be used to select the value of the +regularization parameter: via cross-validation or using an information +criterion, namely AIC or BIC. + +In what follows, we will discuss in details the different strategies. 
""" -# Author: Olivier Grisel, Gael Varoquaux, Alexandre Gramfort +# Author: Olivier Grisel +# Gael Varoquaux +# Alexandre Gramfort +# Guillaume Lemaitre # License: BSD 3 clause -import time +# %% +import sklearn -import numpy as np -import matplotlib.pyplot as plt +sklearn.set_config(display="diagram") -from sklearn.linear_model import LassoCV, LassoLarsCV, LassoLarsIC -from sklearn import datasets +# %% +# Dataset +# ------- +# In this example, we will use the diabetes dataset. +from sklearn.datasets import load_diabetes -# This is to avoid division by zero while doing np.log10 -EPSILON = 1e-4 +X, y = load_diabetes(return_X_y=True, as_frame=True) +X.head() -X, y = datasets.load_diabetes(return_X_y=True) +# %% +# In addition, we add some random features to the original data to +# better illustrate the feature selection performed by the Lasso model. +import numpy as np +import pandas as pd rng = np.random.RandomState(42) -X = np.c_[X, rng.randn(X.shape[0], 14)] # add some bad features - -# normalize data as done by Lars to allow for comparison -X /= np.sqrt(np.sum(X ** 2, axis=0)) - -# ############################################################################# -# LassoLarsIC: least angle regression with BIC/AIC criterion - -model_bic = LassoLarsIC(criterion="bic", normalize=False) -t1 = time.time() -model_bic.fit(X, y) -t_bic = time.time() - t1 -alpha_bic_ = model_bic.alpha_ - -model_aic = LassoLarsIC(criterion="aic", normalize=False) -model_aic.fit(X, y) -alpha_aic_ = model_aic.alpha_ - - -def plot_ic_criterion(model, name, color): - criterion_ = model.criterion_ - plt.semilogx( - model.alphas_ + EPSILON, - criterion_, - "--", - color=color, - linewidth=3, - label="%s criterion" % name, - ) - plt.axvline( - model.alpha_ + EPSILON, - color=color, - linewidth=3, - label="alpha: %s estimate" % name, - ) - plt.xlabel(r"$\alpha$") - plt.ylabel("criterion") - - -plt.figure() -plot_ic_criterion(model_aic, "AIC", "b") -plot_ic_criterion(model_bic, "BIC", "r") -plt.legend() -plt.title("Information-criterion for model selection (training time %.3fs)" % t_bic) - -# ############################################################################# -# LassoCV: coordinate descent +n_random_features = 14 +X_random = pd.DataFrame( + rng.randn(X.shape[0], n_random_features), + columns=[f"random_{i:02d}" for i in range(n_random_features)], +) +X = pd.concat([X, X_random], axis=1) +# Show only a subset of the columns +X[X.columns[::3]].head() + +# %% +# Selecting Lasso via an information criterion +# -------------------------------------------- +# :class:`~sklearn.linear_model.LassoLarsIC` provides a Lasso estimator that +# uses the Akaike information criterion (AIC) or the Bayes information +# criterion (BIC) to select the optimal value of the regularization +# parameter alpha. +# +# Before fitting the model, we will standardize the data with a +# :class:`~sklearn.preprocessing.StandardScaler`. In addition, we will +# measure the time to fit and tune the hyperparameter alpha in order to +# compare with the cross-validation strategy. +# +# We will first fit a Lasso model with the AIC criterion. +import time +from sklearn.preprocessing import StandardScaler +from sklearn.linear_model import LassoLarsIC +from sklearn.pipeline import make_pipeline + +start_time = time.time() +lasso_lars_ic = make_pipeline( + StandardScaler(), LassoLarsIC(criterion="aic", normalize=False) +).fit(X, y) +fit_time = time.time() - start_time + +# %% +# We store the AIC metric for each value of alpha used during `fit`. 
+results = pd.DataFrame( + { + "alphas": lasso_lars_ic[-1].alphas_, + "AIC criterion": lasso_lars_ic[-1].criterion_, + } +).set_index("alphas") +alpha_aic = lasso_lars_ic[-1].alpha_ + +# %% +# Now, we perform the same analysis using the BIC criterion. +lasso_lars_ic.set_params(lassolarsic__criterion="bic").fit(X, y) +results["BIC criterion"] = lasso_lars_ic[-1].criterion_ +alpha_bic = lasso_lars_ic[-1].alpha_ + + +# %% +# We can check which value of `alpha` leads to the minimum AIC and BIC. +def highlight_min(x): + x_min = x.min() + return ["font-weight: bold" if v == x_min else "" for v in x] + + +results.style.apply(highlight_min) + +# %% +# Finally, we can plot the AIC and BIC values for the different alpha values. +# The vertical lines in the plot correspond to the alpha chosen for each +# criterion. The selected alpha corresponds to the minimum of the AIC or BIC +# criterion. +ax = results.plot() +ax.vlines( + alpha_aic, + results["AIC criterion"].min(), + results["AIC criterion"].max(), + label="alpha: AIC estimate", + linestyles="--", + color="tab:blue", +) +ax.vlines( + alpha_bic, + results["BIC criterion"].min(), + results["BIC criterion"].max(), + label="alpha: BIC estimate", + linestyle="--", + color="tab:orange", +) +ax.set_xlabel(r"$\alpha$") +ax.set_ylabel("criterion") +ax.set_xscale("log") +ax.legend() +_ = ax.set_title( + f"Information-criterion for model selection (training time {fit_time:.2f}s)" +) -# Compute paths -print("Computing regularization path using the coordinate descent lasso...") -t1 = time.time() -model = LassoCV(cv=20).fit(X, y) -t_lasso_cv = time.time() - t1 +# %% +# Model selection with an information-criterion is very fast. It relies on +# computing the criterion on the in-sample set provided to `fit`. Both criteria +# estimate the model generalization error based on the training set error and +# penalize this overly optimistic error. However, this penalty relies on a +# proper estimation of the degrees of freedom and the noise variance. Both are +# derived for large samples (asymptotic results) and assume the model is +# correct, i.e. that the data are actually generated by this model. +# +# These models also tend to break when the problem is badly conditioned (more +# features than samples). It is then required to provide an estimate of the +# noise variance. +# +# Selecting Lasso via cross-validation +# ------------------------------------ +# The Lasso estimator can be implemented with different solvers: coordinate +# descent and least angle regression. They differ with regards to their +# execution speed and sources of numerical errors. +# +# In scikit-learn, two different estimators are available with integrated +# cross-validation: :class:`~sklearn.linear_model.LassoCV` and +# :class:`~sklearn.linear_model.LassoLarsCV` that respectively solve the +# problem with coordinate descent and least angle regression. +# +# In the remainder of this section, we will present both approaches. For both +# algorithms, we will use a 20-fold cross-validation strategy. +# +# Lasso via coordinate descent +# ............................ +# Let's start by making the hyperparameter tuning using +# :class:`~sklearn.linear_model.LassoCV`. 
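+#
+# As a reminder of the documented behaviour of
+# :class:`~sklearn.linear_model.LassoCV`: it fits a coordinate-descent path on
+# a grid of `alphas` (100 values by default) for every cross-validation split,
+# stores the corresponding test errors in `mse_path_`, selects the `alpha_`
+# minimizing the mean error across folds, and finally refits the model on the
+# full dataset with this value.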
+from sklearn.linear_model import LassoCV + +start_time = time.time() +model = make_pipeline(StandardScaler(), LassoCV(cv=20)).fit(X, y) +fit_time = time.time() - start_time + +# %% +import matplotlib.pyplot as plt -# Display results -plt.figure() ymin, ymax = 2300, 3800 -plt.semilogx(model.alphas_ + EPSILON, model.mse_path_, ":") +lasso = model[-1] +plt.semilogx(lasso.alphas_, lasso.mse_path_, linestyle=":") plt.plot( - model.alphas_ + EPSILON, - model.mse_path_.mean(axis=-1), - "k", + lasso.alphas_, + lasso.mse_path_.mean(axis=-1), + color="black", label="Average across the folds", linewidth=2, ) -plt.axvline( - model.alpha_ + EPSILON, linestyle="--", color="k", label="alpha: CV estimate" -) - -plt.legend() +plt.axvline(lasso.alpha_, linestyle="--", color="black", label="alpha: CV estimate") +plt.ylim(ymin, ymax) plt.xlabel(r"$\alpha$") plt.ylabel("Mean square error") -plt.title( - "Mean square error on each fold: coordinate descent (train time: %.2fs)" - % t_lasso_cv +plt.legend() +_ = plt.title( + f"Mean square error on each fold: coordinate descent (train time: {fit_time:.2f}s)" ) -plt.axis("tight") -plt.ylim(ymin, ymax) -# ############################################################################# -# LassoLarsCV: least angle regression +# %% +# Lasso via least angle regression +# ................................ +# Let's start by making the hyperparameter tuning using +# :class:`~sklearn.linear_model.LassoLarsCV`. +from sklearn.linear_model import LassoLarsCV -# Compute paths -print("Computing regularization path using the Lars lasso...") -t1 = time.time() -model = LassoLarsCV(cv=20, normalize=False).fit(X, y) -t_lasso_lars_cv = time.time() - t1 +start_time = time.time() +model = make_pipeline(StandardScaler(), LassoLarsCV(cv=20, normalize=False)).fit(X, y) +fit_time = time.time() - start_time -# Display results -plt.figure() -plt.semilogx(model.cv_alphas_ + EPSILON, model.mse_path_, ":") +# %% +lasso = model[-1] +plt.semilogx(lasso.cv_alphas_, lasso.mse_path_, ":") plt.semilogx( - model.cv_alphas_ + EPSILON, - model.mse_path_.mean(axis=-1), - "k", + lasso.cv_alphas_, + lasso.mse_path_.mean(axis=-1), + color="black", label="Average across the folds", linewidth=2, ) -plt.axvline(model.alpha_, linestyle="--", color="k", label="alpha CV") -plt.legend() +plt.axvline(lasso.alpha_, linestyle="--", color="black", label="alpha CV") +plt.ylim(ymin, ymax) plt.xlabel(r"$\alpha$") plt.ylabel("Mean square error") -plt.title("Mean square error on each fold: Lars (train time: %.2fs)" % t_lasso_lars_cv) -plt.axis("tight") -plt.ylim(ymin, ymax) - -plt.show() +plt.legend() +_ = plt.title(f"Mean square error on each fold: Lars (train time: {fit_time:.2f}s)") + +# %% +# Summary of cross-validation approach +# .................................... +# Both algorithms give roughly the same results. +# +# Lars computes a solution path only for each kink in the path. As a result, it +# is very efficient when there are only of few kinks, which is the case if +# there are few features or samples. Also, it is able to compute the full path +# without setting any hyperparameter. On the opposite, coordinate descent +# computes the path points on a pre-specified grid (here we use the default). +# Thus it is more efficient if the number of grid points is smaller than the +# number of kinks in the path. Such a strategy can be interesting if the number +# of features is really large and there are enough samples to be selected in +# each of the cross-validation fold. 
In terms of numerical errors, for heavily +# correlated variables, Lars will accumulate more errors, while the coordinate +# descent algorithm will only sample the path on a grid. +# +# Note how the optimal value of alpha varies for each fold. This illustrates +# why nested-cross validation is a good strategy when trying to evaluate the +# performance of a method for which a parameter is chosen by cross-validation: +# this choice of parameter may not be optimal for a final evaluation on +# unseen test set only. +# +# Conclusion +# ---------- +# In this tutorial, we presented two approaches for selecting the best +# hyperparameter `alpha`: one strategy finds the optimal value of `alpha` +# by only using the training set and some information criterion, and another +# strategy is based on cross-validation. +# +# In this example, both approaches are working similarly. The in-sample +# hyperparameter selection even shows its efficacy in terms of computational +# performance. However, it can only be used when the number of samples is large +# enough compared to the number of features. +# +# That's why hyperparameter optimization via cross-validation is a safe +# strategy: it works in different settings. diff --git a/sklearn/linear_model/_least_angle.py b/sklearn/linear_model/_least_angle.py index 351aa20f549c2..1780cdf2ccf48 100644 --- a/sklearn/linear_model/_least_angle.py +++ b/sklearn/linear_model/_least_angle.py @@ -19,6 +19,7 @@ from ._base import LinearModel from ._base import _deprecate_normalize +from ._base import LinearRegression from ..base import RegressorMixin, MultiOutputMixin # mypy error: Module 'sklearn.utils' has no attribute 'arrayfuncs' @@ -1961,17 +1962,17 @@ class LassoLarsIC(LassoLars): (1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1 - AIC is the Akaike information criterion and BIC is the Bayes - Information criterion. Such criteria are useful to select the value + AIC is the Akaike information criterion [2]_ and BIC is the Bayes + Information criterion [3]_. Such criteria are useful to select the value of the regularization parameter by making a trade-off between the goodness of fit and the complexity of the model. A good model should explain well the data while being simple. - Read more in the :ref:`User Guide `. + Read more in the :ref:`User Guide `. Parameters ---------- - criterion : {'bic' , 'aic'}, default='aic' + criterion : {'aic', 'bic'}, default='aic' The type of criterion to use. fit_intercept : bool, default=True @@ -2025,6 +2026,13 @@ class LassoLarsIC(LassoLars): As a consequence using LassoLarsIC only makes sense for problems where a sparse solution is expected and/or reached. + noise_variance : float, default=None + The estimated noise variance of the data. If `None`, an unbiased + estimate is computed by an OLS model. However, it is only possible + in the case where `n_samples > n_features + fit_intercept`. + + .. versionadded:: 1.1 + Attributes ---------- coef_ : array-like of shape (n_features,) @@ -2049,8 +2057,13 @@ class LassoLarsIC(LassoLars): criterion_ : array-like of shape (n_alphas,) The value of the information criteria ('aic', 'bic') across all alphas. The alpha which has the smallest information criterion is - chosen. This value is larger by a factor of ``n_samples`` compared to - Eqns. 2.15 and 2.16 in (Zou et al, 2007). + chosen, as specified in [1]_. + + noise_variance_ : float + The estimated noise variance from the data used to compute the + criterion. + + .. 
versionadded:: 1.1 n_features_in_ : int Number of features seen during :term:`fit`. @@ -2078,20 +2091,31 @@ class LassoLarsIC(LassoLars): Notes ----- - The estimation of the number of degrees of freedom is given by: + The number of degrees of freedom is computed as in [1]_. + + To have more details regarding the mathematical formulation of the + AIC and BIC criteria, please refer to :ref:`User Guide `. - "On the degrees of freedom of the lasso" - Hui Zou, Trevor Hastie, and Robert Tibshirani - Ann. Statist. Volume 35, Number 5 (2007), 2173-2192. + References + ---------- + .. [1] :arxiv:`Zou, Hui, Trevor Hastie, and Robert Tibshirani. + "On the degrees of freedom of the lasso." + The Annals of Statistics 35.5 (2007): 2173-2192. + <0712.0881>` - https://en.wikipedia.org/wiki/Akaike_information_criterion - https://en.wikipedia.org/wiki/Bayesian_information_criterion + .. [2] `Wikipedia entry on the Akaike information criterion + `_ + + .. [3] `Wikipedia entry on the Bayesian information criterion + `_ Examples -------- >>> from sklearn import linear_model >>> reg = linear_model.LassoLarsIC(criterion='bic', normalize=False) - >>> reg.fit([[-1, 1], [0, 0], [1, 1]], [-1.1111, 0, -1.1111]) + >>> X = [[-2, 2], [-1, 1], [0, 0], [1, 1], [2, 2]] + >>> y = [-2.2222, -1.1111, 0, -1.1111, -2.2222] + >>> reg.fit(X, y) LassoLarsIC(criterion='bic', normalize=False) >>> print(reg.coef_) [ 0. -1.11...] @@ -2109,6 +2133,7 @@ def __init__( eps=np.finfo(float).eps, copy_X=True, positive=False, + noise_variance=None, ): self.criterion = criterion self.fit_intercept = fit_intercept @@ -2120,6 +2145,7 @@ def __init__( self.precompute = precompute self.eps = eps self.fit_path = True + self.noise_variance = noise_variance def _more_tags(self): return {"multioutput": False} @@ -2177,17 +2203,17 @@ def fit(self, X, y, copy_X=None): n_samples = X.shape[0] if self.criterion == "aic": - K = 2 # AIC + criterion_factor = 2 elif self.criterion == "bic": - K = log(n_samples) # BIC + criterion_factor = log(n_samples) else: - raise ValueError("criterion should be either bic or aic") - - R = y[:, np.newaxis] - np.dot(X, coef_path_) # residuals - mean_squared_error = np.mean(R ** 2, axis=0) - sigma2 = np.var(y) + raise ValueError( + f"criterion should be either bic or aic, got {self.criterion!r}" + ) - df = np.zeros(coef_path_.shape[1], dtype=int) # Degrees of freedom + residuals = y[:, np.newaxis] - np.dot(X, coef_path_) + residuals_sum_squares = np.sum(residuals ** 2, axis=0) + degrees_of_freedom = np.zeros(coef_path_.shape[1], dtype=int) for k, coef in enumerate(coef_path_.T): mask = np.abs(coef) > np.finfo(coef.dtype).eps if not np.any(mask): @@ -2195,16 +2221,61 @@ def fit(self, X, y, copy_X=None): # get the number of degrees of freedom equal to: # Xc = X[:, mask] # Trace(Xc * inv(Xc.T, Xc) * Xc.T) ie the number of non-zero coefs - df[k] = np.sum(mask) + degrees_of_freedom[k] = np.sum(mask) self.alphas_ = alphas_ - eps64 = np.finfo("float64").eps + + if self.noise_variance is None: + self.noise_variance_ = self._estimate_noise_variance( + X, y, positive=self.positive + ) + else: + self.noise_variance_ = self.noise_variance + self.criterion_ = ( - n_samples * mean_squared_error / (sigma2 + eps64) + K * df - ) # Eqns. 
2.15--16 in (Zou et al, 2007) + n_samples * np.log(2 * np.pi * self.noise_variance_) + + residuals_sum_squares / self.noise_variance_ + + criterion_factor * degrees_of_freedom + ) n_best = np.argmin(self.criterion_) self.alpha_ = alphas_[n_best] self.coef_ = coef_path_[:, n_best] self._set_intercept(Xmean, ymean, Xstd) return self + + def _estimate_noise_variance(self, X, y, positive): + """Compute an estimate of the variance with an OLS model. + + Parameters + ---------- + X : ndarray of shape (n_samples, n_features) + Data to be fitted by the OLS model. We expect the data to be + centered. + + y : ndarray of shape (n_samples,) + Associated target. + + positive : bool, default=False + Restrict coefficients to be >= 0. This should be inline with + the `positive` parameter from `LassoLarsIC`. + + Returns + ------- + noise_variance : float + An estimator of the noise variance of an OLS model. + """ + if X.shape[0] <= X.shape[1] + self.fit_intercept: + raise ValueError( + f"You are using {self.__class__.__name__} in the case where the number " + "of samples is smaller than the number of features. In this setting, " + "getting a good estimate for the variance of the noise is not " + "possible. Provide an estimate of the noise variance in the " + "constructor." + ) + # X and y are already centered and we don't need to fit with an intercept + ols_model = LinearRegression(positive=positive, fit_intercept=False) + y_pred = ols_model.fit(X, y).predict(X) + return np.sum((y - y_pred) ** 2) / ( + X.shape[0] - X.shape[1] - self.fit_intercept + ) diff --git a/sklearn/linear_model/tests/test_least_angle.py b/sklearn/linear_model/tests/test_least_angle.py index 1e4c39cfe254d..0db0a2fbb29ff 100644 --- a/sklearn/linear_model/tests/test_least_angle.py +++ b/sklearn/linear_model/tests/test_least_angle.py @@ -5,6 +5,8 @@ from scipy import linalg from sklearn.base import clone from sklearn.model_selection import train_test_split +from sklearn.pipeline import make_pipeline +from sklearn.preprocessing import StandardScaler from sklearn.utils._testing import assert_allclose from sklearn.utils._testing import assert_array_almost_equal from sklearn.utils._testing import ignore_warnings @@ -898,8 +900,8 @@ def test_copy_X_with_auto_gram(): def test_lars_dtype_match(LARS, has_coef_path, args, dtype): # The test ensures that the fit method preserves input dtype rng = np.random.RandomState(0) - X = rng.rand(6, 6).astype(dtype) - y = rng.rand(6).astype(dtype) + X = rng.rand(20, 6).astype(dtype) + y = rng.rand(20).astype(dtype) model = LARS(**args) model.fit(X, y) @@ -928,8 +930,8 @@ def test_lars_numeric_consistency(LARS, has_coef_path, args): atol = 1e-5 rng = np.random.RandomState(0) - X_64 = rng.rand(6, 6) - y_64 = rng.rand(6) + X_64 = rng.rand(10, 6) + y_64 = rng.rand(10) model_64 = LARS(**args).fit(X_64, y_64) model_32 = LARS(**args).fit(X_64.astype(np.float32), y_64.astype(np.float32)) @@ -938,3 +940,44 @@ def test_lars_numeric_consistency(LARS, has_coef_path, args): if has_coef_path: assert_allclose(model_64.coef_path_, model_32.coef_path_, rtol=rtol, atol=atol) assert_allclose(model_64.intercept_, model_32.intercept_, rtol=rtol, atol=atol) + + +@pytest.mark.parametrize("criterion", ["aic", "bic"]) +def test_lassolarsic_alpha_selection(criterion): + """Check that we properly compute the AIC and BIC score. + + In this test, we reproduce the example of the Fig. 2 of Zou et al. + (reference [1] in LassoLarsIC) In this example, only 7 features should be + selected. 
+ """ + model = make_pipeline( + StandardScaler(), LassoLarsIC(criterion=criterion, normalize=False) + ) + model.fit(X, y) + + best_alpha_selected = np.argmin(model[-1].criterion_) + assert best_alpha_selected == 7 + + +@pytest.mark.parametrize("fit_intercept", [True, False]) +def test_lassolarsic_noise_variance(fit_intercept): + """Check the behaviour when `n_samples` < `n_features` and that one needs + to provide the noise variance.""" + rng = np.random.RandomState(0) + X, y = datasets.make_regression( + n_samples=10, n_features=11 - fit_intercept, random_state=rng + ) + + model = make_pipeline( + StandardScaler(), LassoLarsIC(fit_intercept=fit_intercept, normalize=False) + ) + + err_msg = ( + "You are using LassoLarsIC in the case where the number of samples is smaller" + " than the number of features" + ) + with pytest.raises(ValueError, match=err_msg): + model.fit(X, y) + + model.set_params(lassolarsic__noise_variance=1.0) + model.fit(X, y).predict(X) diff --git a/sklearn/mixture/_gaussian_mixture.py b/sklearn/mixture/_gaussian_mixture.py index 995366b247778..42b76e05de6ae 100644 --- a/sklearn/mixture/_gaussian_mixture.py +++ b/sklearn/mixture/_gaussian_mixture.py @@ -813,6 +813,9 @@ def _n_parameters(self): def bic(self, X): """Bayesian information criterion for the current model on the input X. + You can refer to this :ref:`mathematical section ` for more + details regarding the formulation of the BIC used. + Parameters ---------- X : array of shape (n_samples, n_dimensions) @@ -830,6 +833,9 @@ def bic(self, X): def aic(self, X): """Akaike information criterion for the current model on the input X. + You can refer to this :ref:`mathematical section ` for more + details regarding the formulation of the AIC used. + Parameters ---------- X : array of shape (n_samples, n_dimensions) diff --git a/sklearn/utils/estimator_checks.py b/sklearn/utils/estimator_checks.py index 074c9e38b77a2..37ab3dc86d642 100644 --- a/sklearn/utils/estimator_checks.py +++ b/sklearn/utils/estimator_checks.py @@ -659,6 +659,11 @@ def _set_checking_parameters(estimator): # This is ugly :-/ estimator.n_components = 1 + if name == "LassoLarsIC": + # Noise variance estimation does not work when `n_samples < n_features`. + # We need to provide the noise variance explicitly. + estimator.set_params(noise_variance=1.0) + if hasattr(estimator, "n_clusters"): estimator.n_clusters = min(estimator.n_clusters, 2)
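
As a closing illustration (not part of the patch itself), here is a minimal
sketch of how the new `noise_variance` parameter is meant to be used once this
change lands, assuming the error-raising behaviour described in the changelog
and in `_estimate_noise_variance` above::

    import numpy as np
    from sklearn.linear_model import LassoLarsIC

    rng = np.random.RandomState(0)
    # More features than samples: the OLS-based noise estimate is unavailable.
    X = rng.randn(10, 20)
    y = rng.randn(10)

    try:
        # Expected to raise a ValueError since `noise_variance` is not given.
        LassoLarsIC(normalize=False).fit(X, y)
    except ValueError as exc:
        print(exc)

    # Passing an explicit noise variance makes the criterion computable.
    model = LassoLarsIC(noise_variance=1.0, normalize=False).fit(X, y)
    print(model.alpha_, model.criterion_.shape)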