@@ -115,11 +115,54 @@ The size of the trees can be controlled through the ``max_leaf_nodes``,
``max_depth``, and ``min_samples_leaf`` parameters.
The number of bins used to bin the data is controlled with the ``max_bins``
- parameter. Using less bins acts as a form of regularization. It is
- generally recommended to use as many bins as possible (256), which is the default.
+ parameter. Using fewer bins acts as a form of regularization. It is generally
+ recommended to use as many bins as possible (255), which is the default.
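The following is a minimal illustrative sketch (not part of the patch itself) of how a
lower ``max_bins`` coarsens the feature discretization and acts as a mild regularizer;
the synthetic dataset is only for demonstration::

    # Compare cross-validated scores for different numbers of bins.
    from sklearn.datasets import make_regression
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

    for max_bins in (255, 64, 16):
        est = HistGradientBoostingRegressor(max_bins=max_bins, random_state=0)
        score = cross_val_score(est, X, y, cv=5).mean()
        print(f"max_bins={max_bins:3d}  mean R^2={score:.3f}")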
- The ``l2_regularization`` parameter is a regularizer on the loss function and
- corresponds to :math:`\lambda` in equation (2) of [XGBoost]_.
+ The ``l2_regularization`` parameter acts as a regularizer for the loss function,
+ and corresponds to :math:`\lambda` in the following expression (see equation (2)
+ in [XGBoost]_):
+
+ .. math::
+
+     \mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \frac12 \sum_k \lambda ||w_k||^2
+
+ |details-start|
+ **Details on l2 regularization**:
+ |details-split|
+
+ It is important to notice that the loss term :math:`l(\hat{y}_i, y_i)` describes
+ only half of the actual loss function except for the pinball loss and absolute
+ error.
+
+ The index :math:`k` refers to the k-th tree in the ensemble of trees. In the
+ case of regression and binary classification, gradient boosting models grow one
+ tree per iteration, so :math:`k` runs up to `max_iter`. In the case of
+ multiclass classification problems, the maximal value of the index :math:`k` is
+ `n_classes` :math:`\times` `max_iter`.
+
+ If :math:`T_k` denotes the number of leaves in the k-th tree, then :math:`w_k`
+ is a vector of length :math:`T_k`, which contains the leaf values of the form
+ `w = -sum_gradient / (sum_hessian + l2_regularization)` (see equation (5) in
+ [XGBoost]_).
+
+ The leaf values :math:`w_k` are derived by dividing the sum of the gradients of
+ the loss function by the sum of the hessians. Adding the regularization to the
+ denominator penalizes the leaves with small hessians (flat regions), resulting
+ in smaller updates. Those :math:`w_k` values then contribute to the model's
+ prediction for a given input that ends up in the corresponding leaf. The final
+ prediction is the sum of the base prediction and the contributions from each
+ tree. The result of that sum is then transformed by the inverse link function,
+ which depends on the choice of the loss function (see
+ :ref:`gradient_boosting_formulation`).
+
+ Notice that the original paper [XGBoost]_ introduces a term
+ :math:`\gamma \sum_k T_k` that penalizes the number of leaves (making it a
+ smooth version of `max_leaf_nodes`); it is not presented here, as it is not
+ implemented in scikit-learn. In contrast, :math:`\lambda` penalizes the
+ magnitude of the individual tree predictions before they are rescaled by the
+ learning rate (see :ref:`gradient_boosting_shrinkage`).
+
+ |details-end|
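As a small numeric sketch of the leaf value formula quoted above (the gradient and
hessian sums are made up for illustration), a larger ``l2_regularization`` shrinks the
leaf values, and the effect is strongest when the sum of hessians is small::

    # Made-up sums of gradients and hessians for the samples in a single leaf.
    sum_gradient = -4.0
    sum_hessian = 2.0

    for l2_regularization in (0.0, 1.0, 10.0):
        # Leaf value: w = -sum_gradient / (sum_hessian + l2_regularization)
        w = -sum_gradient / (sum_hessian + l2_regularization)
        print(f"l2_regularization={l2_regularization:5.1f}  leaf value w={w:.3f}")
    # Larger l2_regularization -> smaller |w|, i.e. more conservative updates.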
Note that **early-stopping is enabled by default if the number of samples is
larger than 10,000**. The early-stopping behaviour is controlled via the
@@ -594,6 +637,8 @@ The parameter ``max_leaf_nodes`` corresponds to the variable ``J`` in the
chapter on gradient boosting in [Friedman2001]_ and is related to the parameter
``interaction.depth`` in R's gbm package where ``max_leaf_nodes == interaction.depth + 1``.
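For example, under this correspondence a gbm model fit with ``interaction.depth = 5``
matches ``max_leaf_nodes = 6`` in scikit-learn (a tree with 6 leaves is built from 5
splits).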
+ .. _gradient_boosting_formulation:
+
Mathematical formulation
^^^^^^^^^^^^^^^^^^^^^^^^