Commit 016670e

ArturoAmorQ, ogrisel, and lorentzenchr authored

DOC Improve description of l2_regularization for hgbt models (#28652)

Co-authored-by: ArturoAmorQ <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Christian Lorentzen <[email protected]>
1 parent 2f5361a commit 016670e

3 files changed: +55 -7 lines changed

doc/modules/ensemble.rst

Lines changed: 49 additions & 4 deletions
@@ -115,11 +115,54 @@ The size of the trees can be controlled through the ``max_leaf_nodes``,
 ``max_depth``, and ``min_samples_leaf`` parameters.
 
 The number of bins used to bin the data is controlled with the ``max_bins``
-parameter. Using less bins acts as a form of regularization. It is
-generally recommended to use as many bins as possible (256), which is the default.
+parameter. Using fewer bins acts as a form of regularization. It is generally
+recommended to use as many bins as possible (255), which is the default.
 
-The ``l2_regularization`` parameter is a regularizer on the loss function and
-corresponds to :math:`\lambda` in equation (2) of [XGBoost]_.
+The ``l2_regularization`` parameter acts as a regularizer for the loss function,
+and corresponds to :math:`\lambda` in the following expression (see equation (2)
+in [XGBoost]_):
+
+.. math::
+
+    \mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \frac12 \sum_k \lambda ||w_k||^2
+
+|details-start|
+**Details on l2 regularization**:
+|details-split|
+
+It is important to notice that the loss term :math:`l(\hat{y}_i, y_i)` describes
+only half of the actual loss function, except for the pinball loss and absolute
+error.
+
+The index :math:`k` refers to the k-th tree in the ensemble of trees. In the
+case of regression and binary classification, gradient boosting models grow one
+tree per iteration, so :math:`k` runs up to `max_iter`. In the case of
+multiclass classification problems, the maximal value of the index :math:`k` is
+`n_classes` :math:`\times` `max_iter`.
+
+If :math:`T_k` denotes the number of leaves in the k-th tree, then :math:`w_k`
+is a vector of length :math:`T_k`, which contains the leaf values of the form `w
+= -sum_gradient / (sum_hessian + l2_regularization)` (see equation (5) in
+[XGBoost]_).
+
+The leaf values :math:`w_k` are derived by dividing the sum of the gradients of
+the loss function by the combined sum of hessians. Adding the regularization to
+the denominator penalizes the leaves with small hessians (flat regions),
+resulting in smaller updates. Those :math:`w_k` values then contribute to the
+model's prediction for a given input that ends up in the corresponding leaf. The
+final prediction is the sum of the base prediction and the contributions from
+each tree. The result of that sum is then transformed by the inverse link
+function, depending on the choice of the loss function (see
+:ref:`gradient_boosting_formulation`).
+
+Notice that the original paper [XGBoost]_ introduces a term :math:`\gamma\sum_k
+T_k` that penalizes the number of leaves (making it a smooth version of
+`max_leaf_nodes`); it is not presented here, as it is not implemented in
+scikit-learn. In contrast, :math:`\lambda` penalizes the magnitude of the
+individual tree predictions before they are rescaled by the learning rate; see
+:ref:`gradient_boosting_shrinkage`.
+
+|details-end|
 
 Note that **early-stopping is enabled by default if the number of samples is
 larger than 10,000**. The early-stopping behaviour is controlled via the
@@ -594,6 +637,8 @@ The parameter ``max_leaf_nodes`` corresponds to the variable ``J`` in the
 chapter on gradient boosting in [Friedman2001]_ and is related to the parameter
 ``interaction.depth`` in R's gbm package where ``max_leaf_nodes == interaction.depth + 1``.
 
+.. _gradient_boosting_formulation:
+
 Mathematical formulation
 ^^^^^^^^^^^^^^^^^^^^^^^^
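
To make the leaf-value formula above concrete, here is a minimal Python sketch
of equation (5) (not scikit-learn's actual implementation; the gradient and
hessian sums are made-up toy values), showing that the same
``l2_regularization`` shrinks a leaf most strongly where the summed hessians
are small:

    def leaf_value(sum_gradient, sum_hessian, l2_regularization):
        # Same form as equation (5) in [XGBoost]_:
        #   w = -sum_gradient / (sum_hessian + l2_regularization)
        return -sum_gradient / (sum_hessian + l2_regularization)

    sum_gradient = -4.0
    for sum_hessian in (0.5, 10.0):  # flat region vs. strongly curved region
        for lam in (0.0, 1.0):       # no regularization vs. lambda = 1
            w = leaf_value(sum_gradient, sum_hessian, lam)
            print(f"sum_hessian={sum_hessian}, lambda={lam}: w={w:.3f}")

    # With sum_hessian=0.5, lambda=1 shrinks w from 8.0 to 2.667 (a factor
    # of 3), while with sum_hessian=10.0 it only shrinks w from 0.4 to 0.364.

This is the sense in which the updated docstrings below say the parameter
"penalizes leaves with small hessians".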

sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py

Lines changed: 4 additions & 2 deletions
@@ -1483,7 +1483,8 @@ class HistGradientBoostingRegressor(RegressorMixin, BaseHistGradientBoosting):
         than a few hundred samples, it is recommended to lower this value
         since only very shallow trees would be built.
     l2_regularization : float, default=0
-        The L2 regularization parameter. Use ``0`` for no regularization (default).
+        The L2 regularization parameter penalizing leaves with small hessians.
+        Use ``0`` for no regularization (default).
     max_features : float, default=1.0
         Proportion of randomly chosen features in each and every node split.
         This is a form of regularization, smaller values make the trees weaker
@@ -1859,7 +1860,8 @@ class HistGradientBoostingClassifier(ClassifierMixin, BaseHistGradientBoosting):
         than a few hundred samples, it is recommended to lower this value
         since only very shallow trees would be built.
     l2_regularization : float, default=0
-        The L2 regularization parameter. Use ``0`` for no regularization (default).
+        The L2 regularization parameter penalizing leaves with small hessians.
+        Use ``0`` for no regularization (default).
     max_features : float, default=1.0
         Proportion of randomly chosen features in each and every node split.
         This is a form of regularization, smaller values make the trees weaker
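
As a quick usage sketch of the public parameter documented above (the synthetic
dataset and the candidate values are arbitrary, chosen only for illustration),
one can compare the default of no regularization against a mild penalty:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=1_000, noise=10.0, random_state=0)

    # l2_regularization=0 (the default) disables the penalty entirely.
    for l2 in (0.0, 1.0):
        model = HistGradientBoostingRegressor(
            l2_regularization=l2, random_state=0
        )
        score = cross_val_score(model, X, y, cv=5).mean()
        print(f"l2_regularization={l2}: mean R^2 = {score:.3f}")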

sklearn/ensemble/_hist_gradient_boosting/grower.py

Lines changed: 2 additions & 1 deletion
@@ -201,7 +201,8 @@ class TreeGrower:
     interaction_cst : list of sets of integers, default=None
         List of interaction constraints.
     l2_regularization : float, default=0.
-        The L2 regularization parameter.
+        The L2 regularization parameter penalizing leaves with small hessians.
+        Use ``0`` for no regularization (default).
     feature_fraction_per_split : float, default=1
         Proportion of randomly chosen features in each and every node split.
         This is a form of regularization, smaller values make the trees weaker
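
Finally, tying these docstrings back to the ``.. math::`` block added in
``doc/modules/ensemble.rst``, here is a toy evaluation of the penalized
objective :math:`\mathcal{L}(\phi)` (the per-sample losses and per-tree leaf
vectors are invented, purely to show where ``l2_regularization`` enters the
sum):

    import numpy as np

    # Hypothetical per-sample losses l(y_hat_i, y_i) and leaf-value vectors w_k.
    per_sample_losses = np.array([0.20, 0.05, 0.11])
    leaf_values_per_tree = [np.array([0.5, -0.3]), np.array([0.1, 0.4, -0.2])]
    lam = 1.0  # plays the role of l2_regularization

    # L(phi) = sum_i l(y_hat_i, y_i) + 1/2 * sum_k lam * ||w_k||^2
    penalty = 0.5 * lam * sum(np.sum(w**2) for w in leaf_values_per_tree)
    objective = per_sample_losses.sum() + penalty
    print(f"L(phi) = {objective:.3f}")  # 0.36 + 0.5 * (0.34 + 0.21) = 0.635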
