@@ -115,11 +115,54 @@ The size of the trees can be controlled through the ``max_leaf_nodes``,
``max_depth``, and ``min_samples_leaf`` parameters.
The number of bins used to bin the data is controlled with the ``max_bins``
- parameter. Using less bins acts as a form of regularization. It is
- generally recommended to use as many bins as possible (256), which is the default.
+ parameter. Using fewer bins acts as a form of regularization. It is generally
+ recommended to use as many bins as possible (255), which is the default.
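The following is a minimal illustrative sketch (not part of the patch itself) of how a
lower ``max_bins`` coarsens the feature discretization and acts as a mild regularizer;
the synthetic dataset is only for demonstration::

    # Compare cross-validated scores for different numbers of bins.
    from sklearn.datasets import make_regression
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

    for max_bins in (255, 64, 16):
        est = HistGradientBoostingRegressor(max_bins=max_bins, random_state=0)
        score = cross_val_score(est, X, y, cv=5).mean()
        print(f"max_bins={max_bins:3d}  mean R^2={score:.3f}")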
- The ``l2_regularization`` parameter is a regularizer on the loss function and
- corresponds to :math:`\lambda` in equation (2) of [XGBoost]_.
+ The ``l2_regularization`` parameter acts as a regularizer for the loss function,
+ and corresponds to :math:`\lambda` in the following expression (see equation (2)
+ in [XGBoost]_):
+
+ .. math::
+
+     \mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \frac12 \sum_k \lambda ||w_k||^2
+
+ |details-start|
+ **Details on l2 regularization**:
+ |details-split|
+
+ It is important to notice that the loss term :math:`l(\hat{y}_i, y_i)` describes
+ only half of the actual loss function except for the pinball loss and absolute
+ error.
+
+ The index :math:`k` refers to the k-th tree in the ensemble of trees. In the
+ case of regression and binary classification, gradient boosting models grow one
+ tree per iteration, so :math:`k` runs up to `max_iter`. In the case of
+ multiclass classification problems, the maximal value of the index :math:`k` is
+ `n_classes` :math:`\times` `max_iter`.
+
+ If :math:`T_k` denotes the number of leaves in the k-th tree, then :math:`w_k`
+ is a vector of length :math:`T_k`, which contains the leaf values of the form
+ `w = -sum_gradient / (sum_hessian + l2_regularization)` (see equation (5) in
+ [XGBoost]_).
+
+ The leaf values :math:`w_k` are derived by dividing the sum of the gradients of
+ the loss function by the sum of the hessians. Adding the regularization to the
+ denominator penalizes the leaves with small hessians (flat regions), resulting
+ in smaller updates. Those :math:`w_k` values then contribute to the model's
+ prediction for a given input that ends up in the corresponding leaf. The final
+ prediction is the sum of the base prediction and the contributions from each
+ tree. The result of that sum is then transformed by the inverse link function,
+ which depends on the choice of the loss function (see
+ :ref:`gradient_boosting_formulation`).
+
+ Notice that the original paper [XGBoost]_ introduces a term
+ :math:`\gamma \sum_k T_k` that penalizes the number of leaves (making it a
+ smooth version of `max_leaf_nodes`); it is not presented here, as it is not
+ implemented in scikit-learn. In contrast, :math:`\lambda` penalizes the
+ magnitude of the individual tree predictions before they are rescaled by the
+ learning rate (see :ref:`gradient_boosting_shrinkage`).
+
+ |details-end|
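As a small numeric sketch of the leaf value formula quoted above (the gradient and
hessian sums are made up for illustration), a larger ``l2_regularization`` shrinks the
leaf values, and the effect is strongest when the sum of hessians is small::

    # Made-up sums of gradients and hessians for the samples in a single leaf.
    sum_gradient = -4.0
    sum_hessian = 2.0

    for l2_regularization in (0.0, 1.0, 10.0):
        # Leaf value: w = -sum_gradient / (sum_hessian + l2_regularization)
        w = -sum_gradient / (sum_hessian + l2_regularization)
        print(f"l2_regularization={l2_regularization:5.1f}  leaf value w={w:.3f}")
    # Larger l2_regularization -> smaller |w|, i.e. more conservative updates.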
Note that **early-stopping is enabled by default if the number of samples is
larger than 10,000**. The early-stopping behaviour is controlled via the
@@ -594,6 +637,8 @@ The parameter ``max_leaf_nodes`` corresponds to the variable ``J`` in the
chapter on gradient boosting in [Friedman2001]_ and is related to the parameter
``interaction.depth`` in R's gbm package where ``max_leaf_nodes == interaction.depth + 1``.
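For example, under this correspondence a gbm model fit with ``interaction.depth = 5``
matches ``max_leaf_nodes = 6`` in scikit-learn (a tree with 6 leaves is built from 5
splits).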
+ .. _gradient_boosting_formulation:
+
Mathematical formulation
^^^^^^^^^^^^^^^^^^^^^^^^