[MRG] DOC User guide section for histogram-based GBDTs #14525
@@ -440,59 +440,37 @@ Gradient Tree Boosting
======================

`Gradient Tree Boosting <https://en.wikipedia.org/wiki/Gradient_boosting>`_
or Gradient Boosted Decision Trees (GBDT) is a generalization of boosting
to arbitrary differentiable loss functions. GBDT is an accurate and
effective off-the-shelf procedure that can be used for both regression and
classification problems in a variety of areas including Web search ranking
and ecology.

The module :mod:`sklearn.ensemble` provides methods for both classification
and regression via gradient boosted decision trees.

.. note::

  Scikit-learn 0.21 introduces two new experimental implementations of
  gradient boosting trees, namely :class:`HistGradientBoostingClassifier`
  and :class:`HistGradientBoostingRegressor`, inspired by
  `LightGBM <https://github.com/Microsoft/LightGBM>`_.

  These histogram-based estimators can be **orders of magnitude faster**
  than :class:`GradientBoostingClassifier` and
  :class:`GradientBoostingRegressor` when the number of samples is larger
  than tens of thousands.

  They also have built-in support for missing values, which avoids the need
  for an imputer.

  These estimators are described in more detail below in
  :ref:`histogram_based_gradient_boosting`.

  The following guide focuses on :class:`GradientBoostingClassifier` and
  :class:`GradientBoostingRegressor`, which might be preferred for small
  sample sizes since binning may lead to split points that are too
  approximate in this setting.

@@ -526,7 +504,8 @@ The number of weak learners (i.e. regression trees) is controlled by the parameter
thus, the total number of induced trees equals
``n_classes * n_estimators``. For datasets with a large number
of classes we strongly recommend using
:class:`HistGradientBoostingClassifier` as an alternative to
:class:`GradientBoostingClassifier`.

Regression
----------

@@ -838,7 +817,144 @@ accessed via the ``feature_importances_`` property::

* :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_regression.py`

.. _histogram_based_gradient_boosting:

Histogram-Based Gradient Boosting
=================================

Scikit-learn 0.21 introduces two new experimental implementations of
gradient boosting trees, namely :class:`HistGradientBoostingClassifier`
and :class:`HistGradientBoostingRegressor`, inspired by
`LightGBM <https://github.com/Microsoft/LightGBM>`_.

These histogram-based estimators can be **orders of magnitude faster**
than :class:`GradientBoostingClassifier` and
:class:`GradientBoostingRegressor` when the number of samples is larger
than tens of thousands.

They also have built-in support for missing values, which avoids the need
for an imputer, as the short sketch below illustrates.
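
As a minimal sketch (toy data; ``np.nan`` encodes the missing entry), such
samples can be passed directly to ``fit`` and ``predict`` without any
imputation step::

    >>> import numpy as np
    >>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
    >>> from sklearn.ensemble import HistGradientBoostingClassifier

    >>> X = np.array([0, 1, 2, 3, np.nan]).reshape(-1, 1)  # one NaN entry
    >>> y = [0, 0, 0, 1, 1]
    >>> clf = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, y)
    >>> pred = clf.predict(X)  # the NaN sample is routed like any other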

These fast estimators first bin the input samples ``X`` into
integer-valued bins (typically 256 bins), which tremendously reduces the
number of splitting points to consider and allows the algorithm to
leverage integer-based data structures (histograms) instead of relying on
sorted continuous values when building the trees. The API of these
estimators is slightly different, and some of the features from
:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`
are not yet supported: in particular sample weights and some loss
functions.

These estimators are still **experimental**: their predictions
and their API might change without any deprecation cycle. To use them, you
need to explicitly import ``enable_hist_gradient_boosting``::

    >>> # explicitly require this experimental feature
    >>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
    >>> # now you can import normally from ensemble
    >>> from sklearn.ensemble import HistGradientBoostingClassifier

.. topic:: Examples:

  * :ref:`sphx_glr_auto_examples_inspection_plot_partial_dependence.py`

Usage
-----

Most of the parameters are unchanged from
:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`.
One exception is the ``max_iter`` parameter, which replaces ``n_estimators``
and controls the number of iterations of the boosting process:

>>> from sklearn.experimental import enable_hist_gradient_boosting
>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> from sklearn.datasets import make_hastie_10_2

>>> X, y = make_hastie_10_2(random_state=0)
>>> X_train, X_test = X[:2000], X[2000:]
>>> y_train, y_test = y[:2000], y[2000:]
>>> clf = HistGradientBoostingClassifier(max_iter=100).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.8998

The size of the trees can be controlled through the ``max_leaf_nodes``,
``max_depth``, and ``min_samples_leaf`` parameters.

The number of bins used to bin the data is controlled with the ``max_bins``
parameter. Using fewer bins acts as a form of regularization. It is
generally recommended to use as many bins as possible, which is the default.
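
For instance, a sketch combining these knobs (the values below are
illustrative, not tuned recommendations)::

    >>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
    >>> from sklearn.ensemble import HistGradientBoostingRegressor
    >>> from sklearn.datasets import make_regression

    >>> X, y = make_regression(n_samples=500, random_state=0)
    >>> reg = HistGradientBoostingRegressor(
    ...     max_leaf_nodes=15,    # at most 15 leaves per tree
    ...     min_samples_leaf=20,  # each leaf must cover at least 20 samples
    ...     max_bins=128,         # coarser binning than the default
    ... ).fit(X, y)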

The ``l2_regularization`` parameter is a regularizer on the loss function and
corresponds to :math:`\lambda` in equation (2) of [XGBoost]_.

Note that **early-stopping is enabled by default**. The early-stopping
behaviour is controlled via the ``scoring``, ``validation_fraction``,
``n_iter_no_change``, and ``tol`` parameters. It is possible to early-stop
using an arbitrary :term:`scorer`, or just the training or validation loss.
By default, early-stopping is performed using the default :term:`scorer` of
the estimator on a validation set, as sketched below.
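
A minimal sketch of configuring early stopping on the loss rather than a
held-out scorer (parameter values are illustrative only)::

    >>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
    >>> from sklearn.ensemble import HistGradientBoostingClassifier
    >>> from sklearn.datasets import make_hastie_10_2

    >>> X, y = make_hastie_10_2(random_state=0)
    >>> clf = HistGradientBoostingClassifier(
    ...     scoring='loss',           # early-stop on the loss itself
    ...     validation_fraction=0.1,  # hold out 10% of the data
    ...     n_iter_no_change=5,       # stop after 5 iterations w/o improvement
    ...     tol=1e-7,
    ... ).fit(X, y)
    >>> n_used = clf.n_iter_  # iterations actually run (at most max_iter)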

Low-level parallelism
---------------------

:class:`HistGradientBoostingClassifier` and
:class:`HistGradientBoostingRegressor` use OpenMP for parallelization
through Cython. The number of threads used can be changed with the
``OMP_NUM_THREADS`` environment variable. By default, all available cores
are used. Please refer to the OpenMP documentation for details.
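
For example, to restrict training to four threads (``my_script.py`` is a
placeholder for your own training script)::

    OMP_NUM_THREADS=4 python my_script.py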

The following parts are parallelized:

- mapping samples from real values to integer-valued bins (finding the bin
  thresholds is however sequential)
- building histograms is parallelized over features
- finding the best split point at a node is parallelized over features
- during fit, mapping samples into the left and right children is
  parallelized over samples
- gradient and hessian computations are parallelized over samples
- predicting is parallelized over samples

Why it's faster
---------------

The bottleneck of a gradient boosting procedure is building the decision
trees. Building a traditional decision tree (as in the other GBDTs
:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`)
requires sorting the samples at each node (for each feature). Sorting is
needed so that the potential gain of a split point can be computed
efficiently. Splitting a single node thus has a complexity of
:math:`\mathcal{O}(n_\text{features} \times n \log(n))` where :math:`n` is
the number of samples at the node.

:class:`HistGradientBoostingClassifier` and
:class:`HistGradientBoostingRegressor`, in contrast, do not require sorting
the feature values and instead use a data structure called a histogram,
where the samples are implicitly ordered. Building a histogram has a
:math:`\mathcal{O}(n)` complexity, so the node splitting procedure has a
:math:`\mathcal{O}(n_\text{features} \times n)` complexity, much smaller
than the previous one. In addition, instead of considering :math:`n` split
points, we consider only ``max_bins`` split points, which is much smaller.
A toy sketch of this idea follows.
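
The sketch below is illustrative only, not scikit-learn's actual
implementation: once a feature is binned, the per-bin gradient statistics
needed to evaluate every candidate split can be accumulated in a single
:math:`\mathcal{O}(n)` pass::

    >>> import numpy as np

    >>> rng = np.random.RandomState(0)
    >>> n_samples, max_bins = 1000, 256
    >>> binned_feature = rng.randint(0, max_bins, size=n_samples)  # pre-binned
    >>> gradients = rng.normal(size=n_samples)

    >>> # O(n): sum of gradients falling into each bin
    >>> hist = np.bincount(binned_feature, weights=gradients,
    ...                    minlength=max_bins)
    >>> # candidate splits sit at bin boundaries (max_bins - 1 of them), each
    >>> # evaluated from cumulative sums rather than from sorted values
    >>> left_grad_sums = np.cumsum(hist)[:-1]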

In order to build histograms, the input data ``X`` needs to be binned into
integer-valued bins. This binning procedure does require sorting the feature
values, but it only happens once at the very beginning of the boosting
process (not at each node, as in :class:`GradientBoostingClassifier` and
:class:`GradientBoostingRegressor`), as sketched below.
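
A toy sketch of such a one-time binning pass (illustrative only; the actual
binning strategy may differ), mapping one continuous feature to at most
``max_bins`` integer values via quantile thresholds::

    >>> import numpy as np

    >>> rng = np.random.RandomState(0)
    >>> X_col = rng.normal(size=1000)  # one continuous feature
    >>> max_bins = 256

    >>> # one-time sorting-based step: thresholds from the quantiles
    >>> quantiles = np.linspace(0, 100, max_bins + 1)[1:-1]
    >>> thresholds = np.percentile(X_col, quantiles)

    >>> # afterwards, mapping values to integer bins needs no sorting
    >>> binned = np.searchsorted(thresholds, X_col).astype(np.uint8)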

Finally, many parts of the implementation of
:class:`HistGradientBoostingClassifier` and
:class:`HistGradientBoostingRegressor` are parallelized.

.. topic:: References

  .. [XGBoost] Tianqi Chen, Carlos Guestrin, "XGBoost: A Scalable Tree
     Boosting System". https://arxiv.org/abs/1603.02754

  .. [LightGBM] Ke et al. "LightGBM: A Highly Efficient Gradient Boosting
     Decision Tree"

.. _voting_classifier:

Voting Classifier
=================