diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst
index 42ba4c5438675..e1bcf47b8ff7b 100644
--- a/doc/modules/ensemble.rst
+++ b/doc/modules/ensemble.rst
@@ -440,59 +440,37 @@ Gradient Tree Boosting
 ======================
 
 `Gradient Tree Boosting <https://en.wikipedia.org/wiki/Gradient_boosting>`_
-or Gradient Boosted Regression Trees (GBRT) is a generalization
+or Gradient Boosted Decision Trees (GBDT) is a generalization
 of boosting to arbitrary
-differentiable loss functions. GBRT is an accurate and effective
+differentiable loss functions. GBDT is an accurate and effective
 off-the-shelf procedure that can be used for both regression and
-classification problems. Gradient Tree Boosting models are used in a
+classification problems in a
 variety of areas including Web search ranking and ecology.
 
-The advantages of GBRT are:
-
-  + Natural handling of data of mixed type (= heterogeneous features)
-
-  + Predictive power
-
-  + Robustness to outliers in output space (via robust loss functions)
-
-The disadvantages of GBRT are:
-
-  + Scalability, due to the sequential nature of boosting it can
-    hardly be parallelized.
-
 The module :mod:`sklearn.ensemble` provides methods
-for both classification and regression via gradient boosted regression
+for both classification and regression via gradient boosted decision
 trees.
-
 .. note::
 
-  Scikit-learn 0.21 introduces two new experimental implementation of
+  Scikit-learn 0.21 introduces two new experimental implementations of
   gradient boosting trees, namely :class:`HistGradientBoostingClassifier`
   and :class:`HistGradientBoostingRegressor`, inspired by
-  `LightGBM <https://github.com/Microsoft/LightGBM>`_. These fast estimators
-  first bin the input samples ``X`` into integer-valued bins (typically 256
-  bins) which tremendously reduces the number of splitting points to
-  consider, and allow the algorithm to leverage integer-based data
-  structures (histograms) instead of relying on sorted continuous values.
-
-  The new histogram-based estimators can be orders of magnitude faster than
-  their continuous counterparts when the number of samples is larger than
-  tens of thousands of samples. The API of these new estimators is slightly
-  different, and some of the features from :class:`GradientBoostingClassifier`
-  and :class:`GradientBoostingRegressor` are not yet supported.
-
-  These new estimators are still **experimental** for now: their predictions
-  and their API might change without any deprecation cycle. To use them, you
-  need to explicitly import ``enable_hist_gradient_boosting``::
-
-    >>> # explicitly require this experimental feature
-    >>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
-    >>> # now you can import normally from ensemble
-    >>> from sklearn.ensemble import HistGradientBoostingClassifier
+  `LightGBM <https://github.com/Microsoft/LightGBM>`_.
+
+  These histogram-based estimators can be **orders of magnitude faster**
+  than :class:`GradientBoostingClassifier` and
+  :class:`GradientBoostingRegressor` when the number of samples is larger
+  than tens of thousands.
+
+  They also have built-in support for missing values, which avoids the need
+  for an imputer (see the short example at the end of this note).
+
+  These estimators are described in more detail below in
+  :ref:`histogram_based_gradient_boosting`.
 
   The following guide focuses on :class:`GradientBoostingClassifier` and
-  :class:`GradientBoostingRegressor` only, which might be preferred for small
+  :class:`GradientBoostingRegressor`, which might be preferred for small
   sample sizes since binning may lead to split points that are too
   approximate in this setting.
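+
+  As a quick illustration of the missing-value support mentioned above, here
+  is a minimal sketch on made-up toy data (the values are for illustration
+  only, not taken from the examples gallery)::
+
+    >>> # explicitly require this experimental feature
+    >>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
+    >>> from sklearn.ensemble import HistGradientBoostingClassifier
+    >>> import numpy as np
+    >>> X = np.array([0, 1, 2, np.nan]).reshape(-1, 1)  # one feature, one missing entry
+    >>> y = [0, 0, 1, 1]
+    >>> # the NaN value is handled natively, no imputation step is needed
+    >>> gbdt = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, y)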
@@ -526,7 +504,8 @@ The number of weak learners (i.e. regression trees) is controlled by the paramet
    thus, the total number of induced trees equals
    ``n_classes * n_estimators``. For datasets with a large number of classes we
    strongly recommend to use
-   :class:`RandomForestClassifier` as an alternative to :class:`GradientBoostingClassifier` .
+   :class:`HistGradientBoostingClassifier` as an alternative to
+   :class:`GradientBoostingClassifier`.
 
 Regression
 ----------
@@ -838,7 +817,144 @@ accessed via the ``feature_importances_`` property::
 
   * :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_regression.py`
 
-.. _voting_classifier:
+.. _histogram_based_gradient_boosting:
+
+Histogram-Based Gradient Boosting
+=================================
+
+Scikit-learn 0.21 introduces two new experimental implementations of
+gradient boosting trees, namely :class:`HistGradientBoostingClassifier`
+and :class:`HistGradientBoostingRegressor`, inspired by
+`LightGBM <https://github.com/Microsoft/LightGBM>`_.
+
+These histogram-based estimators can be **orders of magnitude faster**
+than :class:`GradientBoostingClassifier` and
+:class:`GradientBoostingRegressor` when the number of samples is larger
+than tens of thousands.
+
+They also have built-in support for missing values, which avoids the need
+for an imputer.
+
+These fast estimators first bin the input samples ``X`` into
+integer-valued bins (typically 256 bins), which tremendously reduces the
+number of splitting points to consider and allows the algorithm to
+leverage integer-based data structures (histograms) instead of relying on
+sorted continuous values when building the trees. The API of these
+estimators is slightly different, and some of the features from
+:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`
+are not yet supported, in particular sample weights and some loss
+functions.
+
+These estimators are still **experimental**: their predictions
+and their API might change without any deprecation cycle. To use them, you
+need to explicitly import ``enable_hist_gradient_boosting``::
+
+  >>> # explicitly require this experimental feature
+  >>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
+  >>> # now you can import normally from ensemble
+  >>> from sklearn.ensemble import HistGradientBoostingClassifier
+
+.. topic:: Examples:
+
+  * :ref:`sphx_glr_auto_examples_inspection_plot_partial_dependence.py`
+
+Usage
+-----
+
+Most of the parameters are unchanged from
+:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`.
+One exception is the ``max_iter`` parameter, which replaces ``n_estimators`` and
+controls the number of iterations of the boosting process::
+
+  >>> from sklearn.experimental import enable_hist_gradient_boosting
+  >>> from sklearn.ensemble import HistGradientBoostingClassifier
+  >>> from sklearn.datasets import make_hastie_10_2
+
+  >>> X, y = make_hastie_10_2(random_state=0)
+  >>> X_train, X_test = X[:2000], X[2000:]
+  >>> y_train, y_test = y[:2000], y[2000:]
+  >>> clf = HistGradientBoostingClassifier(max_iter=100).fit(X_train, y_train)
+
+  >>> clf.score(X_test, y_test)
+  0.8998
+
+The size of the trees can be controlled through the ``max_leaf_nodes``,
+``max_depth``, and ``min_samples_leaf`` parameters.
+
+The number of bins used to bin the data is controlled with the ``max_bins``
+parameter. Using fewer bins acts as a form of regularization. It is
+generally recommended to use as many bins as possible, which is the default.
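+
+The snippet below is only a sketch of how these parameters combine; the values
+are arbitrary choices for illustration rather than tuning recommendations, and
+``make_regression`` is used merely as a convenient synthetic dataset::
+
+  >>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
+  >>> from sklearn.ensemble import HistGradientBoostingRegressor
+  >>> from sklearn.datasets import make_regression
+
+  >>> X, y = make_regression(n_samples=500, random_state=0)
+  >>> est = HistGradientBoostingRegressor(
+  ...     max_iter=200,          # number of boosting iterations
+  ...     max_leaf_nodes=15,     # at most 15 leaves per tree
+  ...     min_samples_leaf=20,   # each leaf must cover at least 20 samples
+  ...     max_bins=128,          # fewer bins than the default acts as regularization
+  ... ).fit(X, y)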
+
+The ``l2_regularization`` parameter is a regularizer on the loss function and
+corresponds to :math:`\lambda` in equation (2) of [XGBoost]_.
+
+Note that **early-stopping is enabled by default**. The early-stopping
+behaviour is controlled via the ``scoring``, ``validation_fraction``,
+``n_iter_no_change``, and ``tol`` parameters. It is possible to early-stop
+using an arbitrary :term:`scorer`, or just the training or validation loss. By
+default, early-stopping is performed using the default :term:`scorer` of
+the estimator on a validation set.
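+
+As a sketch only (the parameter values below are illustrative, not
+recommendations), the early-stopping behaviour might be configured explicitly
+like this::
+
+  >>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
+  >>> from sklearn.ensemble import HistGradientBoostingClassifier
+  >>> from sklearn.datasets import make_hastie_10_2
+
+  >>> X, y = make_hastie_10_2(random_state=0)
+  >>> clf = HistGradientBoostingClassifier(
+  ...     max_iter=1000,            # upper bound; early stopping may use fewer iterations
+  ...     scoring='accuracy',       # any scorer name accepted by the scoring API
+  ...     validation_fraction=0.1,  # fraction of the training data held out for scoring
+  ...     n_iter_no_change=5,       # stop after 5 iterations without improvement
+  ...     tol=1e-7,                 # minimum improvement counted as progress
+  ... ).fit(X, y)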
+
+Low-level parallelism
+---------------------
+
+:class:`HistGradientBoostingClassifier` and
+:class:`HistGradientBoostingRegressor` have implementations that use OpenMP
+for parallelization through Cython. The number of threads that is used can
+be changed using the ``OMP_NUM_THREADS`` environment variable. By default,
+all available cores are used. Please refer to the OpenMP documentation for
+details.
+
+The following parts are parallelized:
+
+- mapping samples from real values to integer-valued bins (finding the bin
+  thresholds is however sequential)
+- building histograms is parallelized over features
+- finding the best split point at a node is parallelized over features
+- during fit, mapping samples into the left and right children is
+  parallelized over samples
+- gradient and hessian computations are parallelized over samples
+- predicting is parallelized over samples
+
+Why it's faster
+---------------
+
+The bottleneck of a gradient boosting procedure is building the decision
+trees. Building a traditional decision tree (as in the other GBDTs
+:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`)
+requires sorting the samples at each node (for
+each feature). Sorting is needed so that the potential gain of a split point
+can be computed efficiently. Splitting a single node thus has a complexity
+of :math:`\mathcal{O}(n_\text{features} \times n \log(n))` where :math:`n` is the
+number of samples at the node.
+
+:class:`HistGradientBoostingClassifier` and
+:class:`HistGradientBoostingRegressor`, in contrast, do not require sorting the
+feature values and instead use a data structure called a histogram, where the
+samples are implicitly ordered. Building a histogram has a
+:math:`\mathcal{O}(n)` complexity, so the node splitting procedure has a
+:math:`\mathcal{O}(n_\text{features} \times n)` complexity, much smaller than the
+previous one. In addition, instead of considering :math:`n` split points, we
+consider only ``max_bins`` split points here, which is much smaller.
+
+In order to build histograms, the input data ``X`` needs to be binned into
+integer-valued bins. This binning procedure does require sorting the feature
+values, but it only happens once at the very beginning of the boosting process
+(not at each node, like in :class:`GradientBoostingClassifier` and
+:class:`GradientBoostingRegressor`).
+
+Finally, many parts of the implementation of
+:class:`HistGradientBoostingClassifier` and
+:class:`HistGradientBoostingRegressor` are parallelized.
+
+.. topic:: References
+
+  .. [XGBoost] Tianqi Chen, Carlos Guestrin, "XGBoost: A Scalable Tree
+     Boosting System". https://arxiv.org/abs/1603.02754
+  .. [LightGBM] Ke et al. "LightGBM: A Highly Efficient Gradient
+     Boosting Decision Tree"
+
+.. _voting_classifier:
 
 Voting Classifier
 ========================
 
diff --git a/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py b/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py
index dc040ed1fa409..ebd81cb77a2fc 100644
--- a/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py
+++ b/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py
@@ -619,13 +619,7 @@ class HistGradientBoostingRegressor(BaseHistGradientBoosting, RegressorMixin):
     This estimator is much faster than
     :class:`GradientBoostingRegressor`
-    for big datasets (n_samples >= 10 000). The input data ``X`` is pre-binned
-    into integer-valued bins, which considerably reduces the number of
-    splitting points to consider, and allows the algorithm to leverage
-    integer-based data structures. For small sample sizes,
-    :class:`GradientBoostingRegressor`
-    might be preferred since binning may lead to split points that are too
-    approximate in this setting.
+    for big datasets (n_samples >= 10 000).
 
     This implementation is inspired by
     `LightGBM <https://github.com/Microsoft/LightGBM>`_.
@@ -641,6 +635,7 @@ class HistGradientBoostingRegressor(BaseHistGradientBoosting, RegressorMixin):
     >>> # now you can import normally from ensemble
     >>> from sklearn.ensemble import HistGradientBoostingClassifier
+
+    Read more in the :ref:`User Guide <histogram_based_gradient_boosting>`.
 
     Parameters
     ----------
@@ -792,13 +787,7 @@ class HistGradientBoostingClassifier(BaseHistGradientBoosting,
     This estimator is much faster than
     :class:`GradientBoostingClassifier`
-    for big datasets (n_samples >= 10 000). The input data ``X`` is pre-binned
-    into integer-valued bins, which considerably reduces the number of
-    splitting points to consider, and allows the algorithm to leverage
-    integer-based data structures. For small sample sizes,
-    :class:`GradientBoostingClassifier`
-    might be preferred since binning may lead to split points that are too
-    approximate in this setting.
+    for big datasets (n_samples >= 10 000).
 
     This implementation is inspired by
     `LightGBM <https://github.com/Microsoft/LightGBM>`_.
@@ -814,6 +803,8 @@ class HistGradientBoostingClassifier(BaseHistGradientBoosting,
     >>> # now you can import normally from ensemble
     >>> from sklearn.ensemble import HistGradientBoostingClassifier
 
+    Read more in the :ref:`User Guide <histogram_based_gradient_boosting>`.
+
     Parameters
     ----------
     loss : {'auto', 'binary_crossentropy', 'categorical_crossentropy'}, \