From 60070491ca2d8c724693ddec0f0d1deb05a64540 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Tue, 30 Jul 2019 17:08:11 -0400 Subject: [PATCH 01/14] User guide for histogram based GBDTs --- doc/modules/ensemble.rst | 189 ++++++++++++++++++++++++++++++--------- 1 file changed, 146 insertions(+), 43 deletions(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index 42ba4c5438675..cb38c21af0277 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -440,61 +440,39 @@ Gradient Tree Boosting ====================== `Gradient Tree Boosting `_ -or Gradient Boosted Regression Trees (GBRT) is a generalization +or Gradient Boosted Degression Trees (GBDT) is a generalization of boosting to arbitrary -differentiable loss functions. GBRT is an accurate and effective +differentiable loss functions. GBDT is an accurate and effective off-the-shelf procedure that can be used for both regression and -classification problems. Gradient Tree Boosting models are used in a +classification problems in a variety of areas including Web search ranking and ecology. -The advantages of GBRT are: - - + Natural handling of data of mixed type (= heterogeneous features) - - + Predictive power - - + Robustness to outliers in output space (via robust loss functions) - -The disadvantages of GBRT are: - - + Scalability, due to the sequential nature of boosting it can - hardly be parallelized. - The module :mod:`sklearn.ensemble` provides methods -for both classification and regression via gradient boosted regression +for both classification and regression via gradient boosted decision trees. - .. note:: Scikit-learn 0.21 introduces two new experimental implementation of gradient boosting trees, namely :class:`HistGradientBoostingClassifier` and :class:`HistGradientBoostingRegressor`, inspired by - `LightGBM `_. These fast estimators - first bin the input samples ``X`` into integer-valued bins (typically 256 - bins) which tremendously reduces the number of splitting points to - consider, and allow the algorithm to leverage integer-based data - structures (histograms) instead of relying on sorted continuous values. - - The new histogram-based estimators can be orders of magnitude faster than - their continuous counterparts when the number of samples is larger than - tens of thousands of samples. The API of these new estimators is slightly - different, and some of the features from :class:`GradientBoostingClassifier` - and :class:`GradientBoostingRegressor` are not yet supported. - - These new estimators are still **experimental** for now: their predictions - and their API might change without any deprecation cycle. To use them, you - need to explicitly import ``enable_hist_gradient_boosting``:: - - >>> # explicitly require this experimental feature - >>> from sklearn.experimental import enable_hist_gradient_boosting # noqa - >>> # now you can import normally from ensemble - >>> from sklearn.ensemble import HistGradientBoostingClassifier + `LightGBM `_. + + These new histogram-based estimators can be **orders of magnitude faster** + than :class:`GradientBoostingClassifier` and + :class:`GradientBoostingRegressor` when the number of samples is larger + than tens of thousands of samples. + + They also have built-in support for missing values, which avoids the need + for an imputer. Support for categorical features is also part of the + roadmap. + + These estimators are described in more detail below in + :ref:`histogram_based_gradient_boosting`. 
The following guide focuses on :class:`GradientBoostingClassifier` and - :class:`GradientBoostingRegressor` only, which might be preferred for small - sample sizes since binning may lead to split points that are too approximate - in this setting. + :class:`GradientBoostingRegressor`, which might be preferred for small + sample sizes. Classification @@ -526,7 +504,8 @@ The number of weak learners (i.e. regression trees) is controlled by the paramet thus, the total number of induced trees equals ``n_classes * n_estimators``. For datasets with a large number of classes we strongly recommend to use - :class:`RandomForestClassifier` as an alternative to :class:`GradientBoostingClassifier` . + :class:`HistGradientBoostingClassifier` as an alternative to + :class:`GradientBoostingClassifier` . Regression ---------- @@ -838,7 +817,131 @@ accessed via the ``feature_importances_`` property:: * :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_regression.py` - .. _voting_classifier: +.. _histogram_based_gradient_boosting: + +Histogram-Based Gradient Boosting +================================= + +Scikit-learn 0.21 introduces two new experimental implementation of +gradient boosting trees, namely :class:`HistGradientBoostingClassifier` +and :class:`HistGradientBoostingRegressor`, inspired by +`LightGBM `_. + +These new histogram-based estimators can be **orders of magnitude faster** +than :class:`GradientBoostingClassifier` and +:class:`GradientBoostingRegressor` when the number of samples is larger +than tens of thousands of samples. + +They also have built-in support for missing values, which avoids the need +for an imputer. Support for categorical features is also part of the +roadmap. Moreover, early-stopping is enabled by default. + +These fast estimators first bin the input samples ``X`` into +integer-valued bins (typically 256 bins) which tremendously reduces the +number of splitting points to consider, and allow the algorithm to +leverage integer-based data structures (histograms) instead of relying on +sorted continuous values when building the trees. The API of these new +estimators is slightly different, and some of the features from +:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor` +are not yet supported: in particular sample weights, and some loss +functions. + +These new estimators are still **experimental** for now: their predictions +and their API might change without any deprecation cycle. To use them, you +need to explicitly import ``enable_hist_gradient_boosting``:: + + >>> # explicitly require this experimental feature + >>> from sklearn.experimental import enable_hist_gradient_boosting # noqa + >>> # now you can import normally from ensemble + >>> from sklearn.ensemble import HistGradientBoostingClassifier + +.. topic:: Examples: + + * :ref:`sphx_glr_auto_examples_inspection_plot_partial_dependence.py` + +Usage +----- + +Most of the parameters are unchanged from +:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`. 
+One exception is the ``max_iter`` parameter that replaces ``n_estimators``.: + + >>> from sklearn.experimental import enable_hist_gradient_boosting + >>> from sklearn.ensemble import HistGradientBoostingClassifier + >>> from sklearn.datasets import make_hastie_10_2 + + >>> X, y = make_hastie_10_2(random_state=0) + >>> X_train, X_test = X[:2000], X[2000:] + >>> y_train, y_test = y[:2000], y[2000:] + >>> clf = HistGradientBoostingClassifier(max_iter=100).fit(X_train, y_train) + + >>> clf.score(X_test, y_test) + 0.8998 + +The size of the trees can be controlled through the ``max_lead_nodees``, +``max_depth``, and ``min_samples_leaf`` parameters. + +The number of bins used to bin the data is controlled with the ``max_bins`` +parameter. Using less bins acts as some sort of regularization. It is +generally recommended to use as many bins as possible. + +The ``l2_regularization`` parameter is a regularizer on the loss function and +corresponds to :math:`\lambda` in equation (2) of [XGBoost]_. + +Note that unlike most estimators, **early-stopping is enabled by default**. +The early-stopping behaviour is controlled via the ``scoring``, +``validation_fraction``, ``n_iter_no_change``, and ``tol`` parameters. It is +possible to early-stop using an arbitrary :term:`scorer`, or just the +training or validation loss. + +Missing values support +---------------------- + +:class:`HistGradientBoostingClassifier` and +:class:`HistGradientBoostingRegressor` have built-in support for missing +values (NaNs). + +During training, the tree grower learns at each split point whether samples +with missing values should go to the left or right child, based on the +potential gain. When predicting, samples with missing values are assigned to +the left or right child consequently. + + +.. TODO: Add this example when missing values PR is merged (results are +.. wrong for now) + +.. from sklearn.experimental import enable_hist_gradient_boosting +.. from sklearn.ensemble import HistGradientBoostingRegressor +.. import numpy as np + +.. X = np.array([0, 1, 2, np.nan]).reshape(-1, 1) +.. y = [0, 0, 1, 1] +.. gbdt = HistGradientBoostingRegressor().fit(X, y) +.. gbdt.predict(X) + +If no missing values were encountered for a given feature during training, +then samples with missing values are mapped to whichever child has the most +samples. + + +Low-level parallelism +--------------------- + +:class:`HistGradientBoostingClassifier` and +:class:`HistGradientBoostingRegressor` have parallel implementations that +use OpenMP through Cython. The number of threads that is used can be changed +using the ``OMP_NUM_THREADS`` environment variable. Please refer to the +OpenMP documentation for details. We are planning on adding a ``n_jobs`` +parameter (or equivalent) in a future version. + +.. topic:: References + + .. [XGBoost] Tianqi Chen, Carlos Guestrin, "XGBoost: A Scalable Tree + Boosting System". https://arxiv.org/abs/1603.02754 + .. [LightGBM] Ke et. al. "LightGBM: A Highly Efficient Gradient + BoostingDecision Tree" + +.. 
_voting_classifier: Voting Classifier ======================== From 9401fa6c1c97df7f6dcef4029abce0c6a03eb811 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Tue, 30 Jul 2019 17:14:21 -0400 Subject: [PATCH 02/14] Added backlinks to user guide in classes --- .../gradient_boosting.py | 19 +++++-------------- 1 file changed, 5 insertions(+), 14 deletions(-) diff --git a/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py b/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py index dc040ed1fa409..ebd81cb77a2fc 100644 --- a/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py +++ b/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py @@ -619,13 +619,7 @@ class HistGradientBoostingRegressor(BaseHistGradientBoosting, RegressorMixin): This estimator is much faster than :class:`GradientBoostingRegressor` - for big datasets (n_samples >= 10 000). The input data ``X`` is pre-binned - into integer-valued bins, which considerably reduces the number of - splitting points to consider, and allows the algorithm to leverage - integer-based data structures. For small sample sizes, - :class:`GradientBoostingRegressor` - might be preferred since binning may lead to split points that are too - approximate in this setting. + for big datasets (n_samples >= 10 000). This implementation is inspired by `LightGBM `_. @@ -641,6 +635,7 @@ class HistGradientBoostingRegressor(BaseHistGradientBoosting, RegressorMixin): >>> # now you can import normally from ensemble >>> from sklearn.ensemble import HistGradientBoostingClassifier + Read more in the :ref:`User Guide `. Parameters ---------- @@ -792,13 +787,7 @@ class HistGradientBoostingClassifier(BaseHistGradientBoosting, This estimator is much faster than :class:`GradientBoostingClassifier` - for big datasets (n_samples >= 10 000). The input data ``X`` is pre-binned - into integer-valued bins, which considerably reduces the number of - splitting points to consider, and allows the algorithm to leverage - integer-based data structures. For small sample sizes, - :class:`GradientBoostingClassifier` - might be preferred since binning may lead to split points that are too - approximate in this setting. + for big datasets (n_samples >= 10 000). This implementation is inspired by `LightGBM `_. @@ -814,6 +803,8 @@ class HistGradientBoostingClassifier(BaseHistGradientBoosting, >>> # now you can import normally from ensemble >>> from sklearn.ensemble import HistGradientBoostingClassifier + Read more in the :ref:`User Guide `. + Parameters ---------- loss : {'auto', 'binary_crossentropy', 'categorical_crossentropy'}, \ From 42a226f8ae196c2e2be237d8a966fce562cbd436 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Tue, 30 Jul 2019 17:23:47 -0400 Subject: [PATCH 03/14] typos --- doc/modules/ensemble.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index cb38c21af0277..d9a77b21a8ff2 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -453,7 +453,7 @@ trees. .. note:: - Scikit-learn 0.21 introduces two new experimental implementation of + Scikit-learn 0.21 introduces two new experimental implementations of gradient boosting trees, namely :class:`HistGradientBoostingClassifier` and :class:`HistGradientBoostingRegressor`, inspired by `LightGBM `_. 
@@ -822,7 +822,7 @@ accessed via the ``feature_importances_`` property:: Histogram-Based Gradient Boosting ================================= -Scikit-learn 0.21 introduces two new experimental implementation of +Scikit-learn 0.21 introduces two new experimental implementations of gradient boosting trees, namely :class:`HistGradientBoostingClassifier` and :class:`HistGradientBoostingRegressor`, inspired by `LightGBM `_. From b4556afefbe239b5713a67c3c53746b22101cdc5 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Wed, 31 Jul 2019 09:13:10 -0400 Subject: [PATCH 04/14] Roman's comments --- doc/modules/ensemble.rst | 41 ++++++++++++++++++++++------------------ 1 file changed, 23 insertions(+), 18 deletions(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index d9a77b21a8ff2..9df25256458a8 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -440,7 +440,7 @@ Gradient Tree Boosting ====================== `Gradient Tree Boosting `_ -or Gradient Boosted Degression Trees (GBDT) is a generalization +or Gradient Boosted Decision Trees (GBDT) is a generalization of boosting to arbitrary differentiable loss functions. GBDT is an accurate and effective off-the-shelf procedure that can be used for both regression and @@ -458,7 +458,7 @@ trees. and :class:`HistGradientBoostingRegressor`, inspired by `LightGBM `_. - These new histogram-based estimators can be **orders of magnitude faster** + These histogram-based estimators can be **orders of magnitude faster** than :class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor` when the number of samples is larger than tens of thousands of samples. @@ -472,7 +472,8 @@ trees. The following guide focuses on :class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`, which might be preferred for small - sample sizes. + sample sizes, since binning may lead to split points that are too + approximate. Classification @@ -827,7 +828,7 @@ gradient boosting trees, namely :class:`HistGradientBoostingClassifier` and :class:`HistGradientBoostingRegressor`, inspired by `LightGBM `_. -These new histogram-based estimators can be **orders of magnitude faster** +These histogram-based estimators can be **orders of magnitude faster** than :class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor` when the number of samples is larger than tens of thousands of samples. @@ -838,15 +839,15 @@ roadmap. Moreover, early-stopping is enabled by default. These fast estimators first bin the input samples ``X`` into integer-valued bins (typically 256 bins) which tremendously reduces the -number of splitting points to consider, and allow the algorithm to +number of splitting points to consider, and allows the algorithm to leverage integer-based data structures (histograms) instead of relying on -sorted continuous values when building the trees. The API of these new +sorted continuous values when building the trees. The API of these estimators is slightly different, and some of the features from :class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor` are not yet supported: in particular sample weights, and some loss functions. -These new estimators are still **experimental** for now: their predictions +These estimators are still **experimental**: their predictions and their API might change without any deprecation cycle. 
To use them, you need to explicitly import ``enable_hist_gradient_boosting``:: @@ -864,7 +865,8 @@ Usage Most of the parameters are unchanged from :class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`. -One exception is the ``max_iter`` parameter that replaces ``n_estimators``.: +One exception is the ``max_iter`` parameter that replaces ``n_estimators``, and +controls the number of iterations of the boosting process: >>> from sklearn.experimental import enable_hist_gradient_boosting >>> from sklearn.ensemble import HistGradientBoostingClassifier @@ -882,17 +884,19 @@ The size of the trees can be controlled through the ``max_lead_nodees``, ``max_depth``, and ``min_samples_leaf`` parameters. The number of bins used to bin the data is controlled with the ``max_bins`` -parameter. Using less bins acts as some sort of regularization. It is -generally recommended to use as many bins as possible. +parameter. Using less bins acts as a form of regularization. It is +generally recommended to use as many bins as possible, which is the default: +255 bins for non-missing values. The ``l2_regularization`` parameter is a regularizer on the loss function and corresponds to :math:`\lambda` in equation (2) of [XGBoost]_. -Note that unlike most estimators, **early-stopping is enabled by default**. -The early-stopping behaviour is controlled via the ``scoring``, -``validation_fraction``, ``n_iter_no_change``, and ``tol`` parameters. It is -possible to early-stop using an arbitrary :term:`scorer`, or just the -training or validation loss. +Note that **early-stopping is enabled by default**. The early-stopping +behaviour is controlled via the ``scoring``, ``validation_fraction``, +``n_iter_no_change``, and ``tol`` parameters. It is possible to early-stop +using an arbitrary :term:`scorer`, or just the training or validation loss. By +default, early-stopping is performed using the the default :term:`scorer` of +the estimator on a validation set. Missing values support ---------------------- @@ -930,9 +934,10 @@ Low-level parallelism :class:`HistGradientBoostingClassifier` and :class:`HistGradientBoostingRegressor` have parallel implementations that use OpenMP through Cython. The number of threads that is used can be changed -using the ``OMP_NUM_THREADS`` environment variable. Please refer to the -OpenMP documentation for details. We are planning on adding a ``n_jobs`` -parameter (or equivalent) in a future version. +using the ``OMP_NUM_THREADS`` environment variable. By default, all available +cores are used. Please refer to the OpenMP documentation for details. We are +planning on adding a ``n_jobs`` parameter (or equivalent) in a future +version. .. topic:: References From 44ee88fe47b1a388bb469398be4418a5395637b2 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Wed, 31 Jul 2019 09:17:36 -0400 Subject: [PATCH 05/14] reduce diff --- doc/modules/ensemble.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index 9df25256458a8..d00b3661cf20c 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -472,8 +472,8 @@ trees. The following guide focuses on :class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`, which might be preferred for small - sample sizes, since binning may lead to split points that are too - approximate. + sample sizes since binning may lead to split points that are too approximate + in this setting. 
Classification From 0768e6bb59d72ffd55ef79bd585e15768142a419 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Wed, 31 Jul 2019 09:21:38 -0400 Subject: [PATCH 06/14] Update doc/modules/ensemble.rst --- doc/modules/ensemble.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index d00b3661cf20c..582c8fede2de2 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -880,7 +880,7 @@ controls the number of iterations of the boosting process: >>> clf.score(X_test, y_test) 0.8998 -The size of the trees can be controlled through the ``max_lead_nodees``, +The size of the trees can be controlled through the ``max_leaf_nodes``, ``max_depth``, and ``min_samples_leaf`` parameters. The number of bins used to bin the data is controlled with the ``max_bins`` From f4950c62085f4cc50405e63fa1cee016c3e56a0f Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Thu, 1 Aug 2019 08:48:37 -0400 Subject: [PATCH 07/14] Addressed comments --- doc/modules/ensemble.rst | 19 ++++++++----------- 1 file changed, 8 insertions(+), 11 deletions(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index d00b3661cf20c..e8ea6406c486c 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -464,8 +464,7 @@ trees. than tens of thousands of samples. They also have built-in support for missing values, which avoids the need - for an imputer. Support for categorical features is also part of the - roadmap. + for an imputer. These estimators are described in more detail below in :ref:`histogram_based_gradient_boosting`. @@ -834,8 +833,7 @@ than :class:`GradientBoostingClassifier` and than tens of thousands of samples. They also have built-in support for missing values, which avoids the need -for an imputer. Support for categorical features is also part of the -roadmap. Moreover, early-stopping is enabled by default. +for an imputer. These fast estimators first bin the input samples ``X`` into integer-valued bins (typically 256 bins) which tremendously reduces the @@ -886,7 +884,7 @@ The size of the trees can be controlled through the ``max_lead_nodees``, The number of bins used to bin the data is controlled with the ``max_bins`` parameter. Using less bins acts as a form of regularization. It is generally recommended to use as many bins as possible, which is the default: -255 bins for non-missing values. +255 bins. The ``l2_regularization`` parameter is a regularizer on the loss function and corresponds to :math:`\lambda` in equation (2) of [XGBoost]_. @@ -932,12 +930,11 @@ Low-level parallelism --------------------- :class:`HistGradientBoostingClassifier` and -:class:`HistGradientBoostingRegressor` have parallel implementations that -use OpenMP through Cython. The number of threads that is used can be changed -using the ``OMP_NUM_THREADS`` environment variable. By default, all available -cores are used. Please refer to the OpenMP documentation for details. We are -planning on adding a ``n_jobs`` parameter (or equivalent) in a future -version. +:class:`HistGradientBoostingRegressor` have implementations that use OpenMP +for parallelization through Cython. The number of threads that is used can +be changed using the ``OMP_NUM_THREADS`` environment variable. By default, +all available cores are used. Please refer to the OpenMP documentation for +details. .. 
topic:: References From d3e6ca0cf7188ebae4e013a998679e65066d2471 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Thu, 1 Aug 2019 09:05:44 -0400 Subject: [PATCH 08/14] Added why it's faster explanation --- doc/modules/ensemble.rst | 33 ++++++++++++++++++++++++++++++++- 1 file changed, 32 insertions(+), 1 deletion(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index e8ea6406c486c..cd9c9d4bff9b2 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -925,7 +925,6 @@ If no missing values were encountered for a given feature during training, then samples with missing values are mapped to whichever child has the most samples. - Low-level parallelism --------------------- @@ -943,6 +942,38 @@ details. .. [LightGBM] Ke et. al. "LightGBM: A Highly Efficient Gradient BoostingDecision Tree" +Why it's faster +--------------- + +The bottleneck of a gradient boosting procedure is building the decision +trees. Building a traditional decision tree (as in the other GBDTs +:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`) +requires sorting the feature values of all the samples at each node (for +each feature). Sorting is needed so that the potential gain of a split point +can be computed efficiently. Splitting a single node has thus a complexity +of :math:`\mathcal{O}(\text{n_features} * n \log(n))` where :math:`n` is the +number of samples at the node. + +:class:`HistGradientBoostingClassifier` and +:class:`HistGradientBoostingRegressor`, in contrast, do not require sorting the +feature values and instead use a data-structure called a histogram, where the +samples are implicitly ordered. Building a histogram has a +:math:`\mathcal{O}(n)` complexity, so the node splitting procedure has a +:math:`\mathcal{O}(\text{n_features} * n)` complexity, much smaller than the +previous one. In addition, instead of considering :math:`n` split points, we +here consider only ``max_bins`` split points, which is much smaller. + +In order to build histograms, the input data `X` needs to be binned into +integer-valued bins. This binning procedure does require sorting the feature +values, but it only happens once at the very beginning of the boosting process +(not at each node, like in :class:`GradientBoostingClassifier` and +:class:`GradientBoostingRegressor`). + +Finally, many parts of the implementation of +:class:`HistGradientBoostingClassifier` and +:class:`HistGradientBoostingRegressor` are parallelized. + + .. _voting_classifier: Voting Classifier From 9b006dc2b172fdfe058c47a76199f5ef412e6144 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Thu, 1 Aug 2019 10:06:55 -0400 Subject: [PATCH 09/14] references at the end --- doc/modules/ensemble.rst | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index 628b3f6764a43..eaae7f5c1462b 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -935,13 +935,6 @@ be changed using the ``OMP_NUM_THREADS`` environment variable. By default, all available cores are used. Please refer to the OpenMP documentation for details. -.. topic:: References - - .. [XGBoost] Tianqi Chen, Carlos Guestrin, "XGBoost: A Scalable Tree - Boosting System". https://arxiv.org/abs/1603.02754 - .. [LightGBM] Ke et. al. 
"LightGBM: A Highly Efficient Gradient - BoostingDecision Tree" - Why it's faster --------------- @@ -973,6 +966,12 @@ Finally, many parts of the implementation of :class:`HistGradientBoostingClassifier` and :class:`HistGradientBoostingRegressor` are parallelized. +.. topic:: References + + .. [XGBoost] Tianqi Chen, Carlos Guestrin, "XGBoost: A Scalable Tree + Boosting System". https://arxiv.org/abs/1603.02754 + .. [LightGBM] Ke et. al. "LightGBM: A Highly Efficient Gradient + BoostingDecision Tree" .. _voting_classifier: From 432484ee3034cb3322dbf490d416764b7d37f1d3 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Fri, 2 Aug 2019 06:44:25 -0400 Subject: [PATCH 10/14] Update doc/modules/ensemble.rst --- doc/modules/ensemble.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index eaae7f5c1462b..2fbe05984696d 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -941,7 +941,7 @@ Why it's faster The bottleneck of a gradient boosting procedure is building the decision trees. Building a traditional decision tree (as in the other GBDTs :class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`) -requires sorting the feature values of all the samples at each node (for +requires sorting the values of all the samples at each node (for each feature). Sorting is needed so that the potential gain of a split point can be computed efficiently. Splitting a single node has thus a complexity of :math:`\mathcal{O}(\text{n_features} * n \log(n))` where :math:`n` is the From 99723e14650aeb7e31dcc1cbf62e89406a708208 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Fri, 2 Aug 2019 06:45:08 -0400 Subject: [PATCH 11/14] Update doc/modules/ensemble.rst --- doc/modules/ensemble.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index 2fbe05984696d..42120318abd10 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -941,7 +941,7 @@ Why it's faster The bottleneck of a gradient boosting procedure is building the decision trees. Building a traditional decision tree (as in the other GBDTs :class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`) -requires sorting the values of all the samples at each node (for +requires sorting the samples at each node (for each feature). Sorting is needed so that the potential gain of a split point can be computed efficiently. Splitting a single node has thus a complexity of :math:`\mathcal{O}(\text{n_features} * n \log(n))` where :math:`n` is the From e288542ee2ed5b48da638179b8c88df6c6fbe28e Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Fri, 2 Aug 2019 09:34:33 -0400 Subject: [PATCH 12/14] Update doc/modules/ensemble.rst --- doc/modules/ensemble.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index 42120318abd10..0a199a3a7433f 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -893,7 +893,7 @@ Note that **early-stopping is enabled by default**. The early-stopping behaviour is controlled via the ``scoring``, ``validation_fraction``, ``n_iter_no_change``, and ``tol`` parameters. It is possible to early-stop using an arbitrary :term:`scorer`, or just the training or validation loss. By -default, early-stopping is performed using the the default :term:`scorer` of +default, early-stopping is performed using the default :term:`scorer` of the estimator on a validation set. 
Missing values support From d8705f6f7c092056d493a3e97856013f2a21d95e Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Fri, 2 Aug 2019 11:07:58 -0400 Subject: [PATCH 13/14] added where code is parallel --- doc/modules/ensemble.rst | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index eaae7f5c1462b..b38b1ff2ccf9e 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -883,8 +883,7 @@ The size of the trees can be controlled through the ``max_leaf_nodes``, The number of bins used to bin the data is controlled with the ``max_bins`` parameter. Using less bins acts as a form of regularization. It is -generally recommended to use as many bins as possible, which is the default: -255 bins. +generally recommended to use as many bins as possible, which is the default. The ``l2_regularization`` parameter is a regularizer on the loss function and corresponds to :math:`\lambda` in equation (2) of [XGBoost]_. @@ -935,6 +934,17 @@ be changed using the ``OMP_NUM_THREADS`` environment variable. By default, all available cores are used. Please refer to the OpenMP documentation for details. +The following parts are parallelized: + +- mapping samples from real values to integer-valued bins (finding the bin + thresholds is however sequential) +- building histograms is parallelized over features +- finding the best split point at a node is parallelized over features +- during fit, mapping samples into the left and right children is + parallelized over samples +- gradient and hessians computations are parallelized over samples +- predicting is parallelized over samples + Why it's faster --------------- From 8f28dda84b6f056798af7e7f7245eaefb388d8a0 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Mon, 5 Aug 2019 08:15:10 -0400 Subject: [PATCH 14/14] removed missing values section --- doc/modules/ensemble.rst | 29 ----------------------------- 1 file changed, 29 deletions(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index 8a4c223999983..e1bcf47b8ff7b 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -895,35 +895,6 @@ using an arbitrary :term:`scorer`, or just the training or validation loss. By default, early-stopping is performed using the default :term:`scorer` of the estimator on a validation set. -Missing values support ----------------------- - -:class:`HistGradientBoostingClassifier` and -:class:`HistGradientBoostingRegressor` have built-in support for missing -values (NaNs). - -During training, the tree grower learns at each split point whether samples -with missing values should go to the left or right child, based on the -potential gain. When predicting, samples with missing values are assigned to -the left or right child consequently. - - -.. TODO: Add this example when missing values PR is merged (results are -.. wrong for now) - -.. from sklearn.experimental import enable_hist_gradient_boosting -.. from sklearn.ensemble import HistGradientBoostingRegressor -.. import numpy as np - -.. X = np.array([0, 1, 2, np.nan]).reshape(-1, 1) -.. y = [0, 0, 1, 1] -.. gbdt = HistGradientBoostingRegressor().fit(X, y) -.. gbdt.predict(X) - -If no missing values were encountered for a given feature during training, -then samples with missing values are mapped to whichever child has the most -samples. - Low-level parallelism ---------------------
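As a closing illustration of the usage and early-stopping behaviour documented by this patch series, the following short sketch is provided. It is not part of the patches themselves; it only relies on the experimental import shown in the guide and on parameters already named there (``max_iter``, ``max_leaf_nodes``, ``max_bins``, ``l2_regularization``, ``scoring``, ``validation_fraction``, ``n_iter_no_change``, ``tol``), and the ``n_iter_`` attribute is assumed from the estimator API::

    # Illustrative sketch (not part of the patches above): explicit
    # early-stopping configuration for a histogram-based GBDT.
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.datasets import make_hastie_10_2

    X, y = make_hastie_10_2(random_state=0)
    X_train, X_test = X[:2000], X[2000:]
    y_train, y_test = y[:2000], y[2000:]

    clf = HistGradientBoostingClassifier(
        max_iter=200,             # replaces n_estimators: max number of boosting iterations
        max_leaf_nodes=31,        # tree size, together with max_depth and min_samples_leaf
        max_bins=255,             # number of bins used to discretize X
        l2_regularization=1.0,    # lambda in equation (2) of the XGBoost paper
        scoring='loss',           # early-stop on the loss rather than on a scorer
        validation_fraction=0.1,  # fraction of training data held out for early stopping
        n_iter_no_change=5,       # stop after 5 iterations without improvement
        tol=1e-7,
    )
    clf.fit(X_train, y_train)

    # If early stopping was triggered, n_iter_ is smaller than max_iter.
    print(clf.n_iter_, clf.score(X_test, y_test))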
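The complexity argument in the "Why it's faster" section can also be checked empirically. The sketch below is likewise illustrative rather than part of the patches; it times the fit of both implementations on the same data, and the observed speed-up will depend on the machine, on ``n_samples``, and on whether early stopping (enabled by default for the histogram-based estimator) kicks in::

    # Illustrative timing sketch: exact-split GBDT vs. histogram-based GBDT.
    from time import perf_counter

    from sklearn.experimental import enable_hist_gradient_boosting  # noqa
    from sklearn.ensemble import (GradientBoostingClassifier,
                                  HistGradientBoostingClassifier)
    from sklearn.datasets import make_hastie_10_2

    # Tens of thousands of samples: the regime where binning pays off.
    X, y = make_hastie_10_2(n_samples=50000, random_state=0)

    for est in (GradientBoostingClassifier(n_estimators=100),
                HistGradientBoostingClassifier(max_iter=100)):
        tic = perf_counter()
        est.fit(X, y)
        print(type(est).__name__, '%.1fs' % (perf_counter() - tic))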