From 60070491ca2d8c724693ddec0f0d1deb05a64540 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Tue, 30 Jul 2019 17:08:11 -0400 Subject: [PATCH 01/14] User guide for histogram based GBDTs --- doc/modules/ensemble.rst | 189 ++++++++++++++++++++++++++++++--------- 1 file changed, 146 insertions(+), 43 deletions(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index 42ba4c5438675..cb38c21af0277 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -440,61 +440,39 @@ Gradient Tree Boosting ====================== `Gradient Tree Boosting `_ -or Gradient Boosted Regression Trees (GBRT) is a generalization +or Gradient Boosted Degression Trees (GBDT) is a generalization of boosting to arbitrary -differentiable loss functions. GBRT is an accurate and effective +differentiable loss functions. GBDT is an accurate and effective off-the-shelf procedure that can be used for both regression and -classification problems. Gradient Tree Boosting models are used in a +classification problems in a variety of areas including Web search ranking and ecology. -The advantages of GBRT are: - - + Natural handling of data of mixed type (= heterogeneous features) - - + Predictive power - - + Robustness to outliers in output space (via robust loss functions) - -The disadvantages of GBRT are: - - + Scalability, due to the sequential nature of boosting it can - hardly be parallelized. - The module :mod:`sklearn.ensemble` provides methods -for both classification and regression via gradient boosted regression +for both classification and regression via gradient boosted decision trees. - .. note:: Scikit-learn 0.21 introduces two new experimental implementation of gradient boosting trees, namely :class:`HistGradientBoostingClassifier` and :class:`HistGradientBoostingRegressor`, inspired by - `LightGBM `_. These fast estimators - first bin the input samples ``X`` into integer-valued bins (typically 256 - bins) which tremendously reduces the number of splitting points to - consider, and allow the algorithm to leverage integer-based data - structures (histograms) instead of relying on sorted continuous values. - - The new histogram-based estimators can be orders of magnitude faster than - their continuous counterparts when the number of samples is larger than - tens of thousands of samples. The API of these new estimators is slightly - different, and some of the features from :class:`GradientBoostingClassifier` - and :class:`GradientBoostingRegressor` are not yet supported. - - These new estimators are still **experimental** for now: their predictions - and their API might change without any deprecation cycle. To use them, you - need to explicitly import ``enable_hist_gradient_boosting``:: - - >>> # explicitly require this experimental feature - >>> from sklearn.experimental import enable_hist_gradient_boosting # noqa - >>> # now you can import normally from ensemble - >>> from sklearn.ensemble import HistGradientBoostingClassifier + `LightGBM `_. + + These new histogram-based estimators can be **orders of magnitude faster** + than :class:`GradientBoostingClassifier` and + :class:`GradientBoostingRegressor` when the number of samples is larger + than tens of thousands of samples. + + They also have built-in support for missing values, which avoids the need + for an imputer. Support for categorical features is also part of the + roadmap. + + These estimators are described in more detail below in + :ref:`histogram_based_gradient_boosting`. 
The following guide focuses on :class:`GradientBoostingClassifier` and - :class:`GradientBoostingRegressor` only, which might be preferred for small - sample sizes since binning may lead to split points that are too approximate - in this setting. + :class:`GradientBoostingRegressor`, which might be preferred for small + sample sizes. Classification @@ -526,7 +504,8 @@ The number of weak learners (i.e. regression trees) is controlled by the paramet thus, the total number of induced trees equals ``n_classes * n_estimators``. For datasets with a large number of classes we strongly recommend to use - :class:`RandomForestClassifier` as an alternative to :class:`GradientBoostingClassifier` . + :class:`HistGradientBoostingClassifier` as an alternative to + :class:`GradientBoostingClassifier` . Regression ---------- @@ -838,7 +817,131 @@ accessed via the ``feature_importances_`` property:: * :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_regression.py` - .. _voting_classifier: +.. _histogram_based_gradient_boosting: + +Histogram-Based Gradient Boosting +================================= + +Scikit-learn 0.21 introduces two new experimental implementation of +gradient boosting trees, namely :class:`HistGradientBoostingClassifier` +and :class:`HistGradientBoostingRegressor`, inspired by +`LightGBM `_. + +These new histogram-based estimators can be **orders of magnitude faster** +than :class:`GradientBoostingClassifier` and +:class:`GradientBoostingRegressor` when the number of samples is larger +than tens of thousands of samples. + +They also have built-in support for missing values, which avoids the need +for an imputer. Support for categorical features is also part of the +roadmap. Moreover, early-stopping is enabled by default. + +These fast estimators first bin the input samples ``X`` into +integer-valued bins (typically 256 bins) which tremendously reduces the +number of splitting points to consider, and allow the algorithm to +leverage integer-based data structures (histograms) instead of relying on +sorted continuous values when building the trees. The API of these new +estimators is slightly different, and some of the features from +:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor` +are not yet supported: in particular sample weights, and some loss +functions. + +These new estimators are still **experimental** for now: their predictions +and their API might change without any deprecation cycle. To use them, you +need to explicitly import ``enable_hist_gradient_boosting``:: + + >>> # explicitly require this experimental feature + >>> from sklearn.experimental import enable_hist_gradient_boosting # noqa + >>> # now you can import normally from ensemble + >>> from sklearn.ensemble import HistGradientBoostingClassifier + +.. topic:: Examples: + + * :ref:`sphx_glr_auto_examples_inspection_plot_partial_dependence.py` + +Usage +----- + +Most of the parameters are unchanged from +:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`. 
+One exception is the ``max_iter`` parameter that replaces ``n_estimators``.: + + >>> from sklearn.experimental import enable_hist_gradient_boosting + >>> from sklearn.ensemble import HistGradientBoostingClassifier + >>> from sklearn.datasets import make_hastie_10_2 + + >>> X, y = make_hastie_10_2(random_state=0) + >>> X_train, X_test = X[:2000], X[2000:] + >>> y_train, y_test = y[:2000], y[2000:] + >>> clf = HistGradientBoostingClassifier(max_iter=100).fit(X_train, y_train) + + >>> clf.score(X_test, y_test) + 0.8998 + +The size of the trees can be controlled through the ``max_lead_nodees``, +``max_depth``, and ``min_samples_leaf`` parameters. + +The number of bins used to bin the data is controlled with the ``max_bins`` +parameter. Using less bins acts as some sort of regularization. It is +generally recommended to use as many bins as possible. + +The ``l2_regularization`` parameter is a regularizer on the loss function and +corresponds to :math:`\lambda` in equation (2) of [XGBoost]_. + +Note that unlike most estimators, **early-stopping is enabled by default**. +The early-stopping behaviour is controlled via the ``scoring``, +``validation_fraction``, ``n_iter_no_change``, and ``tol`` parameters. It is +possible to early-stop using an arbitrary :term:`scorer`, or just the +training or validation loss. + +Missing values support +---------------------- + +:class:`HistGradientBoostingClassifier` and +:class:`HistGradientBoostingRegressor` have built-in support for missing +values (NaNs). + +During training, the tree grower learns at each split point whether samples +with missing values should go to the left or right child, based on the +potential gain. When predicting, samples with missing values are assigned to +the left or right child consequently. + + +.. TODO: Add this example when missing values PR is merged (results are +.. wrong for now) + +.. from sklearn.experimental import enable_hist_gradient_boosting +.. from sklearn.ensemble import HistGradientBoostingRegressor +.. import numpy as np + +.. X = np.array([0, 1, 2, np.nan]).reshape(-1, 1) +.. y = [0, 0, 1, 1] +.. gbdt = HistGradientBoostingRegressor().fit(X, y) +.. gbdt.predict(X) + +If no missing values were encountered for a given feature during training, +then samples with missing values are mapped to whichever child has the most +samples. + + +Low-level parallelism +--------------------- + +:class:`HistGradientBoostingClassifier` and +:class:`HistGradientBoostingRegressor` have parallel implementations that +use OpenMP through Cython. The number of threads that is used can be changed +using the ``OMP_NUM_THREADS`` environment variable. Please refer to the +OpenMP documentation for details. We are planning on adding a ``n_jobs`` +parameter (or equivalent) in a future version. + +.. topic:: References + + .. [XGBoost] Tianqi Chen, Carlos Guestrin, "XGBoost: A Scalable Tree + Boosting System". https://arxiv.org/abs/1603.02754 + .. [LightGBM] Ke et. al. "LightGBM: A Highly Efficient Gradient + BoostingDecision Tree" + +.. 
_voting_classifier: Voting Classifier ======================== From 9401fa6c1c97df7f6dcef4029abce0c6a03eb811 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Tue, 30 Jul 2019 17:14:21 -0400 Subject: [PATCH 02/14] Added backlinks to user guide in classes --- .../gradient_boosting.py | 19 +++++-------------- 1 file changed, 5 insertions(+), 14 deletions(-) diff --git a/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py b/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py index dc040ed1fa409..ebd81cb77a2fc 100644 --- a/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py +++ b/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py @@ -619,13 +619,7 @@ class HistGradientBoostingRegressor(BaseHistGradientBoosting, RegressorMixin): This estimator is much faster than :class:`GradientBoostingRegressor` - for big datasets (n_samples >= 10 000). The input data ``X`` is pre-binned - into integer-valued bins, which considerably reduces the number of - splitting points to consider, and allows the algorithm to leverage - integer-based data structures. For small sample sizes, - :class:`GradientBoostingRegressor` - might be preferred since binning may lead to split points that are too - approximate in this setting. + for big datasets (n_samples >= 10 000). This implementation is inspired by `LightGBM `_. @@ -641,6 +635,7 @@ class HistGradientBoostingRegressor(BaseHistGradientBoosting, RegressorMixin): >>> # now you can import normally from ensemble >>> from sklearn.ensemble import HistGradientBoostingClassifier + Read more in the :ref:`User Guide `. Parameters ---------- @@ -792,13 +787,7 @@ class HistGradientBoostingClassifier(BaseHistGradientBoosting, This estimator is much faster than :class:`GradientBoostingClassifier` - for big datasets (n_samples >= 10 000). The input data ``X`` is pre-binned - into integer-valued bins, which considerably reduces the number of - splitting points to consider, and allows the algorithm to leverage - integer-based data structures. For small sample sizes, - :class:`GradientBoostingClassifier` - might be preferred since binning may lead to split points that are too - approximate in this setting. + for big datasets (n_samples >= 10 000). This implementation is inspired by `LightGBM `_. @@ -814,6 +803,8 @@ class HistGradientBoostingClassifier(BaseHistGradientBoosting, >>> # now you can import normally from ensemble >>> from sklearn.ensemble import HistGradientBoostingClassifier + Read more in the :ref:`User Guide `. + Parameters ---------- loss : {'auto', 'binary_crossentropy', 'categorical_crossentropy'}, \ From 42a226f8ae196c2e2be237d8a966fce562cbd436 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Tue, 30 Jul 2019 17:23:47 -0400 Subject: [PATCH 03/14] typos --- doc/modules/ensemble.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index cb38c21af0277..d9a77b21a8ff2 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -453,7 +453,7 @@ trees. .. note:: - Scikit-learn 0.21 introduces two new experimental implementation of + Scikit-learn 0.21 introduces two new experimental implementations of gradient boosting trees, namely :class:`HistGradientBoostingClassifier` and :class:`HistGradientBoostingRegressor`, inspired by `LightGBM `_. 
@@ -822,7 +822,7 @@ accessed via the ``feature_importances_`` property:: Histogram-Based Gradient Boosting ================================= -Scikit-learn 0.21 introduces two new experimental implementation of +Scikit-learn 0.21 introduces two new experimental implementations of gradient boosting trees, namely :class:`HistGradientBoostingClassifier` and :class:`HistGradientBoostingRegressor`, inspired by `LightGBM `_. From b4556afefbe239b5713a67c3c53746b22101cdc5 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Wed, 31 Jul 2019 09:13:10 -0400 Subject: [PATCH 04/14] Roman's comments --- doc/modules/ensemble.rst | 41 ++++++++++++++++++++++------------------ 1 file changed, 23 insertions(+), 18 deletions(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index d9a77b21a8ff2..9df25256458a8 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -440,7 +440,7 @@ Gradient Tree Boosting ====================== `Gradient Tree Boosting `_ -or Gradient Boosted Degression Trees (GBDT) is a generalization +or Gradient Boosted Decision Trees (GBDT) is a generalization of boosting to arbitrary differentiable loss functions. GBDT is an accurate and effective off-the-shelf procedure that can be used for both regression and @@ -458,7 +458,7 @@ trees. and :class:`HistGradientBoostingRegressor`, inspired by `LightGBM `_. - These new histogram-based estimators can be **orders of magnitude faster** + These histogram-based estimators can be **orders of magnitude faster** than :class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor` when the number of samples is larger than tens of thousands of samples. @@ -472,7 +472,8 @@ trees. The following guide focuses on :class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`, which might be preferred for small - sample sizes. + sample sizes, since binning may lead to split points that are too + approximate. Classification @@ -827,7 +828,7 @@ gradient boosting trees, namely :class:`HistGradientBoostingClassifier` and :class:`HistGradientBoostingRegressor`, inspired by `LightGBM `_. -These new histogram-based estimators can be **orders of magnitude faster** +These histogram-based estimators can be **orders of magnitude faster** than :class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor` when the number of samples is larger than tens of thousands of samples. @@ -838,15 +839,15 @@ roadmap. Moreover, early-stopping is enabled by default. These fast estimators first bin the input samples ``X`` into integer-valued bins (typically 256 bins) which tremendously reduces the -number of splitting points to consider, and allow the algorithm to +number of splitting points to consider, and allows the algorithm to leverage integer-based data structures (histograms) instead of relying on -sorted continuous values when building the trees. The API of these new +sorted continuous values when building the trees. The API of these estimators is slightly different, and some of the features from :class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor` are not yet supported: in particular sample weights, and some loss functions. -These new estimators are still **experimental** for now: their predictions +These estimators are still **experimental**: their predictions and their API might change without any deprecation cycle. 
To use them, you need to explicitly import ``enable_hist_gradient_boosting``:: @@ -864,7 +865,8 @@ Usage Most of the parameters are unchanged from :class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`. -One exception is the ``max_iter`` parameter that replaces ``n_estimators``.: +One exception is the ``max_iter`` parameter that replaces ``n_estimators``, and +controls the number of iterations of the boosting process: >>> from sklearn.experimental import enable_hist_gradient_boosting >>> from sklearn.ensemble import HistGradientBoostingClassifier @@ -882,17 +884,19 @@ The size of the trees can be controlled through the ``max_lead_nodees``, ``max_depth``, and ``min_samples_leaf`` parameters. The number of bins used to bin the data is controlled with the ``max_bins`` -parameter. Using less bins acts as some sort of regularization. It is -generally recommended to use as many bins as possible. +parameter. Using less bins acts as a form of regularization. It is +generally recommended to use as many bins as possible, which is the default: +255 bins for non-missing values. The ``l2_regularization`` parameter is a regularizer on the loss function and corresponds to :math:`\lambda` in equation (2) of [XGBoost]_. -Note that unlike most estimators, **early-stopping is enabled by default**. -The early-stopping behaviour is controlled via the ``scoring``, -``validation_fraction``, ``n_iter_no_change``, and ``tol`` parameters. It is -possible to early-stop using an arbitrary :term:`scorer`, or just the -training or validation loss. +Note that **early-stopping is enabled by default**. The early-stopping +behaviour is controlled via the ``scoring``, ``validation_fraction``, +``n_iter_no_change``, and ``tol`` parameters. It is possible to early-stop +using an arbitrary :term:`scorer`, or just the training or validation loss. By +default, early-stopping is performed using the the default :term:`scorer` of +the estimator on a validation set. Missing values support ---------------------- @@ -930,9 +934,10 @@ Low-level parallelism :class:`HistGradientBoostingClassifier` and :class:`HistGradientBoostingRegressor` have parallel implementations that use OpenMP through Cython. The number of threads that is used can be changed -using the ``OMP_NUM_THREADS`` environment variable. Please refer to the -OpenMP documentation for details. We are planning on adding a ``n_jobs`` -parameter (or equivalent) in a future version. +using the ``OMP_NUM_THREADS`` environment variable. By default, all available +cores are used. Please refer to the OpenMP documentation for details. We are +planning on adding a ``n_jobs`` parameter (or equivalent) in a future +version. .. topic:: References From 44ee88fe47b1a388bb469398be4418a5395637b2 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Wed, 31 Jul 2019 09:17:36 -0400 Subject: [PATCH 05/14] reduce diff --- doc/modules/ensemble.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index 9df25256458a8..d00b3661cf20c 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -472,8 +472,8 @@ trees. The following guide focuses on :class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`, which might be preferred for small - sample sizes, since binning may lead to split points that are too - approximate. + sample sizes since binning may lead to split points that are too approximate + in this setting. 
Classification From 0768e6bb59d72ffd55ef79bd585e15768142a419 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Wed, 31 Jul 2019 09:21:38 -0400 Subject: [PATCH 06/14] Update doc/modules/ensemble.rst --- doc/modules/ensemble.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index d00b3661cf20c..582c8fede2de2 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -880,7 +880,7 @@ controls the number of iterations of the boosting process: >>> clf.score(X_test, y_test) 0.8998 -The size of the trees can be controlled through the ``max_lead_nodees``, +The size of the trees can be controlled through the ``max_leaf_nodes``, ``max_depth``, and ``min_samples_leaf`` parameters. The number of bins used to bin the data is controlled with the ``max_bins`` From f4950c62085f4cc50405e63fa1cee016c3e56a0f Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Thu, 1 Aug 2019 08:48:37 -0400 Subject: [PATCH 07/14] Addressed comments --- doc/modules/ensemble.rst | 19 ++++++++----------- 1 file changed, 8 insertions(+), 11 deletions(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index d00b3661cf20c..e8ea6406c486c 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -464,8 +464,7 @@ trees. than tens of thousands of samples. They also have built-in support for missing values, which avoids the need - for an imputer. Support for categorical features is also part of the - roadmap. + for an imputer. These estimators are described in more detail below in :ref:`histogram_based_gradient_boosting`. @@ -834,8 +833,7 @@ than :class:`GradientBoostingClassifier` and than tens of thousands of samples. They also have built-in support for missing values, which avoids the need -for an imputer. Support for categorical features is also part of the -roadmap. Moreover, early-stopping is enabled by default. +for an imputer. These fast estimators first bin the input samples ``X`` into integer-valued bins (typically 256 bins) which tremendously reduces the @@ -886,7 +884,7 @@ The size of the trees can be controlled through the ``max_lead_nodees``, The number of bins used to bin the data is controlled with the ``max_bins`` parameter. Using less bins acts as a form of regularization. It is generally recommended to use as many bins as possible, which is the default: -255 bins for non-missing values. +255 bins. The ``l2_regularization`` parameter is a regularizer on the loss function and corresponds to :math:`\lambda` in equation (2) of [XGBoost]_. @@ -932,12 +930,11 @@ Low-level parallelism --------------------- :class:`HistGradientBoostingClassifier` and -:class:`HistGradientBoostingRegressor` have parallel implementations that -use OpenMP through Cython. The number of threads that is used can be changed -using the ``OMP_NUM_THREADS`` environment variable. By default, all available -cores are used. Please refer to the OpenMP documentation for details. We are -planning on adding a ``n_jobs`` parameter (or equivalent) in a future -version. +:class:`HistGradientBoostingRegressor` have implementations that use OpenMP +for parallelization through Cython. The number of threads that is used can +be changed using the ``OMP_NUM_THREADS`` environment variable. By default, +all available cores are used. Please refer to the OpenMP documentation for +details. .. 
topic:: References From d3e6ca0cf7188ebae4e013a998679e65066d2471 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Thu, 1 Aug 2019 09:05:44 -0400 Subject: [PATCH 08/14] Added why it's faster explanation --- doc/modules/ensemble.rst | 33 ++++++++++++++++++++++++++++++++- 1 file changed, 32 insertions(+), 1 deletion(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index e8ea6406c486c..cd9c9d4bff9b2 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -925,7 +925,6 @@ If no missing values were encountered for a given feature during training, then samples with missing values are mapped to whichever child has the most samples. - Low-level parallelism --------------------- @@ -943,6 +942,38 @@ details. .. [LightGBM] Ke et. al. "LightGBM: A Highly Efficient Gradient BoostingDecision Tree" +Why it's faster +--------------- + +The bottleneck of a gradient boosting procedure is building the decision +trees. Building a traditional decision tree (as in the other GBDTs +:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`) +requires sorting the feature values of all the samples at each node (for +each feature). Sorting is needed so that the potential gain of a split point +can be computed efficiently. Splitting a single node has thus a complexity +of :math:`\mathcal{O}(\text{n_features} * n \log(n))` where :math:`n` is the +number of samples at the node. + +:class:`HistGradientBoostingClassifier` and +:class:`HistGradientBoostingRegressor`, in contrast, do not require sorting the +feature values and instead use a data-structure called a histogram, where the +samples are implicitly ordered. Building a histogram has a +:math:`\mathcal{O}(n)` complexity, so the node splitting procedure has a +:math:`\mathcal{O}(\text{n_features} * n)` complexity, much smaller than the +previous one. In addition, instead of considering :math:`n` split points, we +here consider only ``max_bins`` split points, which is much smaller. + +In order to build histograms, the input data `X` needs to be binned into +integer-valued bins. This binning procedure does require sorting the feature +values, but it only happens once at the very beginning of the boosting process +(not at each node, like in :class:`GradientBoostingClassifier` and +:class:`GradientBoostingRegressor`). + +Finally, many parts of the implementation of +:class:`HistGradientBoostingClassifier` and +:class:`HistGradientBoostingRegressor` are parallelized. + + .. _voting_classifier: Voting Classifier From 9b006dc2b172fdfe058c47a76199f5ef412e6144 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Thu, 1 Aug 2019 10:06:55 -0400 Subject: [PATCH 09/14] references at the end --- doc/modules/ensemble.rst | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index 628b3f6764a43..eaae7f5c1462b 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -935,13 +935,6 @@ be changed using the ``OMP_NUM_THREADS`` environment variable. By default, all available cores are used. Please refer to the OpenMP documentation for details. -.. topic:: References - - .. [XGBoost] Tianqi Chen, Carlos Guestrin, "XGBoost: A Scalable Tree - Boosting System". https://arxiv.org/abs/1603.02754 - .. [LightGBM] Ke et. al. 
"LightGBM: A Highly Efficient Gradient - BoostingDecision Tree" - Why it's faster --------------- @@ -973,6 +966,12 @@ Finally, many parts of the implementation of :class:`HistGradientBoostingClassifier` and :class:`HistGradientBoostingRegressor` are parallelized. +.. topic:: References + + .. [XGBoost] Tianqi Chen, Carlos Guestrin, "XGBoost: A Scalable Tree + Boosting System". https://arxiv.org/abs/1603.02754 + .. [LightGBM] Ke et. al. "LightGBM: A Highly Efficient Gradient + BoostingDecision Tree" .. _voting_classifier: From 432484ee3034cb3322dbf490d416764b7d37f1d3 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Fri, 2 Aug 2019 06:44:25 -0400 Subject: [PATCH 10/14] Update doc/modules/ensemble.rst --- doc/modules/ensemble.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index eaae7f5c1462b..2fbe05984696d 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -941,7 +941,7 @@ Why it's faster The bottleneck of a gradient boosting procedure is building the decision trees. Building a traditional decision tree (as in the other GBDTs :class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`) -requires sorting the feature values of all the samples at each node (for +requires sorting the values of all the samples at each node (for each feature). Sorting is needed so that the potential gain of a split point can be computed efficiently. Splitting a single node has thus a complexity of :math:`\mathcal{O}(\text{n_features} * n \log(n))` where :math:`n` is the From 99723e14650aeb7e31dcc1cbf62e89406a708208 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Fri, 2 Aug 2019 06:45:08 -0400 Subject: [PATCH 11/14] Update doc/modules/ensemble.rst --- doc/modules/ensemble.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index 2fbe05984696d..42120318abd10 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -941,7 +941,7 @@ Why it's faster The bottleneck of a gradient boosting procedure is building the decision trees. Building a traditional decision tree (as in the other GBDTs :class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`) -requires sorting the values of all the samples at each node (for +requires sorting the samples at each node (for each feature). Sorting is needed so that the potential gain of a split point can be computed efficiently. Splitting a single node has thus a complexity of :math:`\mathcal{O}(\text{n_features} * n \log(n))` where :math:`n` is the From e288542ee2ed5b48da638179b8c88df6c6fbe28e Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Fri, 2 Aug 2019 09:34:33 -0400 Subject: [PATCH 12/14] Update doc/modules/ensemble.rst --- doc/modules/ensemble.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index 42120318abd10..0a199a3a7433f 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -893,7 +893,7 @@ Note that **early-stopping is enabled by default**. The early-stopping behaviour is controlled via the ``scoring``, ``validation_fraction``, ``n_iter_no_change``, and ``tol`` parameters. It is possible to early-stop using an arbitrary :term:`scorer`, or just the training or validation loss. By -default, early-stopping is performed using the the default :term:`scorer` of +default, early-stopping is performed using the default :term:`scorer` of the estimator on a validation set. 
Missing values support From d8705f6f7c092056d493a3e97856013f2a21d95e Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Fri, 2 Aug 2019 11:07:58 -0400 Subject: [PATCH 13/14] added where code is parallel --- doc/modules/ensemble.rst | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index eaae7f5c1462b..b38b1ff2ccf9e 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -883,8 +883,7 @@ The size of the trees can be controlled through the ``max_leaf_nodes``, The number of bins used to bin the data is controlled with the ``max_bins`` parameter. Using less bins acts as a form of regularization. It is -generally recommended to use as many bins as possible, which is the default: -255 bins. +generally recommended to use as many bins as possible, which is the default. The ``l2_regularization`` parameter is a regularizer on the loss function and corresponds to :math:`\lambda` in equation (2) of [XGBoost]_. @@ -935,6 +934,17 @@ be changed using the ``OMP_NUM_THREADS`` environment variable. By default, all available cores are used. Please refer to the OpenMP documentation for details. +The following parts are parallelized: + +- mapping samples from real values to integer-valued bins (finding the bin + thresholds is however sequential) +- building histograms is parallelized over features +- finding the best split point at a node is parallelized over features +- during fit, mapping samples into the left and right children is + parallelized over samples +- gradient and hessians computations are parallelized over samples +- predicting is parallelized over samples + Why it's faster --------------- From 8f28dda84b6f056798af7e7f7245eaefb388d8a0 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Mon, 5 Aug 2019 08:15:10 -0400 Subject: [PATCH 14/14] removed missing values section --- doc/modules/ensemble.rst | 29 ----------------------------- 1 file changed, 29 deletions(-) diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index 8a4c223999983..e1bcf47b8ff7b 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -895,35 +895,6 @@ using an arbitrary :term:`scorer`, or just the training or validation loss. By default, early-stopping is performed using the default :term:`scorer` of the estimator on a validation set. -Missing values support ----------------------- - -:class:`HistGradientBoostingClassifier` and -:class:`HistGradientBoostingRegressor` have built-in support for missing -values (NaNs). - -During training, the tree grower learns at each split point whether samples -with missing values should go to the left or right child, based on the -potential gain. When predicting, samples with missing values are assigned to -the left or right child consequently. - - -.. TODO: Add this example when missing values PR is merged (results are -.. wrong for now) - -.. from sklearn.experimental import enable_hist_gradient_boosting -.. from sklearn.ensemble import HistGradientBoostingRegressor -.. import numpy as np - -.. X = np.array([0, 1, 2, np.nan]).reshape(-1, 1) -.. y = [0, 0, 1, 1] -.. gbdt = HistGradientBoostingRegressor().fit(X, y) -.. gbdt.predict(X) - -If no missing values were encountered for a given feature during training, -then samples with missing values are mapped to whichever child has the most -samples. - Low-level parallelism ---------------------
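As a closing illustration of the usage and early-stopping behaviour documented by this patch series, the following short sketch is provided. It is not part of the patches themselves; it only relies on the experimental import shown in the guide and on parameters already named there (``max_iter``, ``max_leaf_nodes``, ``max_bins``, ``l2_regularization``, ``scoring``, ``validation_fraction``, ``n_iter_no_change``, ``tol``), and the ``n_iter_`` attribute is assumed from the estimator API::

    # Illustrative sketch (not part of the patches above): explicit
    # early-stopping configuration for a histogram-based GBDT.
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.datasets import make_hastie_10_2

    X, y = make_hastie_10_2(random_state=0)
    X_train, X_test = X[:2000], X[2000:]
    y_train, y_test = y[:2000], y[2000:]

    clf = HistGradientBoostingClassifier(
        max_iter=200,             # replaces n_estimators: max number of boosting iterations
        max_leaf_nodes=31,        # tree size, together with max_depth and min_samples_leaf
        max_bins=255,             # number of bins used to discretize X
        l2_regularization=1.0,    # lambda in equation (2) of the XGBoost paper
        scoring='loss',           # early-stop on the loss rather than on a scorer
        validation_fraction=0.1,  # fraction of training data held out for early stopping
        n_iter_no_change=5,       # stop after 5 iterations without improvement
        tol=1e-7,
    )
    clf.fit(X_train, y_train)

    # If early stopping was triggered, n_iter_ is smaller than max_iter.
    print(clf.n_iter_, clf.score(X_test, y_test))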
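The complexity argument in the "Why it's faster" section can also be checked empirically. The sketch below is likewise illustrative rather than part of the patches; it times the fit of both implementations on the same data, and the observed speed-up will depend on the machine, on ``n_samples``, and on whether early stopping (enabled by default for the histogram-based estimator) kicks in::

    # Illustrative timing sketch: exact-split GBDT vs. histogram-based GBDT.
    from time import perf_counter

    from sklearn.experimental import enable_hist_gradient_boosting  # noqa
    from sklearn.ensemble import (GradientBoostingClassifier,
                                  HistGradientBoostingClassifier)
    from sklearn.datasets import make_hastie_10_2

    # Tens of thousands of samples: the regime where binning pays off.
    X, y = make_hastie_10_2(n_samples=50000, random_state=0)

    for est in (GradientBoostingClassifier(n_estimators=100),
                HistGradientBoostingClassifier(max_iter=100)):
        tic = perf_counter()
        est.fit(X, y)
        print(type(est).__name__, '%.1fs' % (perf_counter() - tic))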