diff --git a/doc/about.rst b/doc/about.rst index e462963135b58..2ef0718b92f7e 100644 --- a/doc/about.rst +++ b/doc/about.rst @@ -96,44 +96,44 @@ Citing scikit-learn If you use scikit-learn in a scientific publication, we would appreciate citations to the following paper: - `Scikit-learn: Machine Learning in Python - `_, Pedregosa - *et al.*, JMLR 12, pp. 2825-2830, 2011. - - Bibtex entry:: - - @article{scikit-learn, - title={Scikit-learn: Machine Learning in {P}ython}, - author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. - and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. - and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and - Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.}, - journal={Journal of Machine Learning Research}, - volume={12}, - pages={2825--2830}, - year={2011} - } +`Scikit-learn: Machine Learning in Python +`_, Pedregosa +*et al.*, JMLR 12, pp. 2825-2830, 2011. + +Bibtex entry:: + + @article{scikit-learn, + title={Scikit-learn: Machine Learning in {P}ython}, + author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. + and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. + and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and + Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.}, + journal={Journal of Machine Learning Research}, + volume={12}, + pages={2825--2830}, + year={2011} + } If you want to cite scikit-learn for its API or design, you may also want to consider the following paper: - :arxiv:`API design for machine learning software: experiences from the scikit-learn - project <1309.0238>`, Buitinck *et al.*, 2013. - - Bibtex entry:: - - @inproceedings{sklearn_api, - author = {Lars Buitinck and Gilles Louppe and Mathieu Blondel and - Fabian Pedregosa and Andreas Mueller and Olivier Grisel and - Vlad Niculae and Peter Prettenhofer and Alexandre Gramfort - and Jaques Grobler and Robert Layton and Jake VanderPlas and - Arnaud Joly and Brian Holt and Ga{\"{e}}l Varoquaux}, - title = {{API} design for machine learning software: experiences from the scikit-learn - project}, - booktitle = {ECML PKDD Workshop: Languages for Data Mining and Machine Learning}, - year = {2013}, - pages = {108--122}, - } +:arxiv:`API design for machine learning software: experiences from the scikit-learn +project <1309.0238>`, Buitinck *et al.*, 2013. + +Bibtex entry:: + + @inproceedings{sklearn_api, + author = {Lars Buitinck and Gilles Louppe and Mathieu Blondel and + Fabian Pedregosa and Andreas Mueller and Olivier Grisel and + Vlad Niculae and Peter Prettenhofer and Alexandre Gramfort + and Jaques Grobler and Robert Layton and Jake VanderPlas and + Arnaud Joly and Brian Holt and Ga{\"{e}}l Varoquaux}, + title = {{API} design for machine learning software: experiences from the scikit-learn + project}, + booktitle = {ECML PKDD Workshop: Languages for Data Mining and Machine Learning}, + year = {2013}, + pages = {108--122}, + } Artwork ------- diff --git a/doc/computing/computational_performance.rst b/doc/computing/computational_performance.rst index dd5720630c377..d6864689502c2 100644 --- a/doc/computing/computational_performance.rst +++ b/doc/computing/computational_performance.rst @@ -39,10 +39,11 @@ machine learning toolkit is the latency at which predictions can be made in a production environment. The main factors that influence the prediction latency are - 1. Number of features - 2. Input data representation and sparsity - 3. Model complexity - 4. 
Feature extraction + +1. Number of features +2. Input data representation and sparsity +3. Model complexity +4. Feature extraction A last major parameter is also the possibility to do predictions in bulk or one-at-a-time mode. @@ -224,9 +225,9 @@ files, tokenizing the text and hashing it into a common vector space) is taking 100 to 500 times more time than the actual prediction code, depending on the chosen model. - .. |prediction_time| image:: ../auto_examples/applications/images/sphx_glr_plot_out_of_core_classification_004.png - :target: ../auto_examples/applications/plot_out_of_core_classification.html - :scale: 80 +.. |prediction_time| image:: ../auto_examples/applications/images/sphx_glr_plot_out_of_core_classification_004.png + :target: ../auto_examples/applications/plot_out_of_core_classification.html + :scale: 80 .. centered:: |prediction_time| @@ -283,10 +284,11 @@ scikit-learn install with the following command:: python -c "import sklearn; sklearn.show_versions()" Optimized BLAS / LAPACK implementations include: - - Atlas (need hardware specific tuning by rebuilding on the target machine) - - OpenBLAS - - MKL - - Apple Accelerate and vecLib frameworks (OSX only) + +- Atlas (need hardware specific tuning by rebuilding on the target machine) +- OpenBLAS +- MKL +- Apple Accelerate and vecLib frameworks (OSX only) More information can be found on the `NumPy install page `_ and in this @@ -364,5 +366,5 @@ sufficient to not generate the relevant features, leaving their columns empty. Links ...... - - :ref:`scikit-learn developer performance documentation ` - - `Scipy sparse matrix formats documentation `_ +- :ref:`scikit-learn developer performance documentation ` +- `Scipy sparse matrix formats documentation `_ diff --git a/doc/computing/parallelism.rst b/doc/computing/parallelism.rst index 0cd02ab5a0449..0fcbf00cd6c04 100644 --- a/doc/computing/parallelism.rst +++ b/doc/computing/parallelism.rst @@ -87,15 +87,15 @@ will use as many threads as possible, i.e. as many threads as logical cores. You can control the exact number of threads that are used either: - - via the ``OMP_NUM_THREADS`` environment variable, for instance when: - running a python script: +- via the ``OMP_NUM_THREADS`` environment variable, for instance when: + running a python script: - .. prompt:: bash $ + .. prompt:: bash $ - OMP_NUM_THREADS=4 python my_script.py + OMP_NUM_THREADS=4 python my_script.py - - or via `threadpoolctl` as explained by `this piece of documentation - `_. +- or via `threadpoolctl` as explained by `this piece of documentation + `_. Parallel NumPy and SciPy routines from numerical libraries .......................................................... @@ -107,15 +107,15 @@ such as MKL, OpenBLAS or BLIS. You can control the exact number of threads used by BLAS for each library using environment variables, namely: - - ``MKL_NUM_THREADS`` sets the number of thread MKL uses, - - ``OPENBLAS_NUM_THREADS`` sets the number of threads OpenBLAS uses - - ``BLIS_NUM_THREADS`` sets the number of threads BLIS uses +- ``MKL_NUM_THREADS`` sets the number of thread MKL uses, +- ``OPENBLAS_NUM_THREADS`` sets the number of threads OpenBLAS uses +- ``BLIS_NUM_THREADS`` sets the number of threads BLIS uses Note that BLAS & LAPACK implementations can also be impacted by `OMP_NUM_THREADS`. 
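+As a complement to these environment variables, here is a minimal sketch
+(assuming the `threadpoolctl` package is installed) of inspecting and
+limiting the native thread pools directly from Python:
+
+.. code-block:: python
+
+    from threadpoolctl import threadpool_info, threadpool_limits
+
+    # List the native thread pools (OpenMP, OpenBLAS, MKL, ...) loaded in
+    # the current process, together with the number of threads each uses.
+    print(threadpool_info())
+
+    # Temporarily cap the number of threads used by BLAS implementations.
+    with threadpool_limits(limits=2, user_api="blas"):
+        ...  # BLAS calls issued here use at most 2 threads
+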
To check whether this is the case in your environment, you can inspect how the number of threads effectively used by those libraries is affected when running the following command in a bash or zsh terminal -for different values of `OMP_NUM_THREADS`:: +for different values of `OMP_NUM_THREADS`: .. prompt:: bash $ diff --git a/doc/computing/scaling_strategies.rst b/doc/computing/scaling_strategies.rst index 277d499f4cc13..143643131b0e8 100644 --- a/doc/computing/scaling_strategies.rst +++ b/doc/computing/scaling_strategies.rst @@ -20,9 +20,9 @@ data that cannot fit in a computer's main memory (RAM). Here is a sketch of a system designed to achieve this goal: - 1. a way to stream instances - 2. a way to extract features from instances - 3. an incremental algorithm +1. a way to stream instances +2. a way to extract features from instances +3. an incremental algorithm Streaming instances .................... @@ -62,29 +62,29 @@ balances relevancy and memory footprint could involve some tuning [1]_. Here is a list of incremental estimators for different tasks: - - Classification - + :class:`sklearn.naive_bayes.MultinomialNB` - + :class:`sklearn.naive_bayes.BernoulliNB` - + :class:`sklearn.linear_model.Perceptron` - + :class:`sklearn.linear_model.SGDClassifier` - + :class:`sklearn.linear_model.PassiveAggressiveClassifier` - + :class:`sklearn.neural_network.MLPClassifier` - - Regression - + :class:`sklearn.linear_model.SGDRegressor` - + :class:`sklearn.linear_model.PassiveAggressiveRegressor` - + :class:`sklearn.neural_network.MLPRegressor` - - Clustering - + :class:`sklearn.cluster.MiniBatchKMeans` - + :class:`sklearn.cluster.Birch` - - Decomposition / feature Extraction - + :class:`sklearn.decomposition.MiniBatchDictionaryLearning` - + :class:`sklearn.decomposition.IncrementalPCA` - + :class:`sklearn.decomposition.LatentDirichletAllocation` - + :class:`sklearn.decomposition.MiniBatchNMF` - - Preprocessing - + :class:`sklearn.preprocessing.StandardScaler` - + :class:`sklearn.preprocessing.MinMaxScaler` - + :class:`sklearn.preprocessing.MaxAbsScaler` +- Classification + + :class:`sklearn.naive_bayes.MultinomialNB` + + :class:`sklearn.naive_bayes.BernoulliNB` + + :class:`sklearn.linear_model.Perceptron` + + :class:`sklearn.linear_model.SGDClassifier` + + :class:`sklearn.linear_model.PassiveAggressiveClassifier` + + :class:`sklearn.neural_network.MLPClassifier` +- Regression + + :class:`sklearn.linear_model.SGDRegressor` + + :class:`sklearn.linear_model.PassiveAggressiveRegressor` + + :class:`sklearn.neural_network.MLPRegressor` +- Clustering + + :class:`sklearn.cluster.MiniBatchKMeans` + + :class:`sklearn.cluster.Birch` +- Decomposition / feature Extraction + + :class:`sklearn.decomposition.MiniBatchDictionaryLearning` + + :class:`sklearn.decomposition.IncrementalPCA` + + :class:`sklearn.decomposition.LatentDirichletAllocation` + + :class:`sklearn.decomposition.MiniBatchNMF` +- Preprocessing + + :class:`sklearn.preprocessing.StandardScaler` + + :class:`sklearn.preprocessing.MinMaxScaler` + + :class:`sklearn.preprocessing.MaxAbsScaler` For classification, a somewhat important thing to note is that although a stateless feature extraction routine may be able to cope with new/unseen diff --git a/doc/developers/bug_triaging.rst b/doc/developers/bug_triaging.rst index 3ec628f7e5867..915ea0a9a22b7 100644 --- a/doc/developers/bug_triaging.rst +++ b/doc/developers/bug_triaging.rst @@ -19,18 +19,18 @@ A third party can give useful feedback or even add comments on the issue. 
The following actions are typically useful: - - documenting issues that are missing elements to reproduce the problem - such as code samples +- documenting issues that are missing elements to reproduce the problem + such as code samples - - suggesting better use of code formatting +- suggesting better use of code formatting - - suggesting to reformulate the title and description to make them more - explicit about the problem to be solved +- suggesting to reformulate the title and description to make them more + explicit about the problem to be solved - - linking to related issues or discussions while briefly describing how - they are related, for instance "See also #xyz for a similar attempt - at this" or "See also #xyz where the same thing happened in - SomeEstimator" provides context and helps the discussion. +- linking to related issues or discussions while briefly describing how + they are related, for instance "See also #xyz for a similar attempt + at this" or "See also #xyz where the same thing happened in + SomeEstimator" provides context and helps the discussion. .. topic:: Fruitful discussions diff --git a/doc/developers/contributing.rst b/doc/developers/contributing.rst index 02e02eb485e8a..26f952b543a03 100644 --- a/doc/developers/contributing.rst +++ b/doc/developers/contributing.rst @@ -291,7 +291,7 @@ The next steps now describe the process of modifying code and submitting a PR: 9. Create a feature branch to hold your development changes: - .. prompt:: bash $ + .. prompt:: bash $ git checkout -b my_feature @@ -529,25 +529,25 @@ Continuous Integration (CI) Please note that if one of the following markers appear in the latest commit message, the following actions are taken. - ====================== =================== - Commit Message Marker Action Taken by CI - ---------------------- ------------------- - [ci skip] CI is skipped completely - [cd build] CD is run (wheels and source distribution are built) - [cd build gh] CD is run only for GitHub Actions - [cd build cirrus] CD is run only for Cirrus CI - [lint skip] Azure pipeline skips linting - [scipy-dev] Build & test with our dependencies (numpy, scipy, etc.) development builds - [nogil] Build & test with the nogil experimental branches of CPython, Cython, NumPy, SciPy, ... - [pypy] Build & test with PyPy - [pyodide] Build & test with Pyodide - [azure parallel] Run Azure CI jobs in parallel - [cirrus arm] Run Cirrus CI ARM test - [float32] Run float32 tests by setting `SKLEARN_RUN_FLOAT32_TESTS=1`. See :ref:`environment_variable` for more details - [doc skip] Docs are not built - [doc quick] Docs built, but excludes example gallery plots - [doc build] Docs built including example gallery plots (very long) - ====================== =================== +====================== =================== +Commit Message Marker Action Taken by CI +---------------------- ------------------- +[ci skip] CI is skipped completely +[cd build] CD is run (wheels and source distribution are built) +[cd build gh] CD is run only for GitHub Actions +[cd build cirrus] CD is run only for Cirrus CI +[lint skip] Azure pipeline skips linting +[scipy-dev] Build & test with our dependencies (numpy, scipy, etc.) development builds +[nogil] Build & test with the nogil experimental branches of CPython, Cython, NumPy, SciPy, ... +[pypy] Build & test with PyPy +[pyodide] Build & test with Pyodide +[azure parallel] Run Azure CI jobs in parallel +[cirrus arm] Run Cirrus CI ARM test +[float32] Run float32 tests by setting `SKLEARN_RUN_FLOAT32_TESTS=1`. 
See :ref:`environment_variable` for more details +[doc skip] Docs are not built +[doc quick] Docs built, but excludes example gallery plots +[doc build] Docs built including example gallery plots (very long) +====================== =================== Note that, by default, the documentation is built but only the examples that are directly modified by the pull request are executed. @@ -713,30 +713,30 @@ We are glad to accept any sort of documentation: In general have the following in mind: - * Use Python basic types. (``bool`` instead of ``boolean``) - * Use parenthesis for defining shapes: ``array-like of shape (n_samples,)`` - or ``array-like of shape (n_samples, n_features)`` - * For strings with multiple options, use brackets: ``input: {'log', - 'squared', 'multinomial'}`` - * 1D or 2D data can be a subset of ``{array-like, ndarray, sparse matrix, - dataframe}``. Note that ``array-like`` can also be a ``list``, while - ``ndarray`` is explicitly only a ``numpy.ndarray``. - * Specify ``dataframe`` when "frame-like" features are being used, such as - the column names. - * When specifying the data type of a list, use ``of`` as a delimiter: ``list - of int``. When the parameter supports arrays giving details about the - shape and/or data type and a list of such arrays, you can use one of - ``array-like of shape (n_samples,) or list of such arrays``. - * When specifying the dtype of an ndarray, use e.g. ``dtype=np.int32`` after - defining the shape: ``ndarray of shape (n_samples,), dtype=np.int32``. You - can specify multiple dtype as a set: ``array-like of shape (n_samples,), - dtype={np.float64, np.float32}``. If one wants to mention arbitrary - precision, use `integral` and `floating` rather than the Python dtype - `int` and `float`. When both `int` and `floating` are supported, there is - no need to specify the dtype. - * When the default is ``None``, ``None`` only needs to be specified at the - end with ``default=None``. Be sure to include in the docstring, what it - means for the parameter or attribute to be ``None``. + * Use Python basic types. (``bool`` instead of ``boolean``) + * Use parenthesis for defining shapes: ``array-like of shape (n_samples,)`` + or ``array-like of shape (n_samples, n_features)`` + * For strings with multiple options, use brackets: ``input: {'log', + 'squared', 'multinomial'}`` + * 1D or 2D data can be a subset of ``{array-like, ndarray, sparse matrix, + dataframe}``. Note that ``array-like`` can also be a ``list``, while + ``ndarray`` is explicitly only a ``numpy.ndarray``. + * Specify ``dataframe`` when "frame-like" features are being used, such as + the column names. + * When specifying the data type of a list, use ``of`` as a delimiter: ``list + of int``. When the parameter supports arrays giving details about the + shape and/or data type and a list of such arrays, you can use one of + ``array-like of shape (n_samples,) or list of such arrays``. + * When specifying the dtype of an ndarray, use e.g. ``dtype=np.int32`` after + defining the shape: ``ndarray of shape (n_samples,), dtype=np.int32``. You + can specify multiple dtype as a set: ``array-like of shape (n_samples,), + dtype={np.float64, np.float32}``. If one wants to mention arbitrary + precision, use `integral` and `floating` rather than the Python dtype + `int` and `float`. When both `int` and `floating` are supported, there is + no need to specify the dtype. + * When the default is ``None``, ``None`` only needs to be specified at the + end with ``default=None``. 
Be sure to include in the docstring, what it + means for the parameter or attribute to be ``None``. * Add "See Also" in docstrings for related classes/functions. @@ -809,15 +809,15 @@ details, and give intuition to the reader on what the algorithm does. * Information that can be hidden by default using dropdowns is: - * low hierarchy sections such as `References`, `Properties`, etc. (see for - instance the subsections in :ref:`det_curve`); + * low hierarchy sections such as `References`, `Properties`, etc. (see for + instance the subsections in :ref:`det_curve`); - * in-depth mathematical details; + * in-depth mathematical details; - * narrative that is use-case specific; + * narrative that is use-case specific; - * in general, narrative that may only interest users that want to go beyond - the pragmatics of a given tool. + * in general, narrative that may only interest users that want to go beyond + the pragmatics of a given tool. * Do not use dropdowns for the low level section `Examples`, as it should stay visible to all users. Make sure that the `Examples` section comes right after diff --git a/doc/developers/cython.rst b/doc/developers/cython.rst index 8558169848052..e98501879d50e 100644 --- a/doc/developers/cython.rst +++ b/doc/developers/cython.rst @@ -58,13 +58,13 @@ Tips to ease development * You might find this alias to compile individual Cython extension handy: - .. code-block:: + .. code-block:: - # You might want to add this alias to your shell script config. - alias cythonX="cython -X language_level=3 -X boundscheck=False -X wraparound=False -X initializedcheck=False -X nonecheck=False -X cdivision=True" + # You might want to add this alias to your shell script config. + alias cythonX="cython -X language_level=3 -X boundscheck=False -X wraparound=False -X initializedcheck=False -X nonecheck=False -X cdivision=True" - # This generates `source.c` as if you had recompiled scikit-learn entirely. - cythonX --annotate source.pyx + # This generates `source.c` as if you had recompiled scikit-learn entirely. + cythonX --annotate source.pyx * Using the ``--annotate`` option with this flag allows generating a HTML report of code annotation. This report indicates interactions with the CPython interpreter on a line-by-line basis. @@ -72,10 +72,10 @@ Tips to ease development the computationally intensive sections of the algorithms. For more information, please refer to `this section of Cython's tutorial `_ - .. code-block:: + .. code-block:: - # This generates a HTML report (`source.html`) for `source.c`. - cythonX --annotate source.pyx + # This generates a HTML report (`source.html`) for `source.c`. + cythonX --annotate source.pyx Tips for performance ^^^^^^^^^^^^^^^^^^^^ @@ -107,16 +107,16 @@ Tips for performance the GIL when entering them. You have to do that yourself either by passing ``nogil=True`` to ``cython.parallel.prange`` explicitly, or by using an explicit context manager: - .. code-block:: cython + .. code-block:: cython - cdef inline void my_func(self) nogil: + cdef inline void my_func(self) nogil: - # Some logic interacting with CPython, e.g. allocating arrays via NumPy. + # Some logic interacting with CPython, e.g. allocating arrays via NumPy. - with nogil: - # The code here is run as is it were written in C. + with nogil: + # The code here is run as is it were written in C. 
- return 0 + return 0 This item is based on `this comment from Stéfan's Benhel `_ diff --git a/doc/developers/maintainer.rst b/doc/developers/maintainer.rst index d2a1d21523f78..048ad5d9906a1 100644 --- a/doc/developers/maintainer.rst +++ b/doc/developers/maintainer.rst @@ -81,16 +81,16 @@ tag under that branch. This is done only once, as the major and minor releases happen on the same branch: - .. prompt:: bash $ +.. prompt:: bash $ - # Assuming upstream is an alias for the main scikit-learn repo: - git fetch upstream main - git checkout upstream/main - git checkout -b 0.99.X - git push --set-upstream upstream 0.99.X + # Assuming upstream is an alias for the main scikit-learn repo: + git fetch upstream main + git checkout upstream/main + git checkout -b 0.99.X + git push --set-upstream upstream 0.99.X - Again, `X` is literal here, and `99` is replaced by the release number. - The branches are called ``0.19.X``, ``0.20.X``, etc. +Again, `X` is literal here, and `99` is replaced by the release number. +The branches are called ``0.19.X``, ``0.20.X``, etc. In terms of including changes, the first RC ideally counts as a *feature freeze*. Each coming release candidate and the final release afterwards will @@ -121,67 +121,67 @@ The minor releases should include bug fixes and some relevant documentation changes only. Any PR resulting in a behavior change which is not a bug fix should be excluded. As an example, instructions are given for the `1.2.2` release. - - Create a branch, **on your own fork** (here referred to as `fork`) for the release - from `upstream/main`. +- Create a branch, **on your own fork** (here referred to as `fork`) for the release + from `upstream/main`. - .. prompt:: bash $ + .. prompt:: bash $ - git fetch upstream/main - git checkout -b release-1.2.2 upstream/main - git push -u fork release-1.2.2:release-1.2.2 + git fetch upstream/main + git checkout -b release-1.2.2 upstream/main + git push -u fork release-1.2.2:release-1.2.2 - - Create a **draft** PR to the `upstream/1.2.X` branch (not to `upstream/main`) - with all the desired changes. +- Create a **draft** PR to the `upstream/1.2.X` branch (not to `upstream/main`) + with all the desired changes. - - Do not push anything on that branch yet. +- Do not push anything on that branch yet. - - Locally rebase `release-1.2.2` from the `upstream/1.2.X` branch using: +- Locally rebase `release-1.2.2` from the `upstream/1.2.X` branch using: - .. prompt:: bash $ + .. prompt:: bash $ - git rebase -i upstream/1.2.X + git rebase -i upstream/1.2.X - This will open an interactive rebase with the `git-rebase-todo` containing all - the latest commit on `main`. At this stage, you have to perform - this interactive rebase with at least someone else (being three people rebasing - is better not to forget something and to avoid any doubt). + This will open an interactive rebase with the `git-rebase-todo` containing all + the latest commit on `main`. At this stage, you have to perform + this interactive rebase with at least someone else (being three people rebasing + is better not to forget something and to avoid any doubt). - - **Do not remove lines but drop commit by replace** ``pick`` **with** ``drop`` + - **Do not remove lines but drop commit by replace** ``pick`` **with** ``drop`` - - Commits to pick for bug-fix release *generally* are prefixed with: `FIX`, `CI`, - `DOC`. They should at least include all the commits of the merged PRs - that were milestoned for this release on GitHub and/or documented as such in - the changelog. 
It's likely that some bugfixes were documented in the - changelog of the main major release instead of the next bugfix release, - in which case, the matching changelog entries will need to be moved, - first in the `main` branch then backported in the release PR. + - Commits to pick for bug-fix release *generally* are prefixed with: `FIX`, `CI`, + `DOC`. They should at least include all the commits of the merged PRs + that were milestoned for this release on GitHub and/or documented as such in + the changelog. It's likely that some bugfixes were documented in the + changelog of the main major release instead of the next bugfix release, + in which case, the matching changelog entries will need to be moved, + first in the `main` branch then backported in the release PR. - - Commits to drop for bug-fix release *generally* are prefixed with: `FEAT`, - `MAINT`, `ENH`, `API`. Reasons for not including them is to prevent change of - behavior (which only must feature in breaking or major releases). + - Commits to drop for bug-fix release *generally* are prefixed with: `FEAT`, + `MAINT`, `ENH`, `API`. Reasons for not including them is to prevent change of + behavior (which only must feature in breaking or major releases). - - After having dropped or picked commit, **do no exit** but paste the content - of the `git-rebase-todo` message in the PR. - This file is located at `.git/rebase-merge/git-rebase-todo`. + - After having dropped or picked commit, **do no exit** but paste the content + of the `git-rebase-todo` message in the PR. + This file is located at `.git/rebase-merge/git-rebase-todo`. - - Save and exit, starting the interactive rebase. + - Save and exit, starting the interactive rebase. - - Resolve merge conflicts when they happen. + - Resolve merge conflicts when they happen. - - Force push the result of the rebase and the extra release commits to the release PR: +- Force push the result of the rebase and the extra release commits to the release PR: - .. prompt:: bash $ + .. prompt:: bash $ - git push -f fork release-1.2.2:release-1.2.2 + git push -f fork release-1.2.2:release-1.2.2 - - Copy the :ref:`release_checklist` template and paste it in the description of the - Pull Request to track progress. +- Copy the :ref:`release_checklist` template and paste it in the description of the + Pull Request to track progress. - - Review all the commits included in the release to make sure that they do not - introduce any new feature. We should not blindly trust the commit message prefixes. +- Review all the commits included in the release to make sure that they do not + introduce any new feature. We should not blindly trust the commit message prefixes. - - Remove the draft status of the release PR and invite other maintainers to review the - list of included commits. +- Remove the draft status of the release PR and invite other maintainers to review the + list of included commits. .. _making_a_release: diff --git a/doc/developers/minimal_reproducer.rst b/doc/developers/minimal_reproducer.rst index 2cc82d083aaf1..b100bccbaa6b4 100644 --- a/doc/developers/minimal_reproducer.rst +++ b/doc/developers/minimal_reproducer.rst @@ -88,9 +88,9 @@ The following code, while **still not minimal**, is already **much better** because it can be copy-pasted in a Python terminal to reproduce the problem in one step. In particular: - - it contains **all necessary imports statements**; - - it can fetch the public dataset without having to manually download a - file and put it in the expected location on the disk. 
+- it contains **all necessary imports statements**; +- it can fetch the public dataset without having to manually download a + file and put it in the expected location on the disk. **Improved example** @@ -199,21 +199,21 @@ As already mentioned, the key to communication is the readability of the code and good formatting can really be a plus. Notice that in the previous snippet we: - - try to limit all lines to a maximum of 79 characters to avoid horizontal - scrollbars in the code snippets blocks rendered on the GitHub issue; - - use blank lines to separate groups of related functions; - - place all the imports in their own group at the beginning. +- try to limit all lines to a maximum of 79 characters to avoid horizontal + scrollbars in the code snippets blocks rendered on the GitHub issue; +- use blank lines to separate groups of related functions; +- place all the imports in their own group at the beginning. The simplification steps presented in this guide can be implemented in a different order than the progression we have shown here. The important points are: - - a minimal reproducer should be runnable by a simple copy-and-paste in a - python terminal; - - it should be simplified as much as possible by removing any code steps - that are not strictly needed to reproducing the original problem; - - it should ideally only rely on a minimal dataset generated on-the-fly by - running the code instead of relying on external data, if possible. +- a minimal reproducer should be runnable by a simple copy-and-paste in a + python terminal; +- it should be simplified as much as possible by removing any code steps + that are not strictly needed to reproducing the original problem; +- it should ideally only rely on a minimal dataset generated on-the-fly by + running the code instead of relying on external data, if possible. Use markdown formatting @@ -305,50 +305,50 @@ can be used to create dummy numeric data. - regression - Regressions take continuous numeric data as features and target. + Regressions take continuous numeric data as features and target. - .. code-block:: python + .. code-block:: python - import numpy as np + import numpy as np - rng = np.random.RandomState(0) - n_samples, n_features = 5, 5 - X = rng.randn(n_samples, n_features) - y = rng.randn(n_samples) + rng = np.random.RandomState(0) + n_samples, n_features = 5, 5 + X = rng.randn(n_samples, n_features) + y = rng.randn(n_samples) A similar snippet can be used as synthetic data when testing scaling tools such as :class:`sklearn.preprocessing.StandardScaler`. - classification - If the bug is not raised during when encoding a categorical variable, you can - feed numeric data to a classifier. Just remember to ensure that the target - is indeed an integer. + If the bug is not raised during when encoding a categorical variable, you can + feed numeric data to a classifier. Just remember to ensure that the target + is indeed an integer. - .. code-block:: python + .. code-block:: python - import numpy as np + import numpy as np - rng = np.random.RandomState(0) - n_samples, n_features = 5, 5 - X = rng.randn(n_samples, n_features) - y = rng.randint(0, 2, n_samples) # binary target with values in {0, 1} + rng = np.random.RandomState(0) + n_samples, n_features = 5, 5 + X = rng.randn(n_samples, n_features) + y = rng.randint(0, 2, n_samples) # binary target with values in {0, 1} - If the bug only happens with non-numeric class labels, you might want to - generate a random target with `numpy.random.choice - `_. 
+ If the bug only happens with non-numeric class labels, you might want to + generate a random target with `numpy.random.choice + `_. - .. code-block:: python + .. code-block:: python - import numpy as np + import numpy as np - rng = np.random.RandomState(0) - n_samples, n_features = 50, 5 - X = rng.randn(n_samples, n_features) - y = np.random.choice( - ["male", "female", "other"], size=n_samples, p=[0.49, 0.49, 0.02] - ) + rng = np.random.RandomState(0) + n_samples, n_features = 50, 5 + X = rng.randn(n_samples, n_features) + y = np.random.choice( + ["male", "female", "other"], size=n_samples, p=[0.49, 0.49, 0.02] + ) Pandas ------ diff --git a/doc/developers/performance.rst b/doc/developers/performance.rst index 287262255535f..42687945a2bba 100644 --- a/doc/developers/performance.rst +++ b/doc/developers/performance.rst @@ -46,31 +46,31 @@ Sometimes however an algorithm cannot be expressed efficiently in simple vectorized Numpy code. In this case, the recommended strategy is the following: - 1. **Profile** the Python implementation to find the main bottleneck and - isolate it in a **dedicated module level function**. This function - will be reimplemented as a compiled extension module. - - 2. If there exists a well maintained BSD or MIT **C/C++** implementation - of the same algorithm that is not too big, you can write a - **Cython wrapper** for it and include a copy of the source code - of the library in the scikit-learn source tree: this strategy is - used for the classes :class:`svm.LinearSVC`, :class:`svm.SVC` and - :class:`linear_model.LogisticRegression` (wrappers for liblinear - and libsvm). - - 3. Otherwise, write an optimized version of your Python function using - **Cython** directly. This strategy is used - for the :class:`linear_model.ElasticNet` and - :class:`linear_model.SGDClassifier` classes for instance. - - 4. **Move the Python version of the function in the tests** and use - it to check that the results of the compiled extension are consistent - with the gold standard, easy to debug Python version. - - 5. Once the code is optimized (not simple bottleneck spottable by - profiling), check whether it is possible to have **coarse grained - parallelism** that is amenable to **multi-processing** by using the - ``joblib.Parallel`` class. +1. **Profile** the Python implementation to find the main bottleneck and + isolate it in a **dedicated module level function**. This function + will be reimplemented as a compiled extension module. + +2. If there exists a well maintained BSD or MIT **C/C++** implementation + of the same algorithm that is not too big, you can write a + **Cython wrapper** for it and include a copy of the source code + of the library in the scikit-learn source tree: this strategy is + used for the classes :class:`svm.LinearSVC`, :class:`svm.SVC` and + :class:`linear_model.LogisticRegression` (wrappers for liblinear + and libsvm). + +3. Otherwise, write an optimized version of your Python function using + **Cython** directly. This strategy is used + for the :class:`linear_model.ElasticNet` and + :class:`linear_model.SGDClassifier` classes for instance. + +4. **Move the Python version of the function in the tests** and use + it to check that the results of the compiled extension are consistent + with the gold standard, easy to debug Python version. + +5. 
Once the code is optimized (not simple bottleneck spottable by + profiling), check whether it is possible to have **coarse grained + parallelism** that is amenable to **multi-processing** by using the + ``joblib.Parallel`` class. When using Cython, use either @@ -187,7 +187,7 @@ us install ``line_profiler`` and wire it to IPython: pip install line_profiler -- **Under IPython 0.13+**, first create a configuration profile: +**Under IPython 0.13+**, first create a configuration profile: .. prompt:: bash $ @@ -265,7 +265,7 @@ install the latest version: Then, setup the magics in a manner similar to ``line_profiler``. -- **Under IPython 0.11+**, first create a configuration profile: +**Under IPython 0.11+**, first create a configuration profile: .. prompt:: bash $ diff --git a/doc/developers/tips.rst b/doc/developers/tips.rst index 3d42626126f8a..f8537236c32d8 100644 --- a/doc/developers/tips.rst +++ b/doc/developers/tips.rst @@ -73,27 +73,25 @@ will run all :term:`common tests` for the ``LogisticRegression`` estimator. When a unit test fails, the following tricks can make debugging easier: - 1. The command line argument ``pytest -l`` instructs pytest to print the local - variables when a failure occurs. +1. The command line argument ``pytest -l`` instructs pytest to print the local + variables when a failure occurs. - 2. The argument ``pytest --pdb`` drops into the Python debugger on failure. To - instead drop into the rich IPython debugger ``ipdb``, you may set up a - shell alias to: +2. The argument ``pytest --pdb`` drops into the Python debugger on failure. To + instead drop into the rich IPython debugger ``ipdb``, you may set up a + shell alias to: -.. prompt:: bash $ + .. prompt:: bash $ - pytest --pdbcls=IPython.terminal.debugger:TerminalPdb --capture no + pytest --pdbcls=IPython.terminal.debugger:TerminalPdb --capture no Other `pytest` options that may become useful include: - - ``-x`` which exits on the first failed test - - ``--lf`` to rerun the tests that failed on the previous run - - ``--ff`` to rerun all previous tests, running the ones that failed first - - ``-s`` so that pytest does not capture the output of ``print()`` - statements - - ``--tb=short`` or ``--tb=line`` to control the length of the logs - - ``--runxfail`` also run tests marked as a known failure (XFAIL) and report - errors. +- ``-x`` which exits on the first failed test, +- ``--lf`` to rerun the tests that failed on the previous run, +- ``--ff`` to rerun all previous tests, running the ones that failed first, +- ``-s`` so that pytest does not capture the output of ``print()`` statements, +- ``--tb=short`` or ``--tb=line`` to control the length of the logs, +- ``--runxfail`` also run tests marked as a known failure (XFAIL) and report errors. Since our continuous integration tests will error if ``FutureWarning`` isn't properly caught, @@ -114,113 +112,135 @@ replies `_ for reviewing: Note that putting this content on a single line in a literal is the easiest way to make it copyable and wrapped on screen. Issue: Usage questions - :: - You are asking a usage question. The issue tracker is for bugs and new features. For usage questions, it is recommended to try [Stack Overflow](https://stackoverflow.com/questions/tagged/scikit-learn) or [the Mailing List](https://mail.python.org/mailman/listinfo/scikit-learn). +:: + + You are asking a usage question. The issue tracker is for bugs and new features. 
For usage questions, it is recommended to try [Stack Overflow](https://stackoverflow.com/questions/tagged/scikit-learn) or [the Mailing List](https://mail.python.org/mailman/listinfo/scikit-learn). - Unfortunately, we need to close this issue as this issue tracker is a communication tool used for the development of scikit-learn. The additional activity created by usage questions crowds it too much and impedes this development. The conversation can continue here, however there is no guarantee that is will receive attention from core developers. + Unfortunately, we need to close this issue as this issue tracker is a communication tool used for the development of scikit-learn. The additional activity created by usage questions crowds it too much and impedes this development. The conversation can continue here, however there is no guarantee that is will receive attention from core developers. Issue: You're welcome to update the docs - :: - Please feel free to offer a pull request updating the documentation if you feel it could be improved. +:: + + Please feel free to offer a pull request updating the documentation if you feel it could be improved. Issue: Self-contained example for bug - :: - Please provide [self-contained example code](https://scikit-learn.org/dev/developers/minimal_reproducer.html), including imports and data (if possible), so that other contributors can just run it and reproduce your issue. Ideally your example code should be minimal. +:: + + Please provide [self-contained example code](https://scikit-learn.org/dev/developers/minimal_reproducer.html), including imports and data (if possible), so that other contributors can just run it and reproduce your issue. Ideally your example code should be minimal. Issue: Software versions - :: - To help diagnose your issue, please paste the output of: - ```py - import sklearn; sklearn.show_versions() - ``` - Thanks. +:: + + To help diagnose your issue, please paste the output of: + ```py + import sklearn; sklearn.show_versions() + ``` + Thanks. Issue: Code blocks - :: - Readability can be greatly improved if you [format](https://help.github.com/articles/creating-and-highlighting-code-blocks/) your code snippets and complete error messages appropriately. For example: +:: + + Readability can be greatly improved if you [format](https://help.github.com/articles/creating-and-highlighting-code-blocks/) your code snippets and complete error messages appropriately. For example: - ```python - print(something) - ``` - generates: ```python print(something) ``` - And: - - ```pytb - Traceback (most recent call last): - File "", line 1, in - ImportError: No module named 'hello' - ``` - generates: + + generates: + + ```python + print(something) + ``` + + And: + ```pytb Traceback (most recent call last): - File "", line 1, in + File "", line 1, in ImportError: No module named 'hello' ``` - You can edit your issue descriptions and comments at any time to improve readability. This helps maintainers a lot. Thanks! + + generates: + + ```pytb + Traceback (most recent call last): + File "", line 1, in + ImportError: No module named 'hello' + ``` + + You can edit your issue descriptions and comments at any time to improve readability. This helps maintainers a lot. Thanks! Issue/Comment: Linking to code - :: - Friendly advice: for clarity's sake, you can link to code like [this](https://help.github.com/articles/creating-a-permanent-link-to-a-code-snippet/). 
+:: + + Friendly advice: for clarity's sake, you can link to code like [this](https://help.github.com/articles/creating-a-permanent-link-to-a-code-snippet/). Issue/Comment: Linking to comments - :: - Please use links to comments, which make it a lot easier to see what you are referring to, rather than just linking to the issue. See [this](https://stackoverflow.com/questions/25163598/how-do-i-reference-a-specific-issue-comment-on-github) for more details. +:: + + Please use links to comments, which make it a lot easier to see what you are referring to, rather than just linking to the issue. See [this](https://stackoverflow.com/questions/25163598/how-do-i-reference-a-specific-issue-comment-on-github) for more details. PR-NEW: Better description and title - :: - Thanks for the pull request! Please make the title of the PR more descriptive. The title will become the commit message when this is merged. You should state what issue (or PR) it fixes/resolves in the description using the syntax described [here](https://scikit-learn.org/dev/developers/contributing.html#contributing-pull-requests). +:: + + Thanks for the pull request! Please make the title of the PR more descriptive. The title will become the commit message when this is merged. You should state what issue (or PR) it fixes/resolves in the description using the syntax described [here](https://scikit-learn.org/dev/developers/contributing.html#contributing-pull-requests). PR-NEW: Fix # - :: - Please use "Fix #issueNumber" in your PR description (and you can do it more than once). This way the associated issue gets closed automatically when the PR is merged. For more details, look at [this](https://github.com/blog/1506-closing-issues-via-pull-requests). +:: + + Please use "Fix #issueNumber" in your PR description (and you can do it more than once). This way the associated issue gets closed automatically when the PR is merged. For more details, look at [this](https://github.com/blog/1506-closing-issues-via-pull-requests). PR-NEW or Issue: Maintenance cost - :: - Every feature we include has a [maintenance cost](https://scikit-learn.org/dev/faq.html#why-are-you-so-selective-on-what-algorithms-you-include-in-scikit-learn). Our maintainers are mostly volunteers. For a new feature to be included, we need evidence that it is often useful and, ideally, [well-established](https://scikit-learn.org/dev/faq.html#what-are-the-inclusion-criteria-for-new-algorithms) in the literature or in practice. Also, we expect PR authors to take part in the maintenance for the code they submit, at least initially. That doesn't stop you implementing it for yourself and publishing it in a separate repository, or even [scikit-learn-contrib](https://scikit-learn-contrib.github.io). +:: + + Every feature we include has a [maintenance cost](https://scikit-learn.org/dev/faq.html#why-are-you-so-selective-on-what-algorithms-you-include-in-scikit-learn). Our maintainers are mostly volunteers. For a new feature to be included, we need evidence that it is often useful and, ideally, [well-established](https://scikit-learn.org/dev/faq.html#what-are-the-inclusion-criteria-for-new-algorithms) in the literature or in practice. Also, we expect PR authors to take part in the maintenance for the code they submit, at least initially. That doesn't stop you implementing it for yourself and publishing it in a separate repository, or even [scikit-learn-contrib](https://scikit-learn-contrib.github.io). PR-WIP: What's needed before merge? 
- :: - Please clarify (perhaps as a TODO list in the PR description) what work you believe still needs to be done before it can be reviewed for merge. When it is ready, please prefix the PR title with `[MRG]`. +:: + + Please clarify (perhaps as a TODO list in the PR description) what work you believe still needs to be done before it can be reviewed for merge. When it is ready, please prefix the PR title with `[MRG]`. PR-WIP: Regression test needed - :: - Please add a [non-regression test](https://en.wikipedia.org/wiki/Non-regression_testing) that would fail at main but pass in this PR. +:: + + Please add a [non-regression test](https://en.wikipedia.org/wiki/Non-regression_testing) that would fail at main but pass in this PR. PR-WIP: PEP8 - :: - You have some [PEP8](https://www.python.org/dev/peps/pep-0008/) violations, whose details you can see in the Circle CI `lint` job. It might be worth configuring your code editor to check for such errors on the fly, so you can catch them before committing. +:: + + You have some [PEP8](https://www.python.org/dev/peps/pep-0008/) violations, whose details you can see in the Circle CI `lint` job. It might be worth configuring your code editor to check for such errors on the fly, so you can catch them before committing. PR-MRG: Patience - :: - Before merging, we generally require two core developers to agree that your pull request is desirable and ready. [Please be patient](https://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention), as we mostly rely on volunteered time from busy core developers. (You are also welcome to help us out with [reviewing other PRs](https://scikit-learn.org/dev/developers/contributing.html#code-review-guidelines).) +:: + + Before merging, we generally require two core developers to agree that your pull request is desirable and ready. [Please be patient](https://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention), as we mostly rely on volunteered time from busy core developers. (You are also welcome to help us out with [reviewing other PRs](https://scikit-learn.org/dev/developers/contributing.html#code-review-guidelines).) PR-MRG: Add to what's new - :: - Please add an entry to the change log at `doc/whats_new/v*.rst`. Like the other entries there, please reference this pull request with `:pr:` and credit yourself (and other contributors if applicable) with `:user:`. +:: + + Please add an entry to the change log at `doc/whats_new/v*.rst`. Like the other entries there, please reference this pull request with `:pr:` and credit yourself (and other contributors if applicable) with `:user:`. PR: Don't change unrelated - :: - Please do not change unrelated lines. It makes your contribution harder to review and may introduce merge conflicts to other pull requests. +:: + + Please do not change unrelated lines. It makes your contribution harder to review and may introduce merge conflicts to other pull requests. .. highlight:: default @@ -244,19 +264,19 @@ valgrind_. Valgrind is a command-line tool that can trace memory errors in a variety of code. Follow these steps: - 1. Install `valgrind`_ on your system. +1. Install `valgrind`_ on your system. - 2. Download the python valgrind suppression file: `valgrind-python.supp`_. +2. Download the python valgrind suppression file: `valgrind-python.supp`_. - 3. Follow the directions in the `README.valgrind`_ file to customize your - python suppressions. 
If you don't, you will have spurious output coming - related to the python interpreter instead of your own code. +3. Follow the directions in the `README.valgrind`_ file to customize your + python suppressions. If you don't, you will have spurious output coming + related to the python interpreter instead of your own code. - 4. Run valgrind as follows: +4. Run valgrind as follows: -.. prompt:: bash $ + .. prompt:: bash $ - valgrind -v --suppressions=valgrind-python.supp python my_test_script.py + valgrind -v --suppressions=valgrind-python.supp python my_test_script.py .. _valgrind: https://valgrind.org .. _`README.valgrind`: https://github.com/python/cpython/blob/master/Misc/README.valgrind diff --git a/doc/model_persistence.rst b/doc/model_persistence.rst index 53f01fd019d79..b8da5c8a3961f 100644 --- a/doc/model_persistence.rst +++ b/doc/model_persistence.rst @@ -58,7 +58,7 @@ with:: When an estimator is unpickled with a scikit-learn version that is inconsistent with the version the estimator was pickled with, a :class:`~sklearn.exceptions.InconsistentVersionWarning` is raised. This warning -can be caught to obtain the original version the estimator was pickled with: +can be caught to obtain the original version the estimator was pickled with:: from sklearn.exceptions import InconsistentVersionWarning warnings.simplefilter("error", InconsistentVersionWarning) diff --git a/doc/modules/clustering.rst b/doc/modules/clustering.rst index 4cd86a0bf70c1..c64b3d9d646c9 100644 --- a/doc/modules/clustering.rst +++ b/doc/modules/clustering.rst @@ -1042,16 +1042,16 @@ efficiently, HDBSCAN first extracts a minimum spanning tree (MST) from the fully -connected mutual reachability graph, then greedily cuts the edges with highest weight. An outline of the HDBSCAN algorithm is as follows: - 1. Extract the MST of :math:`G_{ms}` - 2. Extend the MST by adding a "self edge" for each vertex, with weight equal - to the core distance of the underlying sample. - 3. Initialize a single cluster and label for the MST. - 4. Remove the edge with the greatest weight from the MST (ties are - removed simultaneously). - 5. Assign cluster labels to the connected components which contain the - end points of the now-removed edge. If the component does not have at least - one edge it is instead assigned a "null" label marking it as noise. - 6. Repeat 4-5 until there are no more connected components. +1. Extract the MST of :math:`G_{ms}`. +2. Extend the MST by adding a "self edge" for each vertex, with weight equal + to the core distance of the underlying sample. +3. Initialize a single cluster and label for the MST. +4. Remove the edge with the greatest weight from the MST (ties are + removed simultaneously). +5. Assign cluster labels to the connected components which contain the + end points of the now-removed edge. If the component does not have at least + one edge it is instead assigned a "null" label marking it as noise. +6. Repeat 4-5 until there are no more connected components. HDBSCAN is therefore able to obtain all possible partitions achievable by DBSCAN* for a fixed choice of `min_samples` in a hierarchical fashion. @@ -1233,11 +1233,11 @@ clusters (labels) and the samples are mapped to the global label of the nearest **BIRCH or MiniBatchKMeans?** - - BIRCH does not scale very well to high dimensional data. As a rule of thumb if - ``n_features`` is greater than twenty, it is generally better to use MiniBatchKMeans. 
- - If the number of instances of data needs to be reduced, or if one wants a - large number of subclusters either as a preprocessing step or otherwise, - BIRCH is more useful than MiniBatchKMeans. +- BIRCH does not scale very well to high dimensional data. As a rule of thumb if + ``n_features`` is greater than twenty, it is generally better to use MiniBatchKMeans. +- If the number of instances of data needs to be reduced, or if one wants a + large number of subclusters either as a preprocessing step or otherwise, + BIRCH is more useful than MiniBatchKMeans. **How to use partial_fit?** @@ -1245,12 +1245,12 @@ clusters (labels) and the samples are mapped to the global label of the nearest To avoid the computation of global clustering, for every call of ``partial_fit`` the user is advised - 1. To set ``n_clusters=None`` initially - 2. Train all data by multiple calls to partial_fit. - 3. Set ``n_clusters`` to a required value using - ``brc.set_params(n_clusters=n_clusters)``. - 4. Call ``partial_fit`` finally with no arguments, i.e. ``brc.partial_fit()`` - which performs the global clustering. +1. To set ``n_clusters=None`` initially +2. Train all data by multiple calls to partial_fit. +3. Set ``n_clusters`` to a required value using + ``brc.set_params(n_clusters=n_clusters)``. +4. Call ``partial_fit`` finally with no arguments, i.e. ``brc.partial_fit()`` + which performs the global clustering. .. image:: ../auto_examples/cluster/images/sphx_glr_plot_birch_vs_minibatchkmeans_001.png :target: ../auto_examples/cluster/plot_birch_vs_minibatchkmeans.html @@ -2196,19 +2196,19 @@ under the true and predicted clusterings. It has the following entries: - :math:`C_{00}` : number of pairs with both clusterings having the samples - not clustered together +:math:`C_{00}` : number of pairs with both clusterings having the samples +not clustered together - :math:`C_{10}` : number of pairs with the true label clustering having the - samples clustered together but the other clustering not having the samples - clustered together +:math:`C_{10}` : number of pairs with the true label clustering having the +samples clustered together but the other clustering not having the samples +clustered together - :math:`C_{01}` : number of pairs with the true label clustering not having - the samples clustered together but the other clustering having the samples - clustered together +:math:`C_{01}` : number of pairs with the true label clustering not having +the samples clustered together but the other clustering having the samples +clustered together - :math:`C_{11}` : number of pairs with both clusterings having the samples - clustered together +:math:`C_{11}` : number of pairs with both clusterings having the samples +clustered together Considering a pair of samples that is clustered together a positive pair, then as in binary classification the count of true negatives is diff --git a/doc/modules/cross_validation.rst b/doc/modules/cross_validation.rst index 53206bce28c8f..24a8e2f2d2acd 100644 --- a/doc/modules/cross_validation.rst +++ b/doc/modules/cross_validation.rst @@ -86,10 +86,10 @@ the training set is split into *k* smaller sets but generally follow the same principles). The following procedure is followed for each of the *k* "folds": - * A model is trained using :math:`k-1` of the folds as training data; - * the resulting model is validated on the remaining part of the data - (i.e., it is used as a test set to compute a performance measure - such as accuracy). 
+* A model is trained using :math:`k-1` of the folds as training data; +* the resulting model is validated on the remaining part of the data + (i.e., it is used as a test set to compute a performance measure + such as accuracy). The performance measure reported by *k*-fold cross-validation is then the average of the values computed in the loop. diff --git a/doc/modules/decomposition.rst b/doc/modules/decomposition.rst index 223985c6579f0..e8241a92cfc3b 100644 --- a/doc/modules/decomposition.rst +++ b/doc/modules/decomposition.rst @@ -72,11 +72,11 @@ exactly match the results of :class:`PCA` while processing the data in a minibatch fashion. :class:`IncrementalPCA` makes it possible to implement out-of-core Principal Component Analysis either by: - * Using its ``partial_fit`` method on chunks of data fetched sequentially - from the local hard drive or a network database. +* Using its ``partial_fit`` method on chunks of data fetched sequentially + from the local hard drive or a network database. - * Calling its fit method on a memory mapped file using - ``numpy.memmap``. +* Calling its fit method on a memory mapped file using + ``numpy.memmap``. :class:`IncrementalPCA` only stores estimates of component and noise variances, in order update ``explained_variance_ratio_`` incrementally. This is why @@ -358,14 +358,14 @@ components is less than 10 (strict) and the number of samples is more than 200 * *randomized* solver: - * Algorithm 4.3 in - :arxiv:`"Finding structure with randomness: Stochastic - algorithms for constructing approximate matrix decompositions" <0909.4061>` - Halko, et al. (2009) + * Algorithm 4.3 in + :arxiv:`"Finding structure with randomness: Stochastic + algorithms for constructing approximate matrix decompositions" <0909.4061>` + Halko, et al. (2009) - * :arxiv:`"An implementation of a randomized algorithm - for principal component analysis" <1412.3510>` - A. Szlam et al. (2014) + * :arxiv:`"An implementation of a randomized algorithm + for principal component analysis" <1412.3510>` + A. Szlam et al. (2014) * *arpack* solver: `scipy.sparse.linalg.eigsh documentation @@ -636,7 +636,7 @@ does not fit into the memory. computationally efficient and implements on-line learning with a ``partial_fit`` method. - Example: :ref:`sphx_glr_auto_examples_cluster_plot_dict_face_patches.py` + Example: :ref:`sphx_glr_auto_examples_cluster_plot_dict_face_patches.py` .. currentmodule:: sklearn.decomposition @@ -1008,10 +1008,10 @@ The graphical model of LDA is a three-level generative model: Note on notations presented in the graphical model above, which can be found in Hoffman et al. (2013): - * The corpus is a collection of :math:`D` documents. - * A document is a sequence of :math:`N` words. - * There are :math:`K` topics in the corpus. - * The boxes represent repeated sampling. +* The corpus is a collection of :math:`D` documents. +* A document is a sequence of :math:`N` words. +* There are :math:`K` topics in the corpus. +* The boxes represent repeated sampling. In the graphical model, each node is a random variable and has a role in the generative process. A shaded node indicates an observed variable and an unshaded @@ -1029,21 +1029,21 @@ When modeling text corpora, the model assumes the following generative process for a corpus with :math:`D` documents and :math:`K` topics, with :math:`K` corresponding to `n_components` in the API: - 1. For each topic :math:`k \in K`, draw :math:`\beta_k \sim - \mathrm{Dirichlet}(\eta)`. This provides a distribution over the words, - i.e. 
the probability of a word appearing in topic :math:`k`. - :math:`\eta` corresponds to `topic_word_prior`. +1. For each topic :math:`k \in K`, draw :math:`\beta_k \sim + \mathrm{Dirichlet}(\eta)`. This provides a distribution over the words, + i.e. the probability of a word appearing in topic :math:`k`. + :math:`\eta` corresponds to `topic_word_prior`. - 2. For each document :math:`d \in D`, draw the topic proportions - :math:`\theta_d \sim \mathrm{Dirichlet}(\alpha)`. :math:`\alpha` - corresponds to `doc_topic_prior`. +2. For each document :math:`d \in D`, draw the topic proportions + :math:`\theta_d \sim \mathrm{Dirichlet}(\alpha)`. :math:`\alpha` + corresponds to `doc_topic_prior`. - 3. For each word :math:`i` in document :math:`d`: +3. For each word :math:`i` in document :math:`d`: - a. Draw the topic assignment :math:`z_{di} \sim \mathrm{Multinomial} - (\theta_d)` - b. Draw the observed word :math:`w_{ij} \sim \mathrm{Multinomial} - (\beta_{z_{di}})` + a. Draw the topic assignment :math:`z_{di} \sim \mathrm{Multinomial} + (\theta_d)` + b. Draw the observed word :math:`w_{ij} \sim \mathrm{Multinomial} + (\beta_{z_{di}})` For parameter estimation, the posterior distribution is: diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index 73b4420960717..334e00e35a848 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -285,13 +285,13 @@ model. For a predictor :math:`F` with two features: - - a **monotonic increase constraint** is a constraint of the form: - .. math:: - x_1 \leq x_1' \implies F(x_1, x_2) \leq F(x_1', x_2) +- a **monotonic increase constraint** is a constraint of the form: + .. math:: + x_1 \leq x_1' \implies F(x_1, x_2) \leq F(x_1', x_2) - - a **monotonic decrease constraint** is a constraint of the form: - .. math:: - x_1 \leq x_1' \implies F(x_1, x_2) \geq F(x_1', x_2) +- a **monotonic decrease constraint** is a constraint of the form: + .. math:: + x_1 \leq x_1' \implies F(x_1, x_2) \geq F(x_1', x_2) You can specify a monotonic constraint on each feature using the `monotonic_cst` parameter. For each feature, a value of 0 indicates no @@ -311,8 +311,8 @@ Nevertheless, monotonic constraints only marginally constrain feature effects on For instance, monotonic increase and decrease constraints cannot be used to enforce the following modelling constraint: - .. math:: - x_1 \leq x_1' \implies F(x_1, x_2) \leq F(x_1', x_2') +.. math:: + x_1 \leq x_1' \implies F(x_1, x_2) \leq F(x_1', x_2') Also, monotonic constraints are not supported for multiclass classification. @@ -584,9 +584,9 @@ Regression GBRT regressors are additive models whose prediction :math:`\hat{y}_i` for a given input :math:`x_i` is of the following form: - .. math:: +.. math:: - \hat{y}_i = F_M(x_i) = \sum_{m=1}^{M} h_m(x_i) + \hat{y}_i = F_M(x_i) = \sum_{m=1}^{M} h_m(x_i) where the :math:`h_m` are estimators called *weak learners* in the context of boosting. Gradient Tree Boosting uses :ref:`decision tree regressors @@ -595,17 +595,17 @@ of boosting. Gradient Tree Boosting uses :ref:`decision tree regressors Similar to other boosting algorithms, a GBRT is built in a greedy fashion: - .. math:: +.. math:: - F_m(x) = F_{m-1}(x) + h_m(x), + F_m(x) = F_{m-1}(x) + h_m(x), where the newly added tree :math:`h_m` is fitted in order to minimize a sum of losses :math:`L_m`, given the previous ensemble :math:`F_{m-1}`: - .. math:: +.. 
math:: - h_m = \arg\min_{h} L_m = \arg\min_{h} \sum_{i=1}^{n} - l(y_i, F_{m-1}(x_i) + h(x_i)), + h_m = \arg\min_{h} L_m = \arg\min_{h} \sum_{i=1}^{n} + l(y_i, F_{m-1}(x_i) + h(x_i)), where :math:`l(y_i, F(x_i))` is defined by the `loss` parameter, detailed in the next section. @@ -618,12 +618,12 @@ argument. Using a first-order Taylor approximation, the value of :math:`l` can be approximated as follows: - .. math:: +.. math:: - l(y_i, F_{m-1}(x_i) + h_m(x_i)) \approx - l(y_i, F_{m-1}(x_i)) - + h_m(x_i) - \left[ \frac{\partial l(y_i, F(x_i))}{\partial F(x_i)} \right]_{F=F_{m - 1}}. + l(y_i, F_{m-1}(x_i) + h_m(x_i)) \approx + l(y_i, F_{m-1}(x_i)) + + h_m(x_i) + \left[ \frac{\partial l(y_i, F(x_i))}{\partial F(x_i)} \right]_{F=F_{m - 1}}. .. note:: @@ -640,9 +640,9 @@ differentiable. We will denote it by :math:`g_i`. Removing the constant terms, we have: - .. math:: +.. math:: - h_m \approx \arg\min_{h} \sum_{i=1}^{n} h(x_i) g_i + h_m \approx \arg\min_{h} \sum_{i=1}^{n} h(x_i) g_i This is minimized if :math:`h(x_i)` is fitted to predict a value that is proportional to the negative gradient :math:`-g_i`. Therefore, at each @@ -691,40 +691,40 @@ Loss Functions The following loss functions are supported and can be specified using the parameter ``loss``: - * Regression - - * Squared error (``'squared_error'``): The natural choice for regression - due to its superior computational properties. The initial model is - given by the mean of the target values. - * Absolute error (``'absolute_error'``): A robust loss function for - regression. The initial model is given by the median of the - target values. - * Huber (``'huber'``): Another robust loss function that combines - least squares and least absolute deviation; use ``alpha`` to - control the sensitivity with regards to outliers (see [Friedman2001]_ for - more details). - * Quantile (``'quantile'``): A loss function for quantile regression. - Use ``0 < alpha < 1`` to specify the quantile. This loss function - can be used to create prediction intervals - (see :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_quantile.py`). - - * Classification - - * Binary log-loss (``'log-loss'``): The binomial - negative log-likelihood loss function for binary classification. It provides - probability estimates. The initial model is given by the - log odds-ratio. - * Multi-class log-loss (``'log-loss'``): The multinomial - negative log-likelihood loss function for multi-class classification with - ``n_classes`` mutually exclusive classes. It provides - probability estimates. The initial model is given by the - prior probability of each class. At each iteration ``n_classes`` - regression trees have to be constructed which makes GBRT rather - inefficient for data sets with a large number of classes. - * Exponential loss (``'exponential'``): The same loss function - as :class:`AdaBoostClassifier`. Less robust to mislabeled - examples than ``'log-loss'``; can only be used for binary - classification. +* Regression + + * Squared error (``'squared_error'``): The natural choice for regression + due to its superior computational properties. The initial model is + given by the mean of the target values. + * Absolute error (``'absolute_error'``): A robust loss function for + regression. The initial model is given by the median of the + target values. 
+ * Huber (``'huber'``): Another robust loss function that combines + least squares and least absolute deviation; use ``alpha`` to + control the sensitivity with regards to outliers (see [Friedman2001]_ for + more details). + * Quantile (``'quantile'``): A loss function for quantile regression. + Use ``0 < alpha < 1`` to specify the quantile. This loss function + can be used to create prediction intervals + (see :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_quantile.py`). + +* Classification + + * Binary log-loss (``'log-loss'``): The binomial + negative log-likelihood loss function for binary classification. It provides + probability estimates. The initial model is given by the + log odds-ratio. + * Multi-class log-loss (``'log-loss'``): The multinomial + negative log-likelihood loss function for multi-class classification with + ``n_classes`` mutually exclusive classes. It provides + probability estimates. The initial model is given by the + prior probability of each class. At each iteration ``n_classes`` + regression trees have to be constructed which makes GBRT rather + inefficient for data sets with a large number of classes. + * Exponential loss (``'exponential'``): The same loss function + as :class:`AdaBoostClassifier`. Less robust to mislabeled + examples than ``'log-loss'``; can only be used for binary + classification. .. _gradient_boosting_shrinkage: @@ -1171,17 +1171,17 @@ shallow decision trees). Bagging methods come in many flavours but mostly differ from each other by the way they draw random subsets of the training set: - * When random subsets of the dataset are drawn as random subsets of the - samples, then this algorithm is known as Pasting [B1999]_. +* When random subsets of the dataset are drawn as random subsets of the + samples, then this algorithm is known as Pasting [B1999]_. - * When samples are drawn with replacement, then the method is known as - Bagging [B1996]_. +* When samples are drawn with replacement, then the method is known as + Bagging [B1996]_. - * When random subsets of the dataset are drawn as random subsets of - the features, then the method is known as Random Subspaces [H1998]_. +* When random subsets of the dataset are drawn as random subsets of + the features, then the method is known as Random Subspaces [H1998]_. - * Finally, when base estimators are built on subsets of both samples and - features, then the method is known as Random Patches [LG2012]_. +* Finally, when base estimators are built on subsets of both samples and + features, then the method is known as Random Patches [LG2012]_. In scikit-learn, bagging methods are offered as a unified :class:`BaggingClassifier` meta-estimator (resp. :class:`BaggingRegressor`), @@ -1591,10 +1591,10 @@ concentrate on the examples that are missed by the previous ones in the sequence AdaBoost can be used both for classification and regression problems: - - For multi-class classification, :class:`AdaBoostClassifier` implements - AdaBoost.SAMME [ZZRH2009]_. +- For multi-class classification, :class:`AdaBoostClassifier` implements + AdaBoost.SAMME [ZZRH2009]_. - - For regression, :class:`AdaBoostRegressor` implements AdaBoost.R2 [D1997]_. +- For regression, :class:`AdaBoostRegressor` implements AdaBoost.R2 [D1997]_. 
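
A minimal sketch of the classification case follows; the synthetic dataset from
:func:`~sklearn.datasets.make_classification` is a placeholder used only to make
the snippet runnable::

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier

    X, y = make_classification(n_samples=200, random_state=0)
    # a sequence of shallow trees (decision stumps by default), each one
    # re-weighted towards the samples its predecessors misclassified
    clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
    print(clf.score(X, y))
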
Usage ----- diff --git a/doc/modules/feature_extraction.rst b/doc/modules/feature_extraction.rst index 9653ba9d7b646..7ac538a89849b 100644 --- a/doc/modules/feature_extraction.rst +++ b/doc/modules/feature_extraction.rst @@ -615,7 +615,7 @@ As usual the best way to adjust the feature extraction parameters is to use a cross-validated grid search, for instance by pipelining the feature extractor with a classifier: - * :ref:`sphx_glr_auto_examples_model_selection_plot_grid_search_text_feature_extraction.py` +* :ref:`sphx_glr_auto_examples_model_selection_plot_grid_search_text_feature_extraction.py` |details-end| @@ -715,18 +715,18 @@ In particular in a **supervised setting** it can be successfully combined with fast and scalable linear models to train **document classifiers**, for instance: - * :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py` +* :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py` In an **unsupervised setting** it can be used to group similar documents together by applying clustering algorithms such as :ref:`k_means`: - * :ref:`sphx_glr_auto_examples_text_plot_document_clustering.py` +* :ref:`sphx_glr_auto_examples_text_plot_document_clustering.py` Finally it is possible to discover the main topics of a corpus by relaxing the hard assignment constraint of clustering, for instance by using :ref:`NMF`: - * :ref:`sphx_glr_auto_examples_applications_plot_topics_extraction_with_nmf_lda.py` +* :ref:`sphx_glr_auto_examples_applications_plot_topics_extraction_with_nmf_lda.py` Limitations of the Bag of Words representation @@ -923,19 +923,19 @@ to the vectorizer constructor:: In particular we name: - * ``preprocessor``: a callable that takes an entire document as input (as a - single string), and returns a possibly transformed version of the document, - still as an entire string. This can be used to remove HTML tags, lowercase - the entire document, etc. +* ``preprocessor``: a callable that takes an entire document as input (as a + single string), and returns a possibly transformed version of the document, + still as an entire string. This can be used to remove HTML tags, lowercase + the entire document, etc. - * ``tokenizer``: a callable that takes the output from the preprocessor - and splits it into tokens, then returns a list of these. +* ``tokenizer``: a callable that takes the output from the preprocessor + and splits it into tokens, then returns a list of these. - * ``analyzer``: a callable that replaces the preprocessor and tokenizer. - The default analyzers all call the preprocessor and tokenizer, but custom - analyzers will skip this. N-gram extraction and stop word filtering take - place at the analyzer level, so a custom analyzer may have to reproduce - these steps. +* ``analyzer``: a callable that replaces the preprocessor and tokenizer. + The default analyzers all call the preprocessor and tokenizer, but custom + analyzers will skip this. N-gram extraction and stop word filtering take + place at the analyzer level, so a custom analyzer may have to reproduce + these steps. (Lucene users might recognize these names, but be aware that scikit-learn concepts may not map one-to-one onto Lucene concepts.) @@ -951,53 +951,53 @@ factory methods instead of passing custom functions. 
Some tips and tricks: - * If documents are pre-tokenized by an external package, then store them in - files (or strings) with the tokens separated by whitespace and pass - ``analyzer=str.split`` - * Fancy token-level analysis such as stemming, lemmatizing, compound - splitting, filtering based on part-of-speech, etc. are not included in the - scikit-learn codebase, but can be added by customizing either the - tokenizer or the analyzer. - Here's a ``CountVectorizer`` with a tokenizer and lemmatizer using - `NLTK `_:: - - >>> from nltk import word_tokenize # doctest: +SKIP - >>> from nltk.stem import WordNetLemmatizer # doctest: +SKIP - >>> class LemmaTokenizer: - ... def __init__(self): - ... self.wnl = WordNetLemmatizer() - ... def __call__(self, doc): - ... return [self.wnl.lemmatize(t) for t in word_tokenize(doc)] - ... - >>> vect = CountVectorizer(tokenizer=LemmaTokenizer()) # doctest: +SKIP - - (Note that this will not filter out punctuation.) - - - The following example will, for instance, transform some British spelling - to American spelling:: - - >>> import re - >>> def to_british(tokens): - ... for t in tokens: - ... t = re.sub(r"(...)our$", r"\1or", t) - ... t = re.sub(r"([bt])re$", r"\1er", t) - ... t = re.sub(r"([iy])s(e$|ing|ation)", r"\1z\2", t) - ... t = re.sub(r"ogue$", "og", t) - ... yield t - ... - >>> class CustomVectorizer(CountVectorizer): - ... def build_tokenizer(self): - ... tokenize = super().build_tokenizer() - ... return lambda doc: list(to_british(tokenize(doc))) - ... - >>> print(CustomVectorizer().build_analyzer()(u"color colour")) - [...'color', ...'color'] - - for other styles of preprocessing; examples include stemming, lemmatization, - or normalizing numerical tokens, with the latter illustrated in: - - * :ref:`sphx_glr_auto_examples_bicluster_plot_bicluster_newsgroups.py` +* If documents are pre-tokenized by an external package, then store them in + files (or strings) with the tokens separated by whitespace and pass + ``analyzer=str.split`` +* Fancy token-level analysis such as stemming, lemmatizing, compound + splitting, filtering based on part-of-speech, etc. are not included in the + scikit-learn codebase, but can be added by customizing either the + tokenizer or the analyzer. + Here's a ``CountVectorizer`` with a tokenizer and lemmatizer using + `NLTK `_:: + + >>> from nltk import word_tokenize # doctest: +SKIP + >>> from nltk.stem import WordNetLemmatizer # doctest: +SKIP + >>> class LemmaTokenizer: + ... def __init__(self): + ... self.wnl = WordNetLemmatizer() + ... def __call__(self, doc): + ... return [self.wnl.lemmatize(t) for t in word_tokenize(doc)] + ... + >>> vect = CountVectorizer(tokenizer=LemmaTokenizer()) # doctest: +SKIP + + (Note that this will not filter out punctuation.) + + + The following example will, for instance, transform some British spelling + to American spelling:: + + >>> import re + >>> def to_british(tokens): + ... for t in tokens: + ... t = re.sub(r"(...)our$", r"\1or", t) + ... t = re.sub(r"([bt])re$", r"\1er", t) + ... t = re.sub(r"([iy])s(e$|ing|ation)", r"\1z\2", t) + ... t = re.sub(r"ogue$", "og", t) + ... yield t + ... + >>> class CustomVectorizer(CountVectorizer): + ... def build_tokenizer(self): + ... tokenize = super().build_tokenizer() + ... return lambda doc: list(to_british(tokenize(doc))) + ... 
+ >>> print(CustomVectorizer().build_analyzer()(u"color colour")) + [...'color', ...'color'] + + for other styles of preprocessing; examples include stemming, lemmatization, + or normalizing numerical tokens, with the latter illustrated in: + + * :ref:`sphx_glr_auto_examples_bicluster_plot_bicluster_newsgroups.py` Customizing the vectorizer can also be useful when handling Asian languages diff --git a/doc/modules/feature_selection.rst b/doc/modules/feature_selection.rst index 7fcec524e7168..1ae950acdfbb6 100644 --- a/doc/modules/feature_selection.rst +++ b/doc/modules/feature_selection.rst @@ -57,18 +57,18 @@ univariate statistical tests. It can be seen as a preprocessing step to an estimator. Scikit-learn exposes feature selection routines as objects that implement the ``transform`` method: - * :class:`SelectKBest` removes all but the :math:`k` highest scoring features +* :class:`SelectKBest` removes all but the :math:`k` highest scoring features - * :class:`SelectPercentile` removes all but a user-specified highest scoring - percentage of features +* :class:`SelectPercentile` removes all but a user-specified highest scoring + percentage of features - * using common univariate statistical tests for each feature: - false positive rate :class:`SelectFpr`, false discovery rate - :class:`SelectFdr`, or family wise error :class:`SelectFwe`. +* using common univariate statistical tests for each feature: + false positive rate :class:`SelectFpr`, false discovery rate + :class:`SelectFdr`, or family wise error :class:`SelectFwe`. - * :class:`GenericUnivariateSelect` allows to perform univariate feature - selection with a configurable strategy. This allows to select the best - univariate selection strategy with hyper-parameter search estimator. +* :class:`GenericUnivariateSelect` allows to perform univariate feature + selection with a configurable strategy. This allows to select the best + univariate selection strategy with hyper-parameter search estimator. For instance, we can use a F-test to retrieve the two best features for a dataset as follows: @@ -87,9 +87,9 @@ These objects take as input a scoring function that returns univariate scores and p-values (or only scores for :class:`SelectKBest` and :class:`SelectPercentile`): - * For regression: :func:`r_regression`, :func:`f_regression`, :func:`mutual_info_regression` +* For regression: :func:`r_regression`, :func:`f_regression`, :func:`mutual_info_regression` - * For classification: :func:`chi2`, :func:`f_classif`, :func:`mutual_info_classif` +* For classification: :func:`chi2`, :func:`f_classif`, :func:`mutual_info_classif` The methods based on F-test estimate the degree of linear dependency between two random variables. On the other hand, mutual information methods can capture diff --git a/doc/modules/gaussian_process.rst b/doc/modules/gaussian_process.rst index 55960e901b166..58e56a557ed73 100644 --- a/doc/modules/gaussian_process.rst +++ b/doc/modules/gaussian_process.rst @@ -11,25 +11,25 @@ to solve *regression* and *probabilistic classification* problems. The advantages of Gaussian processes are: - - The prediction interpolates the observations (at least for regular - kernels). +- The prediction interpolates the observations (at least for regular + kernels). - - The prediction is probabilistic (Gaussian) so that one can compute - empirical confidence intervals and decide based on those if one should - refit (online fitting, adaptive fitting) the prediction in some - region of interest. 
+- The prediction is probabilistic (Gaussian) so that one can compute
+  empirical confidence intervals and decide based on those if one should
+  refit (online fitting, adaptive fitting) the prediction in some
+  region of interest.
 
-  - Versatile: different :ref:`kernels
-    ` can be specified. Common kernels are provided, but
-    it is also possible to specify custom kernels.
+- Versatile: different :ref:`kernels
+  ` can be specified. Common kernels are provided, but
+  it is also possible to specify custom kernels.
 
 The disadvantages of Gaussian processes include:
 
-  - Our implementation is not sparse, i.e., they use the whole samples/features
-    information to perform the prediction.
+- Our implementation is not sparse, i.e., it uses the whole samples/features
+  information to perform the prediction.
 
-  - They lose efficiency in high dimensional spaces -- namely when the number
-    of features exceeds a few dozens.
+- They lose efficiency in high dimensional spaces -- namely when the number
+  of features exceeds a few dozen.
 
 .. _gpr:
 
@@ -386,7 +386,7 @@ Matérn kernel
 -------------
 The :class:`Matern` kernel is a stationary kernel and a generalization of the
 :class:`RBF` kernel. It has an additional parameter :math:`\nu` which controls
-the smoothness of the resulting function. It is parameterized by a length-scale parameter :math:`l>0`, which can either be a scalar (isotropic variant of the kernel) or a vector with the same number of dimensions as the inputs :math:`x` (anisotropic variant of the kernel).
+the smoothness of the resulting function. It is parameterized by a length-scale parameter :math:`l>0`, which can either be a scalar (isotropic variant of the kernel) or a vector with the same number of dimensions as the inputs :math:`x` (anisotropic variant of the kernel).
 
 |details-start|
 **Mathematical implementation of Matérn kernel**
diff --git a/doc/modules/grid_search.rst b/doc/modules/grid_search.rst
index efdde897e841b..01c5a5c72ee52 100644
--- a/doc/modules/grid_search.rst
+++ b/doc/modules/grid_search.rst
@@ -135,14 +135,14 @@ variate sample) method to sample a value. A call to the ``rvs`` function should
 provide independent random samples from possible parameter values on
 consecutive calls.
 
-    .. warning::
-
-        The distributions in ``scipy.stats`` prior to version scipy 0.16
-        do not allow specifying a random state. Instead, they use the global
-        numpy random state, that can be seeded via ``np.random.seed`` or set
-        using ``np.random.set_state``. However, beginning scikit-learn 0.18,
-        the :mod:`sklearn.model_selection` module sets the random state provided
-        by the user if scipy >= 0.16 is also available.
+.. warning::
+
+    The distributions in ``scipy.stats`` prior to scipy version 0.16
+    do not allow specifying a random state. Instead, they use the global
+    numpy random state, which can be seeded via ``np.random.seed`` or set
+    using ``np.random.set_state``. However, beginning with scikit-learn 0.18,
+    the :mod:`sklearn.model_selection` module sets the random state provided
+    by the user if scipy >= 0.16 is also available.
 
 For continuous parameters, such as ``C`` above, it is important to specify
 a continuous distribution to take full advantage of the randomization. This way,
diff --git a/doc/modules/isotonic.rst b/doc/modules/isotonic.rst
index 8967ef18afcb3..c30ee83b74241 100644
--- a/doc/modules/isotonic.rst
+++ b/doc/modules/isotonic.rst
@@ -9,10 +9,10 @@ Isotonic regression
 
 The class :class:`IsotonicRegression` fits a non-decreasing real
 function to 1-dimensional data.
It solves the following problem: - minimize :math:`\sum_i w_i (y_i - \hat{y}_i)^2` - - subject to :math:`\hat{y}_i \le \hat{y}_j` whenever :math:`X_i \le X_j`, +.. math:: + \min \sum_i w_i (y_i - \hat{y}_i)^2 +subject to :math:`\hat{y}_i \le \hat{y}_j` whenever :math:`X_i \le X_j`, where the weights :math:`w_i` are strictly positive, and both `X` and `y` are arbitrary real quantities. diff --git a/doc/modules/kernel_approximation.rst b/doc/modules/kernel_approximation.rst index 30c5a71b1417d..0c67c36178e3b 100644 --- a/doc/modules/kernel_approximation.rst +++ b/doc/modules/kernel_approximation.rst @@ -57,10 +57,10 @@ points. where: - * :math:`U` is orthonormal - * :math:`Ʌ` is diagonal matrix of eigenvalues - * :math:`U_1` is orthonormal matrix of samples that were chosen - * :math:`U_2` is orthonormal matrix of samples that were not chosen +* :math:`U` is orthonormal +* :math:`\Lambda` is diagonal matrix of eigenvalues +* :math:`U_1` is orthonormal matrix of samples that were chosen +* :math:`U_2` is orthonormal matrix of samples that were not chosen Given that :math:`U_1 \Lambda U_1^T` can be obtained by orthonormalization of the matrix :math:`K_{11}`, and :math:`U_2 \Lambda U_1^T` can be evaluated (as @@ -215,8 +215,8 @@ function given by: where: - * ``x``, ``y`` are the input vectors - * ``d`` is the kernel degree +* ``x``, ``y`` are the input vectors +* ``d`` is the kernel degree Intuitively, the feature space of the polynomial kernel of degree `d` consists of all possible degree-`d` products among input features, which enables diff --git a/doc/modules/linear_model.rst b/doc/modules/linear_model.rst index 13fafaf48c953..e538dde2ed6d5 100644 --- a/doc/modules/linear_model.rst +++ b/doc/modules/linear_model.rst @@ -215,10 +215,10 @@ Cross-Validation. **References** |details-split| - * "Notes on Regularized Least Squares", Rifkin & Lippert (`technical report - `_, - `course slides - `_). +* "Notes on Regularized Least Squares", Rifkin & Lippert (`technical report + `_, + `course slides + `_). |details-end| @@ -587,30 +587,30 @@ between the features. The advantages of LARS are: - - It is numerically efficient in contexts where the number of features - is significantly greater than the number of samples. +- It is numerically efficient in contexts where the number of features + is significantly greater than the number of samples. - - It is computationally just as fast as forward selection and has - the same order of complexity as ordinary least squares. +- It is computationally just as fast as forward selection and has + the same order of complexity as ordinary least squares. - - It produces a full piecewise linear solution path, which is - useful in cross-validation or similar attempts to tune the model. +- It produces a full piecewise linear solution path, which is + useful in cross-validation or similar attempts to tune the model. - - If two features are almost equally correlated with the target, - then their coefficients should increase at approximately the same - rate. The algorithm thus behaves as intuition would expect, and - also is more stable. +- If two features are almost equally correlated with the target, + then their coefficients should increase at approximately the same + rate. The algorithm thus behaves as intuition would expect, and + also is more stable. - - It is easily modified to produce solutions for other estimators, - like the Lasso. +- It is easily modified to produce solutions for other estimators, + like the Lasso. 
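
To make the solution-path point above concrete, here is a small sketch using
:func:`lars_path`; the diabetes data is used purely for illustration::

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import lars_path

    X, y = load_diabetes(return_X_y=True)
    # alphas: regularization values at which the active set changes;
    # coefs: one column of coefficients per such breakpoint
    alphas, active, coefs = lars_path(X, y, method="lasso")
    print(alphas.shape, coefs.shape)
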
The disadvantages of the LARS method include: - - Because LARS is based upon an iterative refitting of the - residuals, it would appear to be especially sensitive to the - effects of noise. This problem is discussed in detail by Weisberg - in the discussion section of the Efron et al. (2004) Annals of - Statistics article. +- Because LARS is based upon an iterative refitting of the + residuals, it would appear to be especially sensitive to the + effects of noise. This problem is discussed in detail by Weisberg + in the discussion section of the Efron et al. (2004) Annals of + Statistics article. The LARS model can be used via the estimator :class:`Lars`, or its low-level implementation :func:`lars_path` or :func:`lars_path_gram`. @@ -707,11 +707,11 @@ previously chosen dictionary elements. **References** |details-split| - * https://www.cs.technion.ac.il/~ronrubin/Publications/KSVD-OMP-v2.pdf +* https://www.cs.technion.ac.il/~ronrubin/Publications/KSVD-OMP-v2.pdf - * `Matching pursuits with time-frequency dictionaries - `_, - S. G. Mallat, Z. Zhang, +* `Matching pursuits with time-frequency dictionaries + `_, + S. G. Mallat, Z. Zhang, |details-end| @@ -743,24 +743,24 @@ estimated from the data. The advantages of Bayesian Regression are: - - It adapts to the data at hand. +- It adapts to the data at hand. - - It can be used to include regularization parameters in the - estimation procedure. +- It can be used to include regularization parameters in the + estimation procedure. The disadvantages of Bayesian regression include: - - Inference of the model can be time consuming. +- Inference of the model can be time consuming. |details-start| **References** |details-split| - * A good introduction to Bayesian methods is given in C. Bishop: Pattern - Recognition and Machine learning +* A good introduction to Bayesian methods is given in C. Bishop: Pattern + Recognition and Machine learning - * Original Algorithm is detailed in the book `Bayesian learning for neural - networks` by Radford M. Neal +* Original Algorithm is detailed in the book `Bayesian learning for neural + networks` by Radford M. Neal |details-end| @@ -827,11 +827,11 @@ is more robust to ill-posed problems. **References** |details-split| - * Section 3.3 in Christopher M. Bishop: Pattern Recognition and Machine Learning, 2006 +* Section 3.3 in Christopher M. Bishop: Pattern Recognition and Machine Learning, 2006 - * David J. C. MacKay, `Bayesian Interpolation `_, 1992. +* David J. C. MacKay, `Bayesian Interpolation `_, 1992. - * Michael E. Tipping, `Sparse Bayesian Learning and the Relevance Vector Machine `_, 2001. +* Michael E. Tipping, `Sparse Bayesian Learning and the Relevance Vector Machine `_, 2001. |details-end| @@ -1372,11 +1372,11 @@ Perceptron The :class:`Perceptron` is another simple classification algorithm suitable for large scale learning. By default: - - It does not require a learning rate. +- It does not require a learning rate. - - It is not regularized (penalized). +- It is not regularized (penalized). - - It updates its model only on mistakes. +- It updates its model only on mistakes. The last characteristic implies that the Perceptron is slightly faster to train than SGD with the hinge loss and that the resulting models are @@ -1407,9 +1407,9 @@ For classification, :class:`PassiveAggressiveClassifier` can be used with **References** |details-split| - * `"Online Passive-Aggressive Algorithms" - `_ - K. Crammer, O. Dekel, J. Keshat, S. Shalev-Shwartz, Y. 
Singer - JMLR 7 (2006) +* `"Online Passive-Aggressive Algorithms" + `_ + K. Crammer, O. Dekel, J. Keshat, S. Shalev-Shwartz, Y. Singer - JMLR 7 (2006) |details-end| diff --git a/doc/modules/metrics.rst b/doc/modules/metrics.rst index 71e914afad192..caea39319e869 100644 --- a/doc/modules/metrics.rst +++ b/doc/modules/metrics.rst @@ -28,9 +28,9 @@ There are a number of ways to convert between a distance metric and a similarity measure, such as a kernel. Let ``D`` be the distance, and ``S`` be the kernel: - 1. ``S = np.exp(-D * gamma)``, where one heuristic for choosing - ``gamma`` is ``1 / num_features`` - 2. ``S = 1. / (D / np.max(D))`` +1. ``S = np.exp(-D * gamma)``, where one heuristic for choosing + ``gamma`` is ``1 / num_features`` +2. ``S = 1. / (D / np.max(D))`` .. currentmodule:: sklearn.metrics @@ -123,8 +123,8 @@ The polynomial kernel is defined as: where: - * ``x``, ``y`` are the input vectors - * ``d`` is the kernel degree +* ``x``, ``y`` are the input vectors +* ``d`` is the kernel degree If :math:`c_0 = 0` the kernel is said to be homogeneous. @@ -143,9 +143,9 @@ activation function). It is defined as: where: - * ``x``, ``y`` are the input vectors - * :math:`\gamma` is known as slope - * :math:`c_0` is known as intercept +* ``x``, ``y`` are the input vectors +* :math:`\gamma` is known as slope +* :math:`c_0` is known as intercept .. _rbf_kernel: @@ -165,14 +165,14 @@ the kernel is known as the Gaussian kernel of variance :math:`\sigma^2`. Laplacian kernel ---------------- -The function :func:`laplacian_kernel` is a variant on the radial basis +The function :func:`laplacian_kernel` is a variant on the radial basis function kernel defined as: .. math:: k(x, y) = \exp( -\gamma \| x-y \|_1) -where ``x`` and ``y`` are the input vectors and :math:`\|x-y\|_1` is the +where ``x`` and ``y`` are the input vectors and :math:`\|x-y\|_1` is the Manhattan distance between the input vectors. It has proven useful in ML applied to noiseless data. @@ -229,4 +229,3 @@ The chi squared kernel is most commonly used on histograms (bags) of visual word categories: A comprehensive study International Journal of Computer Vision 2007 https://hal.archives-ouvertes.fr/hal-00171412/document - diff --git a/doc/modules/mixture.rst b/doc/modules/mixture.rst index e9cc94b1d493d..df5d8020a1369 100644 --- a/doc/modules/mixture.rst +++ b/doc/modules/mixture.rst @@ -14,13 +14,13 @@ matrices supported), sample them, and estimate them from data. Facilities to help determine the appropriate number of components are also provided. - .. figure:: ../auto_examples/mixture/images/sphx_glr_plot_gmm_pdf_001.png - :target: ../auto_examples/mixture/plot_gmm_pdf.html - :align: center - :scale: 50% +.. 
figure:: ../auto_examples/mixture/images/sphx_glr_plot_gmm_pdf_001.png + :target: ../auto_examples/mixture/plot_gmm_pdf.html + :align: center + :scale: 50% - **Two-component Gaussian mixture model:** *data points, and equi-probability - surfaces of the model.* + **Two-component Gaussian mixture model:** *data points, and equi-probability + surfaces of the model.* A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of diff --git a/doc/modules/multiclass.rst b/doc/modules/multiclass.rst index beee41e2aea0b..d3a83997c2dd9 100644 --- a/doc/modules/multiclass.rst +++ b/doc/modules/multiclass.rst @@ -147,35 +147,35 @@ Target format Valid :term:`multiclass` representations for :func:`~sklearn.utils.multiclass.type_of_target` (`y`) are: - - 1d or column vector containing more than two discrete values. An - example of a vector ``y`` for 4 samples: - - >>> import numpy as np - >>> y = np.array(['apple', 'pear', 'apple', 'orange']) - >>> print(y) - ['apple' 'pear' 'apple' 'orange'] - - - Dense or sparse :term:`binary` matrix of shape ``(n_samples, n_classes)`` - with a single sample per row, where each column represents one class. An - example of both a dense and sparse :term:`binary` matrix ``y`` for 4 - samples, where the columns, in order, are apple, orange, and pear: - - >>> import numpy as np - >>> from sklearn.preprocessing import LabelBinarizer - >>> y = np.array(['apple', 'pear', 'apple', 'orange']) - >>> y_dense = LabelBinarizer().fit_transform(y) - >>> print(y_dense) - [[1 0 0] - [0 0 1] - [1 0 0] - [0 1 0]] - >>> from scipy import sparse - >>> y_sparse = sparse.csr_matrix(y_dense) - >>> print(y_sparse) - (0, 0) 1 - (1, 2) 1 - (2, 0) 1 - (3, 1) 1 +- 1d or column vector containing more than two discrete values. An + example of a vector ``y`` for 4 samples: + + >>> import numpy as np + >>> y = np.array(['apple', 'pear', 'apple', 'orange']) + >>> print(y) + ['apple' 'pear' 'apple' 'orange'] + +- Dense or sparse :term:`binary` matrix of shape ``(n_samples, n_classes)`` + with a single sample per row, where each column represents one class. An + example of both a dense and sparse :term:`binary` matrix ``y`` for 4 + samples, where the columns, in order, are apple, orange, and pear: + + >>> import numpy as np + >>> from sklearn.preprocessing import LabelBinarizer + >>> y = np.array(['apple', 'pear', 'apple', 'orange']) + >>> y_dense = LabelBinarizer().fit_transform(y) + >>> print(y_dense) + [[1 0 0] + [0 0 1] + [1 0 0] + [0 1 0]] + >>> from scipy import sparse + >>> y_sparse = sparse.csr_matrix(y_dense) + >>> print(y_sparse) + (0, 0) 1 + (1, 2) 1 + (2, 0) 1 + (3, 1) 1 For more information about :class:`~sklearn.preprocessing.LabelBinarizer`, refer to :ref:`preprocessing_targets`. diff --git a/doc/modules/neighbors.rst b/doc/modules/neighbors.rst index 81543be3b494e..b77f1952bece8 100644 --- a/doc/modules/neighbors.rst +++ b/doc/modules/neighbors.rst @@ -59,12 +59,12 @@ The choice of neighbors search algorithm is controlled through the keyword from the training data. For a discussion of the strengths and weaknesses of each option, see `Nearest Neighbor Algorithms`_. - .. warning:: +.. warning:: - Regarding the Nearest Neighbors algorithms, if two - neighbors :math:`k+1` and :math:`k` have identical distances - but different labels, the result will depend on the ordering of the - training data. 
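
As a brief sketch of the ``algorithm`` keyword mentioned above (the random data
and the choice of ``'ball_tree'`` are arbitrary, for illustration only)::

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    X = np.random.RandomState(0).rand(20, 3)
    nbrs = NearestNeighbors(n_neighbors=3, algorithm="ball_tree").fit(X)
    # distances and indices of the 3 nearest neighbors of each sample
    distances, indices = nbrs.kneighbors(X)
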
+ Regarding the Nearest Neighbors algorithms, if two + neighbors :math:`k+1` and :math:`k` have identical distances + but different labels, the result will depend on the ordering of the + training data. Finding the Nearest Neighbors ----------------------------- diff --git a/doc/modules/neural_networks_supervised.rst b/doc/modules/neural_networks_supervised.rst index 388f32e7c6925..64b394b2db7c5 100644 --- a/doc/modules/neural_networks_supervised.rst +++ b/doc/modules/neural_networks_supervised.rst @@ -51,22 +51,22 @@ at index :math:`i` represents the bias values added to layer :math:`i+1`. The advantages of Multi-layer Perceptron are: - + Capability to learn non-linear models. ++ Capability to learn non-linear models. - + Capability to learn models in real-time (on-line learning) - using ``partial_fit``. ++ Capability to learn models in real-time (on-line learning) + using ``partial_fit``. The disadvantages of Multi-layer Perceptron (MLP) include: - + MLP with hidden layers have a non-convex loss function where there exists - more than one local minimum. Therefore different random weight - initializations can lead to different validation accuracy. ++ MLP with hidden layers have a non-convex loss function where there exists + more than one local minimum. Therefore different random weight + initializations can lead to different validation accuracy. - + MLP requires tuning a number of hyperparameters such as the number of - hidden neurons, layers, and iterations. ++ MLP requires tuning a number of hyperparameters such as the number of + hidden neurons, layers, and iterations. - + MLP is sensitive to feature scaling. ++ MLP is sensitive to feature scaling. Please see :ref:`Tips on Practical Use ` section that addresses some of these disadvantages. @@ -311,35 +311,35 @@ when the improvement in loss is below a certain, small number. Tips on Practical Use ===================== - * Multi-layer Perceptron is sensitive to feature scaling, so it - is highly recommended to scale your data. For example, scale each - attribute on the input vector X to [0, 1] or [-1, +1], or standardize - it to have mean 0 and variance 1. Note that you must apply the *same* - scaling to the test set for meaningful results. - You can use :class:`~sklearn.preprocessing.StandardScaler` for standardization. - - >>> from sklearn.preprocessing import StandardScaler # doctest: +SKIP - >>> scaler = StandardScaler() # doctest: +SKIP - >>> # Don't cheat - fit only on training data - >>> scaler.fit(X_train) # doctest: +SKIP - >>> X_train = scaler.transform(X_train) # doctest: +SKIP - >>> # apply same transformation to test data - >>> X_test = scaler.transform(X_test) # doctest: +SKIP - - An alternative and recommended approach is to use - :class:`~sklearn.preprocessing.StandardScaler` in a - :class:`~sklearn.pipeline.Pipeline` - - * Finding a reasonable regularization parameter :math:`\alpha` is best done - using :class:`~sklearn.model_selection.GridSearchCV`, usually in the range - ``10.0 ** -np.arange(1, 7)``. - - * Empirically, we observed that `L-BFGS` converges faster and - with better solutions on small datasets. For relatively large - datasets, however, `Adam` is very robust. It usually converges - quickly and gives pretty good performance. `SGD` with momentum or - nesterov's momentum, on the other hand, can perform better than - those two algorithms if learning rate is correctly tuned. +* Multi-layer Perceptron is sensitive to feature scaling, so it + is highly recommended to scale your data. 
For example, scale each + attribute on the input vector X to [0, 1] or [-1, +1], or standardize + it to have mean 0 and variance 1. Note that you must apply the *same* + scaling to the test set for meaningful results. + You can use :class:`~sklearn.preprocessing.StandardScaler` for standardization. + + >>> from sklearn.preprocessing import StandardScaler # doctest: +SKIP + >>> scaler = StandardScaler() # doctest: +SKIP + >>> # Don't cheat - fit only on training data + >>> scaler.fit(X_train) # doctest: +SKIP + >>> X_train = scaler.transform(X_train) # doctest: +SKIP + >>> # apply same transformation to test data + >>> X_test = scaler.transform(X_test) # doctest: +SKIP + + An alternative and recommended approach is to use + :class:`~sklearn.preprocessing.StandardScaler` in a + :class:`~sklearn.pipeline.Pipeline` + +* Finding a reasonable regularization parameter :math:`\alpha` is best done + using :class:`~sklearn.model_selection.GridSearchCV`, usually in the range + ``10.0 ** -np.arange(1, 7)``. + +* Empirically, we observed that `L-BFGS` converges faster and + with better solutions on small datasets. For relatively large + datasets, however, `Adam` is very robust. It usually converges + quickly and gives pretty good performance. `SGD` with momentum or + nesterov's momentum, on the other hand, can perform better than + those two algorithms if learning rate is correctly tuned. More control with warm_start ============================ diff --git a/doc/modules/outlier_detection.rst b/doc/modules/outlier_detection.rst index 572674328108d..d003b645eb19c 100644 --- a/doc/modules/outlier_detection.rst +++ b/doc/modules/outlier_detection.rst @@ -411,7 +411,7 @@ Note that ``fit_predict`` is not available in this case to avoid inconsistencies Novelty detection with Local Outlier Factor is illustrated below. - .. figure:: ../auto_examples/neighbors/images/sphx_glr_plot_lof_novelty_detection_001.png - :target: ../auto_examples/neighbors/plot_lof_novelty_detection.html - :align: center - :scale: 75% +.. figure:: ../auto_examples/neighbors/images/sphx_glr_plot_lof_novelty_detection_001.png + :target: ../auto_examples/neighbors/plot_lof_novelty_detection.html + :align: center + :scale: 75% diff --git a/doc/modules/preprocessing.rst b/doc/modules/preprocessing.rst index 475098c0d685c..b619b88110d63 100644 --- a/doc/modules/preprocessing.rst +++ b/doc/modules/preprocessing.rst @@ -1008,9 +1008,9 @@ For each feature, the bin edges are computed during ``fit`` and together with the number of bins, they will define the intervals. Therefore, for the current example, these intervals are defined as: - - feature 1: :math:`{[-\infty, -1), [-1, 2), [2, \infty)}` - - feature 2: :math:`{[-\infty, 5), [5, \infty)}` - - feature 3: :math:`{[-\infty, 14), [14, \infty)}` +- feature 1: :math:`{[-\infty, -1), [-1, 2), [2, \infty)}` +- feature 2: :math:`{[-\infty, 5), [5, \infty)}` +- feature 3: :math:`{[-\infty, 14), [14, \infty)}` Based on these bin intervals, ``X`` is transformed as follows:: @@ -1199,23 +1199,23 @@ below. Some of the advantages of splines over polynomials are: - - B-splines are very flexible and robust if you keep a fixed low degree, - usually 3, and parsimoniously adapt the number of knots. Polynomials - would need a higher degree, which leads to the next point. - - B-splines do not have oscillatory behaviour at the boundaries as have - polynomials (the higher the degree, the worse). This is known as `Runge's - phenomenon `_. 
- - B-splines provide good options for extrapolation beyond the boundaries, - i.e. beyond the range of fitted values. Have a look at the option - ``extrapolation``. - - B-splines generate a feature matrix with a banded structure. For a single - feature, every row contains only ``degree + 1`` non-zero elements, which - occur consecutively and are even positive. This results in a matrix with - good numerical properties, e.g. a low condition number, in sharp contrast - to a matrix of polynomials, which goes under the name - `Vandermonde matrix `_. - A low condition number is important for stable algorithms of linear - models. +- B-splines are very flexible and robust if you keep a fixed low degree, + usually 3, and parsimoniously adapt the number of knots. Polynomials + would need a higher degree, which leads to the next point. +- B-splines do not have oscillatory behaviour at the boundaries as have + polynomials (the higher the degree, the worse). This is known as `Runge's + phenomenon `_. +- B-splines provide good options for extrapolation beyond the boundaries, + i.e. beyond the range of fitted values. Have a look at the option + ``extrapolation``. +- B-splines generate a feature matrix with a banded structure. For a single + feature, every row contains only ``degree + 1`` non-zero elements, which + occur consecutively and are even positive. This results in a matrix with + good numerical properties, e.g. a low condition number, in sharp contrast + to a matrix of polynomials, which goes under the name + `Vandermonde matrix `_. + A low condition number is important for stable algorithms of linear + models. The following code snippet shows splines in action:: diff --git a/doc/modules/semi_supervised.rst b/doc/modules/semi_supervised.rst index 47e8bfffdd9a7..f8cae0a9ddcdf 100644 --- a/doc/modules/semi_supervised.rst +++ b/doc/modules/semi_supervised.rst @@ -121,11 +121,11 @@ Label propagation models have two built-in kernel methods. Choice of kernel effects both scalability and performance of the algorithms. The following are available: - * rbf (:math:`\exp(-\gamma |x-y|^2), \gamma > 0`). :math:`\gamma` is - specified by keyword gamma. +* rbf (:math:`\exp(-\gamma |x-y|^2), \gamma > 0`). :math:`\gamma` is + specified by keyword gamma. - * knn (:math:`1[x' \in kNN(x)]`). :math:`k` is specified by keyword - n_neighbors. +* knn (:math:`1[x' \in kNN(x)]`). :math:`k` is specified by keyword + n_neighbors. The RBF kernel will produce a fully connected graph which is represented in memory by a dense matrix. This matrix may be very large and combined with the cost of diff --git a/doc/modules/sgd.rst b/doc/modules/sgd.rst index b37a0209af24d..a7981e9d4ec28 100644 --- a/doc/modules/sgd.rst +++ b/doc/modules/sgd.rst @@ -36,16 +36,16 @@ different means. The advantages of Stochastic Gradient Descent are: - + Efficiency. ++ Efficiency. - + Ease of implementation (lots of opportunities for code tuning). ++ Ease of implementation (lots of opportunities for code tuning). The disadvantages of Stochastic Gradient Descent include: - + SGD requires a number of hyperparameters such as the regularization - parameter and the number of iterations. ++ SGD requires a number of hyperparameters such as the regularization + parameter and the number of iterations. - + SGD is sensitive to feature scaling. ++ SGD is sensitive to feature scaling. .. warning:: @@ -111,12 +111,12 @@ the coefficients and the input sample, plus the intercept) is given by The concrete loss function can be set via the ``loss`` parameter. 
:class:`SGDClassifier` supports the following loss functions: - * ``loss="hinge"``: (soft-margin) linear Support Vector Machine, - * ``loss="modified_huber"``: smoothed hinge loss, - * ``loss="log_loss"``: logistic regression, - * and all regression losses below. In this case the target is encoded as -1 - or 1, and the problem is treated as a regression problem. The predicted - class then correspond to the sign of the predicted target. +* ``loss="hinge"``: (soft-margin) linear Support Vector Machine, +* ``loss="modified_huber"``: smoothed hinge loss, +* ``loss="log_loss"``: logistic regression, +* and all regression losses below. In this case the target is encoded as -1 + or 1, and the problem is treated as a regression problem. The predicted + class then correspond to the sign of the predicted target. Please refer to the :ref:`mathematical section below ` for formulas. @@ -136,10 +136,10 @@ Using ``loss="log_loss"`` or ``loss="modified_huber"`` enables the The concrete penalty can be set via the ``penalty`` parameter. SGD supports the following penalties: - * ``penalty="l2"``: L2 norm penalty on ``coef_``. - * ``penalty="l1"``: L1 norm penalty on ``coef_``. - * ``penalty="elasticnet"``: Convex combination of L2 and L1; - ``(1 - l1_ratio) * L2 + l1_ratio * L1``. +* ``penalty="l2"``: L2 norm penalty on ``coef_``. +* ``penalty="l1"``: L1 norm penalty on ``coef_``. +* ``penalty="elasticnet"``: Convex combination of L2 and L1; + ``(1 - l1_ratio) * L2 + l1_ratio * L1``. The default setting is ``penalty="l2"``. The L1 penalty leads to sparse solutions, driving most coefficients to zero. The Elastic Net [#5]_ solves @@ -211,9 +211,9 @@ samples (> 10.000), for other problems we recommend :class:`Ridge`, The concrete loss function can be set via the ``loss`` parameter. :class:`SGDRegressor` supports the following loss functions: - * ``loss="squared_error"``: Ordinary least squares, - * ``loss="huber"``: Huber loss for robust regression, - * ``loss="epsilon_insensitive"``: linear Support Vector Regression. +* ``loss="squared_error"``: Ordinary least squares, +* ``loss="huber"``: Huber loss for robust regression, +* ``loss="epsilon_insensitive"``: linear Support Vector Regression. Please refer to the :ref:`mathematical section below ` for formulas. @@ -327,14 +327,14 @@ Stopping criterion The classes :class:`SGDClassifier` and :class:`SGDRegressor` provide two criteria to stop the algorithm when a given level of convergence is reached: - * With ``early_stopping=True``, the input data is split into a training set - and a validation set. The model is then fitted on the training set, and the - stopping criterion is based on the prediction score (using the `score` - method) computed on the validation set. The size of the validation set - can be changed with the parameter ``validation_fraction``. - * With ``early_stopping=False``, the model is fitted on the entire input data - and the stopping criterion is based on the objective function computed on - the training data. +* With ``early_stopping=True``, the input data is split into a training set + and a validation set. The model is then fitted on the training set, and the + stopping criterion is based on the prediction score (using the `score` + method) computed on the validation set. The size of the validation set + can be changed with the parameter ``validation_fraction``. +* With ``early_stopping=False``, the model is fitted on the entire input data + and the stopping criterion is based on the objective function computed on + the training data. 
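
A minimal sketch of the first criterion; the synthetic data and the particular
parameter values are illustrative only::

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    clf = SGDClassifier(
        early_stopping=True,      # hold out part of the training data
        validation_fraction=0.2,  # size of that held-out validation set
        n_iter_no_change=5,       # stop after 5 epochs without improvement
        random_state=0,
    ).fit(X, y)
    print(clf.n_iter_)  # number of epochs actually performed
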
In both cases, the criterion is evaluated once by epoch, and the algorithm stops when the criterion does not improve ``n_iter_no_change`` times in a row. The @@ -345,45 +345,45 @@ stops in any case after a maximum number of iteration ``max_iter``. Tips on Practical Use ===================== - * Stochastic Gradient Descent is sensitive to feature scaling, so it - is highly recommended to scale your data. For example, scale each - attribute on the input vector X to [0,1] or [-1,+1], or standardize - it to have mean 0 and variance 1. Note that the *same* scaling must be - applied to the test vector to obtain meaningful results. This can be easily - done using :class:`~sklearn.preprocessing.StandardScaler`:: - - from sklearn.preprocessing import StandardScaler - scaler = StandardScaler() - scaler.fit(X_train) # Don't cheat - fit only on training data - X_train = scaler.transform(X_train) - X_test = scaler.transform(X_test) # apply same transformation to test data - - # Or better yet: use a pipeline! - from sklearn.pipeline import make_pipeline - est = make_pipeline(StandardScaler(), SGDClassifier()) - est.fit(X_train) - est.predict(X_test) - - If your attributes have an intrinsic scale (e.g. word frequencies or - indicator features) scaling is not needed. - - * Finding a reasonable regularization term :math:`\alpha` is - best done using automatic hyper-parameter search, e.g. - :class:`~sklearn.model_selection.GridSearchCV` or - :class:`~sklearn.model_selection.RandomizedSearchCV`, usually in the - range ``10.0**-np.arange(1,7)``. - - * Empirically, we found that SGD converges after observing - approximately 10^6 training samples. Thus, a reasonable first guess - for the number of iterations is ``max_iter = np.ceil(10**6 / n)``, - where ``n`` is the size of the training set. - - * If you apply SGD to features extracted using PCA we found that - it is often wise to scale the feature values by some constant `c` - such that the average L2 norm of the training data equals one. - - * We found that Averaged SGD works best with a larger number of features - and a higher eta0 +* Stochastic Gradient Descent is sensitive to feature scaling, so it + is highly recommended to scale your data. For example, scale each + attribute on the input vector X to [0,1] or [-1,+1], or standardize + it to have mean 0 and variance 1. Note that the *same* scaling must be + applied to the test vector to obtain meaningful results. This can be easily + done using :class:`~sklearn.preprocessing.StandardScaler`:: + + from sklearn.preprocessing import StandardScaler + scaler = StandardScaler() + scaler.fit(X_train) # Don't cheat - fit only on training data + X_train = scaler.transform(X_train) + X_test = scaler.transform(X_test) # apply same transformation to test data + + # Or better yet: use a pipeline! + from sklearn.pipeline import make_pipeline + est = make_pipeline(StandardScaler(), SGDClassifier()) + est.fit(X_train) + est.predict(X_test) + + If your attributes have an intrinsic scale (e.g. word frequencies or + indicator features) scaling is not needed. + +* Finding a reasonable regularization term :math:`\alpha` is + best done using automatic hyper-parameter search, e.g. + :class:`~sklearn.model_selection.GridSearchCV` or + :class:`~sklearn.model_selection.RandomizedSearchCV`, usually in the + range ``10.0**-np.arange(1,7)``. + +* Empirically, we found that SGD converges after observing + approximately 10^6 training samples. 
Thus, a reasonable first guess + for the number of iterations is ``max_iter = np.ceil(10**6 / n)``, + where ``n`` is the size of the training set. + +* If you apply SGD to features extracted using PCA we found that + it is often wise to scale the feature values by some constant `c` + such that the average L2 norm of the training data equals one. + +* We found that Averaged SGD works best with a larger number of features + and a higher eta0. .. topic:: References: @@ -454,12 +454,12 @@ misclassification error (Zero-one loss) as shown in the Figure below. Popular choices for the regularization term :math:`R` (the `penalty` parameter) include: - - L2 norm: :math:`R(w) := \frac{1}{2} \sum_{j=1}^{m} w_j^2 = ||w||_2^2`, - - L1 norm: :math:`R(w) := \sum_{j=1}^{m} |w_j|`, which leads to sparse - solutions. - - Elastic Net: :math:`R(w) := \frac{\rho}{2} \sum_{j=1}^{n} w_j^2 + - (1-\rho) \sum_{j=1}^{m} |w_j|`, a convex combination of L2 and L1, where - :math:`\rho` is given by ``1 - l1_ratio``. +- L2 norm: :math:`R(w) := \frac{1}{2} \sum_{j=1}^{m} w_j^2 = ||w||_2^2`, +- L1 norm: :math:`R(w) := \sum_{j=1}^{m} |w_j|`, which leads to sparse + solutions. +- Elastic Net: :math:`R(w) := \frac{\rho}{2} \sum_{j=1}^{n} w_j^2 + + (1-\rho) \sum_{j=1}^{m} |w_j|`, a convex combination of L2 and L1, where + :math:`\rho` is given by ``1 - l1_ratio``. The Figure below shows the contours of the different regularization terms in a 2-dimensional parameter space (:math:`m=2`) when :math:`R(w) = 1`. diff --git a/doc/modules/svm.rst b/doc/modules/svm.rst index 1a8b6d6c5741e..06eee7de50855 100644 --- a/doc/modules/svm.rst +++ b/doc/modules/svm.rst @@ -16,27 +16,27 @@ methods used for :ref:`classification `, The advantages of support vector machines are: - - Effective in high dimensional spaces. +- Effective in high dimensional spaces. - - Still effective in cases where number of dimensions is greater - than the number of samples. +- Still effective in cases where number of dimensions is greater + than the number of samples. - - Uses a subset of training points in the decision function (called - support vectors), so it is also memory efficient. +- Uses a subset of training points in the decision function (called + support vectors), so it is also memory efficient. - - Versatile: different :ref:`svm_kernels` can be - specified for the decision function. Common kernels are - provided, but it is also possible to specify custom kernels. +- Versatile: different :ref:`svm_kernels` can be + specified for the decision function. Common kernels are + provided, but it is also possible to specify custom kernels. The disadvantages of support vector machines include: - - If the number of features is much greater than the number of - samples, avoid over-fitting in choosing :ref:`svm_kernels` and regularization - term is crucial. +- If the number of features is much greater than the number of + samples, avoid over-fitting in choosing :ref:`svm_kernels` and regularization + term is crucial. - - SVMs do not directly provide probability estimates, these are - calculated using an expensive five-fold cross-validation - (see :ref:`Scores and probabilities `, below). +- SVMs do not directly provide probability estimates, these are + calculated using an expensive five-fold cross-validation + (see :ref:`Scores and probabilities `, below). 
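
As a sketch of the last point, probability estimates must be requested at
construction time, which triggers the internal cross-validation during ``fit``;
the toy data here is a placeholder::

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=100, random_state=0)
    clf = SVC(probability=True, random_state=0).fit(X, y)  # slower fit: internal CV
    print(clf.predict_proba(X[:2]))
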
The support vector machines in scikit-learn support both dense (``numpy.ndarray`` and convertible to that by ``numpy.asarray``) and @@ -381,95 +381,95 @@ Tips on Practical Use ===================== - * **Avoiding data copy**: For :class:`SVC`, :class:`SVR`, :class:`NuSVC` and - :class:`NuSVR`, if the data passed to certain methods is not C-ordered - contiguous and double precision, it will be copied before calling the - underlying C implementation. You can check whether a given numpy array is - C-contiguous by inspecting its ``flags`` attribute. - - For :class:`LinearSVC` (and :class:`LogisticRegression - `) any input passed as a numpy - array will be copied and converted to the `liblinear`_ internal sparse data - representation (double precision floats and int32 indices of non-zero - components). If you want to fit a large-scale linear classifier without - copying a dense numpy C-contiguous double precision array as input, we - suggest to use the :class:`SGDClassifier - ` class instead. The objective - function can be configured to be almost the same as the :class:`LinearSVC` - model. - - * **Kernel cache size**: For :class:`SVC`, :class:`SVR`, :class:`NuSVC` and - :class:`NuSVR`, the size of the kernel cache has a strong impact on run - times for larger problems. If you have enough RAM available, it is - recommended to set ``cache_size`` to a higher value than the default of - 200(MB), such as 500(MB) or 1000(MB). - - - * **Setting C**: ``C`` is ``1`` by default and it's a reasonable default - choice. If you have a lot of noisy observations you should decrease it: - decreasing C corresponds to more regularization. - - :class:`LinearSVC` and :class:`LinearSVR` are less sensitive to ``C`` when - it becomes large, and prediction results stop improving after a certain - threshold. Meanwhile, larger ``C`` values will take more time to train, - sometimes up to 10 times longer, as shown in [#3]_. - - * Support Vector Machine algorithms are not scale invariant, so **it - is highly recommended to scale your data**. For example, scale each - attribute on the input vector X to [0,1] or [-1,+1], or standardize it - to have mean 0 and variance 1. Note that the *same* scaling must be - applied to the test vector to obtain meaningful results. This can be done - easily by using a :class:`~sklearn.pipeline.Pipeline`:: - - >>> from sklearn.pipeline import make_pipeline - >>> from sklearn.preprocessing import StandardScaler - >>> from sklearn.svm import SVC - - >>> clf = make_pipeline(StandardScaler(), SVC()) - - See section :ref:`preprocessing` for more details on scaling and - normalization. - - .. _shrinking_svm: - - * Regarding the `shrinking` parameter, quoting [#4]_: *We found that if the - number of iterations is large, then shrinking can shorten the training - time. However, if we loosely solve the optimization problem (e.g., by - using a large stopping tolerance), the code without using shrinking may - be much faster* - - * Parameter ``nu`` in :class:`NuSVC`/:class:`OneClassSVM`/:class:`NuSVR` - approximates the fraction of training errors and support vectors. - - * In :class:`SVC`, if the data is unbalanced (e.g. many - positive and few negative), set ``class_weight='balanced'`` and/or try - different penalty parameters ``C``. - - * **Randomness of the underlying implementations**: The underlying - implementations of :class:`SVC` and :class:`NuSVC` use a random number - generator only to shuffle the data for probability estimation (when - ``probability`` is set to ``True``). 
This randomness can be controlled - with the ``random_state`` parameter. If ``probability`` is set to ``False`` - these estimators are not random and ``random_state`` has no effect on the - results. The underlying :class:`OneClassSVM` implementation is similar to - the ones of :class:`SVC` and :class:`NuSVC`. As no probability estimation - is provided for :class:`OneClassSVM`, it is not random. - - The underlying :class:`LinearSVC` implementation uses a random number - generator to select features when fitting the model with a dual coordinate - descent (i.e. when ``dual`` is set to ``True``). It is thus not uncommon - to have slightly different results for the same input data. If that - happens, try with a smaller `tol` parameter. This randomness can also be - controlled with the ``random_state`` parameter. When ``dual`` is - set to ``False`` the underlying implementation of :class:`LinearSVC` is - not random and ``random_state`` has no effect on the results. - - * Using L1 penalization as provided by ``LinearSVC(penalty='l1', - dual=False)`` yields a sparse solution, i.e. only a subset of feature - weights is different from zero and contribute to the decision function. - Increasing ``C`` yields a more complex model (more features are selected). - The ``C`` value that yields a "null" model (all weights equal to zero) can - be calculated using :func:`l1_min_c`. +* **Avoiding data copy**: For :class:`SVC`, :class:`SVR`, :class:`NuSVC` and + :class:`NuSVR`, if the data passed to certain methods is not C-ordered + contiguous and double precision, it will be copied before calling the + underlying C implementation. You can check whether a given numpy array is + C-contiguous by inspecting its ``flags`` attribute. + + For :class:`LinearSVC` (and :class:`LogisticRegression + `) any input passed as a numpy + array will be copied and converted to the `liblinear`_ internal sparse data + representation (double precision floats and int32 indices of non-zero + components). If you want to fit a large-scale linear classifier without + copying a dense numpy C-contiguous double precision array as input, we + suggest to use the :class:`SGDClassifier + ` class instead. The objective + function can be configured to be almost the same as the :class:`LinearSVC` + model. + +* **Kernel cache size**: For :class:`SVC`, :class:`SVR`, :class:`NuSVC` and + :class:`NuSVR`, the size of the kernel cache has a strong impact on run + times for larger problems. If you have enough RAM available, it is + recommended to set ``cache_size`` to a higher value than the default of + 200(MB), such as 500(MB) or 1000(MB). + + +* **Setting C**: ``C`` is ``1`` by default and it's a reasonable default + choice. If you have a lot of noisy observations you should decrease it: + decreasing C corresponds to more regularization. + + :class:`LinearSVC` and :class:`LinearSVR` are less sensitive to ``C`` when + it becomes large, and prediction results stop improving after a certain + threshold. Meanwhile, larger ``C`` values will take more time to train, + sometimes up to 10 times longer, as shown in [#3]_. + +* Support Vector Machine algorithms are not scale invariant, so **it + is highly recommended to scale your data**. For example, scale each + attribute on the input vector X to [0,1] or [-1,+1], or standardize it + to have mean 0 and variance 1. Note that the *same* scaling must be + applied to the test vector to obtain meaningful results. 
This can be done + easily by using a :class:`~sklearn.pipeline.Pipeline`:: + + >>> from sklearn.pipeline import make_pipeline + >>> from sklearn.preprocessing import StandardScaler + >>> from sklearn.svm import SVC + + >>> clf = make_pipeline(StandardScaler(), SVC()) + + See section :ref:`preprocessing` for more details on scaling and + normalization. + +.. _shrinking_svm: + +* Regarding the `shrinking` parameter, quoting [#4]_: *We found that if the + number of iterations is large, then shrinking can shorten the training + time. However, if we loosely solve the optimization problem (e.g., by + using a large stopping tolerance), the code without using shrinking may + be much faster* + +* Parameter ``nu`` in :class:`NuSVC`/:class:`OneClassSVM`/:class:`NuSVR` + approximates the fraction of training errors and support vectors. + +* In :class:`SVC`, if the data is unbalanced (e.g. many + positive and few negative), set ``class_weight='balanced'`` and/or try + different penalty parameters ``C``. + +* **Randomness of the underlying implementations**: The underlying + implementations of :class:`SVC` and :class:`NuSVC` use a random number + generator only to shuffle the data for probability estimation (when + ``probability`` is set to ``True``). This randomness can be controlled + with the ``random_state`` parameter. If ``probability`` is set to ``False`` + these estimators are not random and ``random_state`` has no effect on the + results. The underlying :class:`OneClassSVM` implementation is similar to + the ones of :class:`SVC` and :class:`NuSVC`. As no probability estimation + is provided for :class:`OneClassSVM`, it is not random. + + The underlying :class:`LinearSVC` implementation uses a random number + generator to select features when fitting the model with a dual coordinate + descent (i.e. when ``dual`` is set to ``True``). It is thus not uncommon + to have slightly different results for the same input data. If that + happens, try with a smaller `tol` parameter. This randomness can also be + controlled with the ``random_state`` parameter. When ``dual`` is + set to ``False`` the underlying implementation of :class:`LinearSVC` is + not random and ``random_state`` has no effect on the results. + +* Using L1 penalization as provided by ``LinearSVC(penalty='l1', + dual=False)`` yields a sparse solution, i.e. only a subset of feature + weights is different from zero and contribute to the decision function. + Increasing ``C`` yields a more complex model (more features are selected). + The ``C`` value that yields a "null" model (all weights equal to zero) can + be calculated using :func:`l1_min_c`. .. _svm_kernels: @@ -479,16 +479,16 @@ Kernel functions The *kernel function* can be any of the following: - * linear: :math:`\langle x, x'\rangle`. +* linear: :math:`\langle x, x'\rangle`. - * polynomial: :math:`(\gamma \langle x, x'\rangle + r)^d`, where - :math:`d` is specified by parameter ``degree``, :math:`r` by ``coef0``. +* polynomial: :math:`(\gamma \langle x, x'\rangle + r)^d`, where + :math:`d` is specified by parameter ``degree``, :math:`r` by ``coef0``. - * rbf: :math:`\exp(-\gamma \|x-x'\|^2)`, where :math:`\gamma` is - specified by parameter ``gamma``, must be greater than 0. +* rbf: :math:`\exp(-\gamma \|x-x'\|^2)`, where :math:`\gamma` is + specified by parameter ``gamma``, must be greater than 0. - * sigmoid :math:`\tanh(\gamma \langle x,x'\rangle + r)`, - where :math:`r` is specified by ``coef0``. 
+* sigmoid :math:`\tanh(\gamma \langle x,x'\rangle + r)`, + where :math:`r` is specified by ``coef0``. Different kernels are specified by the `kernel` parameter:: @@ -530,12 +530,12 @@ python function or by precomputing the Gram matrix. Classifiers with custom kernels behave the same way as any other classifiers, except that: - * Field ``support_vectors_`` is now empty, only indices of support - vectors are stored in ``support_`` +* Field ``support_vectors_`` is now empty, only indices of support + vectors are stored in ``support_`` - * A reference (and not a copy) of the first argument in the ``fit()`` - method is stored for future reference. If that array changes between the - use of ``fit()`` and ``predict()`` you will have unexpected results. +* A reference (and not a copy) of the first argument in the ``fit()`` + method is stored for future reference. If that array changes between the + use of ``fit()`` and ``predict()`` you will have unexpected results. |details-start| diff --git a/doc/modules/tree.rst b/doc/modules/tree.rst index e0a55547f4dea..b54b913573a34 100644 --- a/doc/modules/tree.rst +++ b/doc/modules/tree.rst @@ -23,68 +23,68 @@ the tree, the more complex the decision rules and the fitter the model. Some advantages of decision trees are: - - Simple to understand and to interpret. Trees can be visualized. +- Simple to understand and to interpret. Trees can be visualized. - - Requires little data preparation. Other techniques often require data - normalization, dummy variables need to be created and blank values to - be removed. Some tree and algorithm combinations support - :ref:`missing values `. +- Requires little data preparation. Other techniques often require data + normalization, dummy variables need to be created and blank values to + be removed. Some tree and algorithm combinations support + :ref:`missing values `. - - The cost of using the tree (i.e., predicting data) is logarithmic in the - number of data points used to train the tree. +- The cost of using the tree (i.e., predicting data) is logarithmic in the + number of data points used to train the tree. - - Able to handle both numerical and categorical data. However, the scikit-learn - implementation does not support categorical variables for now. Other - techniques are usually specialized in analyzing datasets that have only one type - of variable. See :ref:`algorithms ` for more - information. +- Able to handle both numerical and categorical data. However, the scikit-learn + implementation does not support categorical variables for now. Other + techniques are usually specialized in analyzing datasets that have only one type + of variable. See :ref:`algorithms ` for more + information. - - Able to handle multi-output problems. +- Able to handle multi-output problems. - - Uses a white box model. If a given situation is observable in a model, - the explanation for the condition is easily explained by boolean logic. - By contrast, in a black box model (e.g., in an artificial neural - network), results may be more difficult to interpret. +- Uses a white box model. If a given situation is observable in a model, + the explanation for the condition is easily explained by boolean logic. + By contrast, in a black box model (e.g., in an artificial neural + network), results may be more difficult to interpret. - - Possible to validate a model using statistical tests. That makes it - possible to account for the reliability of the model. +- Possible to validate a model using statistical tests. 
That makes it + possible to account for the reliability of the model. - - Performs well even if its assumptions are somewhat violated by - the true model from which the data were generated. +- Performs well even if its assumptions are somewhat violated by + the true model from which the data were generated. The disadvantages of decision trees include: - - Decision-tree learners can create over-complex trees that do not - generalize the data well. This is called overfitting. Mechanisms - such as pruning, setting the minimum number of samples required - at a leaf node or setting the maximum depth of the tree are - necessary to avoid this problem. +- Decision-tree learners can create over-complex trees that do not + generalize the data well. This is called overfitting. Mechanisms + such as pruning, setting the minimum number of samples required + at a leaf node or setting the maximum depth of the tree are + necessary to avoid this problem. - - Decision trees can be unstable because small variations in the - data might result in a completely different tree being generated. - This problem is mitigated by using decision trees within an - ensemble. +- Decision trees can be unstable because small variations in the + data might result in a completely different tree being generated. + This problem is mitigated by using decision trees within an + ensemble. - - Predictions of decision trees are neither smooth nor continuous, but - piecewise constant approximations as seen in the above figure. Therefore, - they are not good at extrapolation. +- Predictions of decision trees are neither smooth nor continuous, but + piecewise constant approximations as seen in the above figure. Therefore, + they are not good at extrapolation. - - The problem of learning an optimal decision tree is known to be - NP-complete under several aspects of optimality and even for simple - concepts. Consequently, practical decision-tree learning algorithms - are based on heuristic algorithms such as the greedy algorithm where - locally optimal decisions are made at each node. Such algorithms - cannot guarantee to return the globally optimal decision tree. This - can be mitigated by training multiple trees in an ensemble learner, - where the features and samples are randomly sampled with replacement. +- The problem of learning an optimal decision tree is known to be + NP-complete under several aspects of optimality and even for simple + concepts. Consequently, practical decision-tree learning algorithms + are based on heuristic algorithms such as the greedy algorithm where + locally optimal decisions are made at each node. Such algorithms + cannot guarantee to return the globally optimal decision tree. This + can be mitigated by training multiple trees in an ensemble learner, + where the features and samples are randomly sampled with replacement. - - There are concepts that are hard to learn because decision trees - do not express them easily, such as XOR, parity or multiplexer problems. +- There are concepts that are hard to learn because decision trees + do not express them easily, such as XOR, parity or multiplexer problems. - - Decision tree learners create biased trees if some classes dominate. - It is therefore recommended to balance the dataset prior to fitting - with the decision tree. +- Decision tree learners create biased trees if some classes dominate. + It is therefore recommended to balance the dataset prior to fitting + with the decision tree. .. 
_tree_classification: @@ -273,19 +273,19 @@ generalization accuracy of the resulting estimator may often be increased. With regard to decision trees, this strategy can readily be used to support multi-output problems. This requires the following changes: - - Store n output values in leaves, instead of 1; - - Use splitting criteria that compute the average reduction across all - n outputs. +- Store n output values in leaves, instead of 1; +- Use splitting criteria that compute the average reduction across all + n outputs. This module offers support for multi-output problems by implementing this strategy in both :class:`DecisionTreeClassifier` and :class:`DecisionTreeRegressor`. If a decision tree is fit on an output array Y of shape ``(n_samples, n_outputs)`` then the resulting estimator will: - * Output n_output values upon ``predict``; +* Output n_output values upon ``predict``; - * Output a list of n_output arrays of class probabilities upon - ``predict_proba``. +* Output a list of n_output arrays of class probabilities upon + ``predict_proba``. The use of multi-output trees for regression is demonstrated in :ref:`sphx_glr_auto_examples_tree_plot_tree_regression_multioutput.py`. In this example, the input @@ -315,10 +315,10 @@ the lower half of those faces. **References** |details-split| - * M. Dumont et al, `Fast multi-class image annotation with random subwindows - and multiple output randomized trees - `_, International Conference on - Computer Vision Theory and Applications 2009 +* M. Dumont et al, `Fast multi-class image annotation with random subwindows + and multiple output randomized trees + `_, International Conference on + Computer Vision Theory and Applications 2009 |details-end| @@ -343,65 +343,65 @@ total cost over the entire trees (by summing the cost at each node) of Tips on practical use ===================== - * Decision trees tend to overfit on data with a large number of features. - Getting the right ratio of samples to number of features is important, since - a tree with few samples in high dimensional space is very likely to overfit. - - * Consider performing dimensionality reduction (:ref:`PCA `, - :ref:`ICA `, or :ref:`feature_selection`) beforehand to - give your tree a better chance of finding features that are discriminative. - - * :ref:`sphx_glr_auto_examples_tree_plot_unveil_tree_structure.py` will help - in gaining more insights about how the decision tree makes predictions, which is - important for understanding the important features in the data. - - * Visualize your tree as you are training by using the ``export`` - function. Use ``max_depth=3`` as an initial tree depth to get a feel for - how the tree is fitting to your data, and then increase the depth. - - * Remember that the number of samples required to populate the tree doubles - for each additional level the tree grows to. Use ``max_depth`` to control - the size of the tree to prevent overfitting. - - * Use ``min_samples_split`` or ``min_samples_leaf`` to ensure that multiple - samples inform every decision in the tree, by controlling which splits will - be considered. A very small number will usually mean the tree will overfit, - whereas a large number will prevent the tree from learning the data. Try - ``min_samples_leaf=5`` as an initial value. If the sample size varies - greatly, a float number can be used as percentage in these two parameters. 
- While ``min_samples_split`` can create arbitrarily small leaves, - ``min_samples_leaf`` guarantees that each leaf has a minimum size, avoiding - low-variance, over-fit leaf nodes in regression problems. For - classification with few classes, ``min_samples_leaf=1`` is often the best - choice. - - Note that ``min_samples_split`` considers samples directly and independent of - ``sample_weight``, if provided (e.g. a node with m weighted samples is still - treated as having exactly m samples). Consider ``min_weight_fraction_leaf`` or - ``min_impurity_decrease`` if accounting for sample weights is required at splits. - - * Balance your dataset before training to prevent the tree from being biased - toward the classes that are dominant. Class balancing can be done by - sampling an equal number of samples from each class, or preferably by - normalizing the sum of the sample weights (``sample_weight``) for each - class to the same value. Also note that weight-based pre-pruning criteria, - such as ``min_weight_fraction_leaf``, will then be less biased toward - dominant classes than criteria that are not aware of the sample weights, - like ``min_samples_leaf``. - - * If the samples are weighted, it will be easier to optimize the tree - structure using weight-based pre-pruning criterion such as - ``min_weight_fraction_leaf``, which ensure that leaf nodes contain at least - a fraction of the overall sum of the sample weights. - - * All decision trees use ``np.float32`` arrays internally. - If training data is not in this format, a copy of the dataset will be made. - - * If the input matrix X is very sparse, it is recommended to convert to sparse - ``csc_matrix`` before calling fit and sparse ``csr_matrix`` before calling - predict. Training time can be orders of magnitude faster for a sparse - matrix input compared to a dense matrix when features have zero values in - most of the samples. +* Decision trees tend to overfit on data with a large number of features. + Getting the right ratio of samples to number of features is important, since + a tree with few samples in high dimensional space is very likely to overfit. + +* Consider performing dimensionality reduction (:ref:`PCA `, + :ref:`ICA `, or :ref:`feature_selection`) beforehand to + give your tree a better chance of finding features that are discriminative. + +* :ref:`sphx_glr_auto_examples_tree_plot_unveil_tree_structure.py` will help + in gaining more insights about how the decision tree makes predictions, which is + important for understanding the important features in the data. + +* Visualize your tree as you are training by using the ``export`` + function. Use ``max_depth=3`` as an initial tree depth to get a feel for + how the tree is fitting to your data, and then increase the depth. + +* Remember that the number of samples required to populate the tree doubles + for each additional level the tree grows to. Use ``max_depth`` to control + the size of the tree to prevent overfitting. + +* Use ``min_samples_split`` or ``min_samples_leaf`` to ensure that multiple + samples inform every decision in the tree, by controlling which splits will + be considered. A very small number will usually mean the tree will overfit, + whereas a large number will prevent the tree from learning the data. Try + ``min_samples_leaf=5`` as an initial value. If the sample size varies + greatly, a float number can be used as percentage in these two parameters. 
+ While ``min_samples_split`` can create arbitrarily small leaves, + ``min_samples_leaf`` guarantees that each leaf has a minimum size, avoiding + low-variance, over-fit leaf nodes in regression problems. For + classification with few classes, ``min_samples_leaf=1`` is often the best + choice. + + Note that ``min_samples_split`` considers samples directly and independent of + ``sample_weight``, if provided (e.g. a node with m weighted samples is still + treated as having exactly m samples). Consider ``min_weight_fraction_leaf`` or + ``min_impurity_decrease`` if accounting for sample weights is required at splits. + +* Balance your dataset before training to prevent the tree from being biased + toward the classes that are dominant. Class balancing can be done by + sampling an equal number of samples from each class, or preferably by + normalizing the sum of the sample weights (``sample_weight``) for each + class to the same value. Also note that weight-based pre-pruning criteria, + such as ``min_weight_fraction_leaf``, will then be less biased toward + dominant classes than criteria that are not aware of the sample weights, + like ``min_samples_leaf``. + +* If the samples are weighted, it will be easier to optimize the tree + structure using weight-based pre-pruning criterion such as + ``min_weight_fraction_leaf``, which ensure that leaf nodes contain at least + a fraction of the overall sum of the sample weights. + +* All decision trees use ``np.float32`` arrays internally. + If training data is not in this format, a copy of the dataset will be made. + +* If the input matrix X is very sparse, it is recommended to convert to sparse + ``csc_matrix`` before calling fit and sparse ``csr_matrix`` before calling + predict. Training time can be orders of magnitude faster for a sparse + matrix input compared to a dense matrix when features have zero values in + most of the samples. .. _tree_algorithms: @@ -516,36 +516,36 @@ Log Loss or Entropy: H(Q_m) = - \sum_k p_{mk} \log(p_{mk}) |details-start| -Shannon entropy: +**Shannon entropy** |details-split| - The entropy criterion computes the Shannon entropy of the possible classes. It - takes the class frequencies of the training data points that reached a given - leaf :math:`m` as their probability. Using the **Shannon entropy as tree node - splitting criterion is equivalent to minimizing the log loss** (also known as - cross-entropy and multinomial deviance) between the true labels :math:`y_i` - and the probabilistic predictions :math:`T_k(x_i)` of the tree model :math:`T` for class :math:`k`. +The entropy criterion computes the Shannon entropy of the possible classes. It +takes the class frequencies of the training data points that reached a given +leaf :math:`m` as their probability. Using the **Shannon entropy as tree node +splitting criterion is equivalent to minimizing the log loss** (also known as +cross-entropy and multinomial deviance) between the true labels :math:`y_i` +and the probabilistic predictions :math:`T_k(x_i)` of the tree model :math:`T` for class :math:`k`. - To see this, first recall that the log loss of a tree model :math:`T` - computed on a dataset :math:`D` is defined as follows: +To see this, first recall that the log loss of a tree model :math:`T` +computed on a dataset :math:`D` is defined as follows: - .. math:: +.. 
math:: - \mathrm{LL}(D, T) = -\frac{1}{n} \sum_{(x_i, y_i) \in D} \sum_k I(y_i = k) \log(T_k(x_i)) + \mathrm{LL}(D, T) = -\frac{1}{n} \sum_{(x_i, y_i) \in D} \sum_k I(y_i = k) \log(T_k(x_i)) - where :math:`D` is a training dataset of :math:`n` pairs :math:`(x_i, y_i)`. +where :math:`D` is a training dataset of :math:`n` pairs :math:`(x_i, y_i)`. - In a classification tree, the predicted class probabilities within leaf nodes - are constant, that is: for all :math:`(x_i, y_i) \in Q_m`, one has: - :math:`T_k(x_i) = p_{mk}` for each class :math:`k`. +In a classification tree, the predicted class probabilities within leaf nodes +are constant, that is: for all :math:`(x_i, y_i) \in Q_m`, one has: +:math:`T_k(x_i) = p_{mk}` for each class :math:`k`. - This property makes it possible to rewrite :math:`\mathrm{LL}(D, T)` as the - sum of the Shannon entropies computed for each leaf of :math:`T` weighted by - the number of training data points that reached each leaf: +This property makes it possible to rewrite :math:`\mathrm{LL}(D, T)` as the +sum of the Shannon entropies computed for each leaf of :math:`T` weighted by +the number of training data points that reached each leaf: - .. math:: +.. math:: - \mathrm{LL}(D, T) = \sum_{m \in T} \frac{n_m}{n} H(Q_m) + \mathrm{LL}(D, T) = \sum_{m \in T} \frac{n_m}{n} H(Q_m) |details-end| @@ -605,50 +605,50 @@ the split with all the missing values going to the left node or the right node. Decisions are made as follows: - - By default when predicting, the samples with missing values are classified - with the class used in the split found during training:: +- By default when predicting, the samples with missing values are classified + with the class used in the split found during training:: - >>> from sklearn.tree import DecisionTreeClassifier - >>> import numpy as np + >>> from sklearn.tree import DecisionTreeClassifier + >>> import numpy as np - >>> X = np.array([0, 1, 6, np.nan]).reshape(-1, 1) - >>> y = [0, 0, 1, 1] + >>> X = np.array([0, 1, 6, np.nan]).reshape(-1, 1) + >>> y = [0, 0, 1, 1] - >>> tree = DecisionTreeClassifier(random_state=0).fit(X, y) - >>> tree.predict(X) - array([0, 0, 1, 1]) + >>> tree = DecisionTreeClassifier(random_state=0).fit(X, y) + >>> tree.predict(X) + array([0, 0, 1, 1]) - - If the criterion evaluation is the same for both nodes, - then the tie for missing value at predict time is broken by going to the - right node. The splitter also checks the split where all the missing - values go to one child and non-missing values go to the other:: +- If the criterion evaluation is the same for both nodes, + then the tie for missing value at predict time is broken by going to the + right node. 
The splitter also checks the split where all the missing + values go to one child and non-missing values go to the other:: - >>> from sklearn.tree import DecisionTreeClassifier - >>> import numpy as np + >>> from sklearn.tree import DecisionTreeClassifier + >>> import numpy as np - >>> X = np.array([np.nan, -1, np.nan, 1]).reshape(-1, 1) - >>> y = [0, 0, 1, 1] + >>> X = np.array([np.nan, -1, np.nan, 1]).reshape(-1, 1) + >>> y = [0, 0, 1, 1] - >>> tree = DecisionTreeClassifier(random_state=0).fit(X, y) + >>> tree = DecisionTreeClassifier(random_state=0).fit(X, y) - >>> X_test = np.array([np.nan]).reshape(-1, 1) - >>> tree.predict(X_test) - array([1]) + >>> X_test = np.array([np.nan]).reshape(-1, 1) + >>> tree.predict(X_test) + array([1]) - - If no missing values are seen during training for a given feature, then during - prediction missing values are mapped to the child with the most samples:: +- If no missing values are seen during training for a given feature, then during + prediction missing values are mapped to the child with the most samples:: - >>> from sklearn.tree import DecisionTreeClassifier - >>> import numpy as np + >>> from sklearn.tree import DecisionTreeClassifier + >>> import numpy as np - >>> X = np.array([0, 1, 2, 3]).reshape(-1, 1) - >>> y = [0, 1, 1, 1] + >>> X = np.array([0, 1, 2, 3]).reshape(-1, 1) + >>> y = [0, 1, 1, 1] - >>> tree = DecisionTreeClassifier(random_state=0).fit(X, y) + >>> tree = DecisionTreeClassifier(random_state=0).fit(X, y) - >>> X_test = np.array([np.nan]).reshape(-1, 1) - >>> tree.predict(X_test) - array([1]) + >>> X_test = np.array([np.nan]).reshape(-1, 1) + >>> tree.predict(X_test) + array([1]) .. _minimal_cost_complexity_pruning: @@ -693,17 +693,17 @@ be pruned. This process stops when the pruned tree's minimal **References** |details-split| - .. [BRE] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification - and Regression Trees. Wadsworth, Belmont, CA, 1984. +.. [BRE] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification + and Regression Trees. Wadsworth, Belmont, CA, 1984. - * https://en.wikipedia.org/wiki/Decision_tree_learning +* https://en.wikipedia.org/wiki/Decision_tree_learning - * https://en.wikipedia.org/wiki/Predictive_analytics +* https://en.wikipedia.org/wiki/Predictive_analytics - * J.R. Quinlan. C4. 5: programs for machine learning. Morgan - Kaufmann, 1993. +* J.R. Quinlan. C4. 5: programs for machine learning. Morgan + Kaufmann, 1993. - * T. Hastie, R. Tibshirani and J. Friedman. Elements of Statistical - Learning, Springer, 2009. +* T. Hastie, R. Tibshirani and J. Friedman. Elements of Statistical + Learning, Springer, 2009. |details-end| diff --git a/doc/presentations.rst b/doc/presentations.rst index 47b7f16bd74a0..19fd09218b5fd 100644 --- a/doc/presentations.rst +++ b/doc/presentations.rst @@ -37,40 +37,40 @@ Videos `_ by `Gael Varoquaux`_ at ICML 2010 - A three minute video from a very early stage of scikit-learn, explaining the - basic idea and approach we are following. + A three minute video from a very early stage of scikit-learn, explaining the + basic idea and approach we are following. - `Introduction to statistical learning with scikit-learn `_ by `Gael Varoquaux`_ at SciPy 2011 - An extensive tutorial, consisting of four sessions of one hour. - The tutorial covers the basics of machine learning, - many algorithms and how to apply them using scikit-learn. The - material corresponding is now in the scikit-learn documentation - section :ref:`stat_learn_tut_index`. 
+ An extensive tutorial, consisting of four sessions of one hour. + The tutorial covers the basics of machine learning, + many algorithms and how to apply them using scikit-learn. The + material corresponding is now in the scikit-learn documentation + section :ref:`stat_learn_tut_index`. - `Statistical Learning for Text Classification with scikit-learn and NLTK `_ (and `slides `_) by `Olivier Grisel`_ at PyCon 2011 - Thirty minute introduction to text classification. Explains how to - use NLTK and scikit-learn to solve real-world text classification - tasks and compares against cloud-based solutions. + Thirty minute introduction to text classification. Explains how to + use NLTK and scikit-learn to solve real-world text classification + tasks and compares against cloud-based solutions. - `Introduction to Interactive Predictive Analytics in Python with scikit-learn `_ by `Olivier Grisel`_ at PyCon 2012 - 3-hours long introduction to prediction tasks using scikit-learn. + 3-hours long introduction to prediction tasks using scikit-learn. - `scikit-learn - Machine Learning in Python `_ by `Jake Vanderplas`_ at the 2012 PyData workshop at Google - Interactive demonstration of some scikit-learn features. 75 minutes. + Interactive demonstration of some scikit-learn features. 75 minutes. - `scikit-learn tutorial `_ by `Jake Vanderplas`_ at PyData NYC 2012 - Presentation using the online tutorial, 45 minutes. + Presentation using the online tutorial, 45 minutes. .. _Gael Varoquaux: https://gael-varoquaux.info diff --git a/doc/support.rst b/doc/support.rst index 520bd015ff6da..bb60f49c70716 100644 --- a/doc/support.rst +++ b/doc/support.rst @@ -60,11 +60,11 @@ https://github.com/scikit-learn/scikit-learn/issues Don't forget to include: - - steps (or better script) to reproduce, +- steps (or better script) to reproduce, - - expected outcome, +- expected outcome, - - observed outcome or Python (or gdb) tracebacks +- observed outcome or Python (or gdb) tracebacks To help developers fix your bug faster, please link to a https://gist.github.com holding a standalone minimalistic python script that reproduces your bug and diff --git a/doc/tutorial/basic/tutorial.rst b/doc/tutorial/basic/tutorial.rst index d983d7806dce6..27dddb4e0e909 100644 --- a/doc/tutorial/basic/tutorial.rst +++ b/doc/tutorial/basic/tutorial.rst @@ -23,41 +23,41 @@ data), it is said to have several attributes or **features**. Learning problems fall into a few categories: - * `supervised learning `_, - in which the data comes with additional attributes that we want to predict - (:ref:`Click here ` - to go to the scikit-learn supervised learning page).This problem - can be either: - - * `classification - `_: - samples belong to two or more classes and we - want to learn from already labeled data how to predict the class - of unlabeled data. An example of a classification problem would - be handwritten digit recognition, in which the aim is - to assign each input vector to one of a finite number of discrete - categories. Another way to think of classification is as a discrete - (as opposed to continuous) form of supervised learning where one has a - limited number of categories and for each of the n samples provided, - one is to try to label them with the correct category or class. - - * `regression `_: - if the desired output consists of one or more - continuous variables, then the task is called *regression*. An - example of a regression problem would be the prediction of the - length of a salmon as a function of its age and weight. 
- - * `unsupervised learning `_, - in which the training data consists of a set of input vectors x - without any corresponding target values. The goal in such problems - may be to discover groups of similar examples within the data, where - it is called `clustering `_, - or to determine the distribution of data within the input space, known as - `density estimation `_, or - to project the data from a high-dimensional space down to two or three - dimensions for the purpose of *visualization* - (:ref:`Click here ` - to go to the Scikit-Learn unsupervised learning page). +* `supervised learning `_, + in which the data comes with additional attributes that we want to predict + (:ref:`Click here ` + to go to the scikit-learn supervised learning page).This problem + can be either: + + * `classification + `_: + samples belong to two or more classes and we + want to learn from already labeled data how to predict the class + of unlabeled data. An example of a classification problem would + be handwritten digit recognition, in which the aim is + to assign each input vector to one of a finite number of discrete + categories. Another way to think of classification is as a discrete + (as opposed to continuous) form of supervised learning where one has a + limited number of categories and for each of the n samples provided, + one is to try to label them with the correct category or class. + + * `regression `_: + if the desired output consists of one or more + continuous variables, then the task is called *regression*. An + example of a regression problem would be the prediction of the + length of a salmon as a function of its age and weight. + +* `unsupervised learning `_, + in which the training data consists of a set of input vectors x + without any corresponding target values. The goal in such problems + may be to discover groups of similar examples within the data, where + it is called `clustering `_, + or to determine the distribution of data within the input space, known as + `density estimation `_, or + to project the data from a high-dimensional space down to two or three + dimensions for the purpose of *visualization* + (:ref:`Click here ` + to go to the Scikit-Learn unsupervised learning page). .. topic:: Training set and testing set diff --git a/doc/tutorial/statistical_inference/model_selection.rst b/doc/tutorial/statistical_inference/model_selection.rst index dd0cec4de4db0..bf0290c9f7337 100644 --- a/doc/tutorial/statistical_inference/model_selection.rst +++ b/doc/tutorial/statistical_inference/model_selection.rst @@ -98,7 +98,7 @@ scoring method. ... scoring='precision_macro') array([0.96578289, 0.92708922, 0.96681476, 0.96362897, 0.93192644]) - **Cross-validation generators** +**Cross-validation generators** .. list-table:: @@ -185,8 +185,8 @@ scoring method. estimator with a linear kernel as a function of parameter ``C`` (use a logarithmic grid of points, from 1 to 10). - .. literalinclude:: ../../auto_examples/exercises/plot_cv_digits.py - :lines: 13-23 + .. literalinclude:: ../../auto_examples/exercises/plot_cv_digits.py + :lines: 13-23 .. 
image:: /auto_examples/exercises/images/sphx_glr_plot_cv_digits_001.png :target: ../../auto_examples/exercises/plot_cv_digits.html diff --git a/doc/tutorial/statistical_inference/putting_together.rst b/doc/tutorial/statistical_inference/putting_together.rst index 033bed2e33884..b28ba77bfac33 100644 --- a/doc/tutorial/statistical_inference/putting_together.rst +++ b/doc/tutorial/statistical_inference/putting_together.rst @@ -25,7 +25,7 @@ Face recognition with eigenfaces The dataset used in this example is a preprocessed excerpt of the "Labeled Faces in the Wild", also known as LFW_: - http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB) +http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB) .. _LFW: http://vis-www.cs.umass.edu/lfw/ diff --git a/doc/tutorial/statistical_inference/supervised_learning.rst b/doc/tutorial/statistical_inference/supervised_learning.rst index d7477b279662d..45fc4cf5b9bc0 100644 --- a/doc/tutorial/statistical_inference/supervised_learning.rst +++ b/doc/tutorial/statistical_inference/supervised_learning.rst @@ -157,10 +157,10 @@ of the model as small as possible. Linear models: :math:`y = X\beta + \epsilon` - * :math:`X`: data - * :math:`y`: target variable - * :math:`\beta`: Coefficients - * :math:`\epsilon`: Observation noise +* :math:`X`: data +* :math:`y`: target variable +* :math:`\beta`: Coefficients +* :math:`\epsilon`: Observation noise .. image:: /auto_examples/linear_model/images/sphx_glr_plot_ols_001.png :target: ../../auto_examples/linear_model/plot_ols.html diff --git a/doc/tutorial/statistical_inference/unsupervised_learning.rst b/doc/tutorial/statistical_inference/unsupervised_learning.rst index e385eccaf592c..fd827cc75b212 100644 --- a/doc/tutorial/statistical_inference/unsupervised_learning.rst +++ b/doc/tutorial/statistical_inference/unsupervised_learning.rst @@ -12,7 +12,8 @@ Clustering: grouping observations together **clustering task**: split the observations into well-separated group called *clusters*. -.. +:: + >>> # Set the PRNG >>> import numpy as np >>> np.random.seed(1) @@ -100,18 +101,18 @@ A :ref:`hierarchical_clustering` method is a type of cluster analysis that aims to build a hierarchy of clusters. In general, the various approaches of this technique are either: - * **Agglomerative** - bottom-up approaches: each observation starts in its - own cluster, and clusters are iteratively merged in such a way to - minimize a *linkage* criterion. This approach is particularly interesting - when the clusters of interest are made of only a few observations. When - the number of clusters is large, it is much more computationally efficient - than k-means. - - * **Divisive** - top-down approaches: all observations start in one - cluster, which is iteratively split as one moves down the hierarchy. - For estimating large numbers of clusters, this approach is both slow (due - to all observations starting as one cluster, which it splits recursively) - and statistically ill-posed. +* **Agglomerative** - bottom-up approaches: each observation starts in its + own cluster, and clusters are iteratively merged in such a way to + minimize a *linkage* criterion. This approach is particularly interesting + when the clusters of interest are made of only a few observations. When + the number of clusters is large, it is much more computationally efficient + than k-means. + +* **Divisive** - top-down approaches: all observations start in one + cluster, which is iteratively split as one moves down the hierarchy. 
+ For estimating large numbers of clusters, this approach is both slow (due + to all observations starting as one cluster, which it splits recursively) + and statistically ill-posed. Connectivity-constrained clustering ..................................... diff --git a/doc/tutorial/text_analytics/working_with_text_data.rst b/doc/tutorial/text_analytics/working_with_text_data.rst index 0880fe3118e4f..43fd305c3b8b6 100644 --- a/doc/tutorial/text_analytics/working_with_text_data.rst +++ b/doc/tutorial/text_analytics/working_with_text_data.rst @@ -10,14 +10,14 @@ documents (newsgroups posts) on twenty different topics. In this section we will see how to: - - load the file contents and the categories +- load the file contents and the categories - - extract feature vectors suitable for machine learning +- extract feature vectors suitable for machine learning - - train a linear model to perform categorization +- train a linear model to perform categorization - - use a grid search strategy to find a good configuration of both - the feature extraction components and the classifier +- use a grid search strategy to find a good configuration of both + the feature extraction components and the classifier Tutorial setup @@ -38,13 +38,13 @@ The source can also be found `on Github The tutorial folder should contain the following sub-folders: - * ``*.rst files`` - the source of the tutorial document written with sphinx +* ``*.rst files`` - the source of the tutorial document written with sphinx - * ``data`` - folder to put the datasets used during the tutorial +* ``data`` - folder to put the datasets used during the tutorial - * ``skeletons`` - sample incomplete scripts for the exercises +* ``skeletons`` - sample incomplete scripts for the exercises - * ``solutions`` - solutions of the exercises +* ``solutions`` - solutions of the exercises You can already copy the skeletons into a new folder somewhere @@ -180,13 +180,13 @@ Bags of words The most intuitive way to do so is to use a bags of words representation: - 1. Assign a fixed integer id to each word occurring in any document - of the training set (for instance by building a dictionary - from words to integer indices). +1. Assign a fixed integer id to each word occurring in any document + of the training set (for instance by building a dictionary + from words to integer indices). - 2. For each document ``#i``, count the number of occurrences of each - word ``w`` and store it in ``X[i, j]`` as the value of feature - ``#j`` where ``j`` is the index of word ``w`` in the dictionary. +2. For each document ``#i``, count the number of occurrences of each + word ``w`` and store it in ``X[i, j]`` as the value of feature + ``#j`` where ``j`` is the index of word ``w`` in the dictionary. The bags of words representation implies that ``n_features`` is the number of distinct words in the corpus: this number is typically diff --git a/doc/whats_new/older_versions.rst b/doc/whats_new/older_versions.rst index 5a1d6a1c7c13f..12ed10a6206f4 100644 --- a/doc/whats_new/older_versions.rst +++ b/doc/whats_new/older_versions.rst @@ -40,14 +40,14 @@ Changelog People ------ - * 14 `Peter Prettenhofer`_ - * 12 `Gael Varoquaux`_ - * 10 `Andreas Müller`_ - * 5 `Lars Buitinck`_ - * 3 :user:`Virgile Fritsch ` - * 1 `Alexandre Gramfort`_ - * 1 `Gilles Louppe`_ - * 1 `Mathieu Blondel`_ +* 14 `Peter Prettenhofer`_ +* 12 `Gael Varoquaux`_ +* 10 `Andreas Müller`_ +* 5 `Lars Buitinck`_ +* 3 :user:`Virgile Fritsch ` +* 1 `Alexandre Gramfort`_ +* 1 `Gilles Louppe`_ +* 1 `Mathieu Blondel`_ .. 
_changes_0_12: @@ -194,53 +194,53 @@ API changes summary People ------ - * 267 `Andreas Müller`_ - * 94 `Gilles Louppe`_ - * 89 `Gael Varoquaux`_ - * 79 `Peter Prettenhofer`_ - * 60 `Mathieu Blondel`_ - * 57 `Alexandre Gramfort`_ - * 52 `Vlad Niculae`_ - * 45 `Lars Buitinck`_ - * 44 Nelle Varoquaux - * 37 `Jaques Grobler`_ - * 30 Alexis Mignon - * 30 Immanuel Bayer - * 27 `Olivier Grisel`_ - * 16 Subhodeep Moitra - * 13 Yannick Schwartz - * 12 :user:`@kernc ` - * 11 :user:`Virgile Fritsch ` - * 9 Daniel Duckworth - * 9 `Fabian Pedregosa`_ - * 9 `Robert Layton`_ - * 8 John Benediktsson - * 7 Marko Burjek - * 5 `Nicolas Pinto`_ - * 4 Alexandre Abraham - * 4 `Jake Vanderplas`_ - * 3 `Brian Holt`_ - * 3 `Edouard Duchesnay`_ - * 3 Florian Hoenig - * 3 flyingimmidev - * 2 Francois Savard - * 2 Hannes Schulz - * 2 Peter Welinder - * 2 `Yaroslav Halchenko`_ - * 2 Wei Li - * 1 Alex Companioni - * 1 Brandyn A. White - * 1 Bussonnier Matthias - * 1 Charles-Pierre Astolfi - * 1 Dan O'Huiginn - * 1 David Cournapeau - * 1 Keith Goodman - * 1 Ludwig Schwardt - * 1 Olivier Hervieu - * 1 Sergio Medina - * 1 Shiqiao Du - * 1 Tim Sheerman-Chase - * 1 buguen +* 267 `Andreas Müller`_ +* 94 `Gilles Louppe`_ +* 89 `Gael Varoquaux`_ +* 79 `Peter Prettenhofer`_ +* 60 `Mathieu Blondel`_ +* 57 `Alexandre Gramfort`_ +* 52 `Vlad Niculae`_ +* 45 `Lars Buitinck`_ +* 44 Nelle Varoquaux +* 37 `Jaques Grobler`_ +* 30 Alexis Mignon +* 30 Immanuel Bayer +* 27 `Olivier Grisel`_ +* 16 Subhodeep Moitra +* 13 Yannick Schwartz +* 12 :user:`@kernc ` +* 11 :user:`Virgile Fritsch ` +* 9 Daniel Duckworth +* 9 `Fabian Pedregosa`_ +* 9 `Robert Layton`_ +* 8 John Benediktsson +* 7 Marko Burjek +* 5 `Nicolas Pinto`_ +* 4 Alexandre Abraham +* 4 `Jake Vanderplas`_ +* 3 `Brian Holt`_ +* 3 `Edouard Duchesnay`_ +* 3 Florian Hoenig +* 3 flyingimmidev +* 2 Francois Savard +* 2 Hannes Schulz +* 2 Peter Welinder +* 2 `Yaroslav Halchenko`_ +* 2 Wei Li +* 1 Alex Companioni +* 1 Brandyn A. 
White +* 1 Bussonnier Matthias +* 1 Charles-Pierre Astolfi +* 1 Dan O'Huiginn +* 1 David Cournapeau +* 1 Keith Goodman +* 1 Ludwig Schwardt +* 1 Olivier Hervieu +* 1 Sergio Medina +* 1 Shiqiao Du +* 1 Tim Sheerman-Chase +* 1 buguen @@ -431,54 +431,55 @@ API changes summary People ------ - * 282 `Andreas Müller`_ - * 239 `Peter Prettenhofer`_ - * 198 `Gael Varoquaux`_ - * 129 `Olivier Grisel`_ - * 114 `Mathieu Blondel`_ - * 103 Clay Woolam - * 96 `Lars Buitinck`_ - * 88 `Jaques Grobler`_ - * 82 `Alexandre Gramfort`_ - * 50 `Bertrand Thirion`_ - * 42 `Robert Layton`_ - * 28 flyingimmidev - * 26 `Jake Vanderplas`_ - * 26 Shiqiao Du - * 21 `Satrajit Ghosh`_ - * 17 `David Marek`_ - * 17 `Gilles Louppe`_ - * 14 `Vlad Niculae`_ - * 11 Yannick Schwartz - * 10 `Fabian Pedregosa`_ - * 9 fcostin - * 7 Nick Wilson - * 5 Adrien Gaidon - * 5 `Nicolas Pinto`_ - * 4 `David Warde-Farley`_ - * 5 Nelle Varoquaux - * 5 Emmanuelle Gouillart - * 3 Joonas Sillanpää - * 3 Paolo Losi - * 2 Charles McCarthy - * 2 Roy Hyunjin Han - * 2 Scott White - * 2 ibayer - * 1 Brandyn White - * 1 Carlos Scheidegger - * 1 Claire Revillet - * 1 Conrad Lee - * 1 `Edouard Duchesnay`_ - * 1 Jan Hendrik Metzen - * 1 Meng Xinfan - * 1 `Rob Zinkov`_ - * 1 Shiqiao - * 1 Udi Weinsberg - * 1 Virgile Fritsch - * 1 Xinfan Meng - * 1 Yaroslav Halchenko - * 1 jansoe - * 1 Leon Palafox + +* 282 `Andreas Müller`_ +* 239 `Peter Prettenhofer`_ +* 198 `Gael Varoquaux`_ +* 129 `Olivier Grisel`_ +* 114 `Mathieu Blondel`_ +* 103 Clay Woolam +* 96 `Lars Buitinck`_ +* 88 `Jaques Grobler`_ +* 82 `Alexandre Gramfort`_ +* 50 `Bertrand Thirion`_ +* 42 `Robert Layton`_ +* 28 flyingimmidev +* 26 `Jake Vanderplas`_ +* 26 Shiqiao Du +* 21 `Satrajit Ghosh`_ +* 17 `David Marek`_ +* 17 `Gilles Louppe`_ +* 14 `Vlad Niculae`_ +* 11 Yannick Schwartz +* 10 `Fabian Pedregosa`_ +* 9 fcostin +* 7 Nick Wilson +* 5 Adrien Gaidon +* 5 `Nicolas Pinto`_ +* 4 `David Warde-Farley`_ +* 5 Nelle Varoquaux +* 5 Emmanuelle Gouillart +* 3 Joonas Sillanpää +* 3 Paolo Losi +* 2 Charles McCarthy +* 2 Roy Hyunjin Han +* 2 Scott White +* 2 ibayer +* 1 Brandyn White +* 1 Carlos Scheidegger +* 1 Claire Revillet +* 1 Conrad Lee +* 1 `Edouard Duchesnay`_ +* 1 Jan Hendrik Metzen +* 1 Meng Xinfan +* 1 `Rob Zinkov`_ +* 1 Shiqiao +* 1 Udi Weinsberg +* 1 Virgile Fritsch +* 1 Xinfan Meng +* 1 Yaroslav Halchenko +* 1 jansoe +* 1 Leon Palafox .. _changes_0_10: @@ -634,37 +635,37 @@ People The following people contributed to scikit-learn since last release: - * 246 `Andreas Müller`_ - * 242 `Olivier Grisel`_ - * 220 `Gilles Louppe`_ - * 183 `Brian Holt`_ - * 166 `Gael Varoquaux`_ - * 144 `Lars Buitinck`_ - * 73 `Vlad Niculae`_ - * 65 `Peter Prettenhofer`_ - * 64 `Fabian Pedregosa`_ - * 60 Robert Layton - * 55 `Mathieu Blondel`_ - * 52 `Jake Vanderplas`_ - * 44 Noel Dawe - * 38 `Alexandre Gramfort`_ - * 24 :user:`Virgile Fritsch ` - * 23 `Satrajit Ghosh`_ - * 3 Jan Hendrik Metzen - * 3 Kenneth C. 
Arnold - * 3 Shiqiao Du - * 3 Tim Sheerman-Chase - * 3 `Yaroslav Halchenko`_ - * 2 Bala Subrahmanyam Varanasi - * 2 DraXus - * 2 Michael Eickenberg - * 1 Bogdan Trach - * 1 Félix-Antoine Fortin - * 1 Juan Manuel Caicedo Carvajal - * 1 Nelle Varoquaux - * 1 `Nicolas Pinto`_ - * 1 Tiziano Zito - * 1 Xinfan Meng +* 246 `Andreas Müller`_ +* 242 `Olivier Grisel`_ +* 220 `Gilles Louppe`_ +* 183 `Brian Holt`_ +* 166 `Gael Varoquaux`_ +* 144 `Lars Buitinck`_ +* 73 `Vlad Niculae`_ +* 65 `Peter Prettenhofer`_ +* 64 `Fabian Pedregosa`_ +* 60 Robert Layton +* 55 `Mathieu Blondel`_ +* 52 `Jake Vanderplas`_ +* 44 Noel Dawe +* 38 `Alexandre Gramfort`_ +* 24 :user:`Virgile Fritsch ` +* 23 `Satrajit Ghosh`_ +* 3 Jan Hendrik Metzen +* 3 Kenneth C. Arnold +* 3 Shiqiao Du +* 3 Tim Sheerman-Chase +* 3 `Yaroslav Halchenko`_ +* 2 Bala Subrahmanyam Varanasi +* 2 DraXus +* 2 Michael Eickenberg +* 1 Bogdan Trach +* 1 Félix-Antoine Fortin +* 1 Juan Manuel Caicedo Carvajal +* 1 Nelle Varoquaux +* 1 `Nicolas Pinto`_ +* 1 Tiziano Zito +* 1 Xinfan Meng @@ -993,20 +994,20 @@ People that made this release possible preceded by number of commits: - 25 `Peter Prettenhofer`_ - 22 `Nicolas Pinto`_ - 11 :user:`Virgile Fritsch ` - - 7 Lars Buitinck - - 6 Vincent Michel - - 5 `Bertrand Thirion`_ - - 4 Thouis (Ray) Jones - - 4 Vincent Schut - - 3 Jan Schlüter - - 2 Julien Miotte - - 2 `Matthieu Perrot`_ - - 2 Yann Malet - - 2 `Yaroslav Halchenko`_ - - 1 Amit Aides - - 1 `Andreas Müller`_ - - 1 Feth Arezki - - 1 Meng Xinfan +- 7 Lars Buitinck +- 6 Vincent Michel +- 5 `Bertrand Thirion`_ +- 4 Thouis (Ray) Jones +- 4 Vincent Schut +- 3 Jan Schlüter +- 2 Julien Miotte +- 2 `Matthieu Perrot`_ +- 2 Yann Malet +- 2 `Yaroslav Halchenko`_ +- 1 Amit Aides +- 1 `Andreas Müller`_ +- 1 Feth Arezki +- 1 Meng Xinfan .. 
_changes_0_7: @@ -1175,31 +1176,31 @@ People People that made this release possible preceded by number of commits: - * 207 `Olivier Grisel`_ +* 207 `Olivier Grisel`_ - * 167 `Fabian Pedregosa`_ +* 167 `Fabian Pedregosa`_ - * 97 `Peter Prettenhofer`_ +* 97 `Peter Prettenhofer`_ - * 68 `Alexandre Gramfort`_ +* 68 `Alexandre Gramfort`_ - * 59 `Mathieu Blondel`_ +* 59 `Mathieu Blondel`_ - * 55 `Gael Varoquaux`_ +* 55 `Gael Varoquaux`_ - * 33 Vincent Dubourg +* 33 Vincent Dubourg - * 21 `Ron Weiss`_ +* 21 `Ron Weiss`_ - * 9 Bertrand Thirion +* 9 Bertrand Thirion - * 3 `Alexandre Passos`_ +* 3 `Alexandre Passos`_ - * 3 Anne-Laure Fouque +* 3 Anne-Laure Fouque - * 2 Ronan Amicel +* 2 Ronan Amicel - * 1 `Christian Osendorfer`_ +* 1 `Christian Osendorfer`_ @@ -1304,20 +1305,20 @@ Authors The following is a list of authors for this release, preceded by number of commits: - * 262 Fabian Pedregosa - * 240 Gael Varoquaux - * 149 Alexandre Gramfort - * 116 Olivier Grisel - * 40 Vincent Michel - * 38 Ron Weiss - * 23 Matthieu Perrot - * 10 Bertrand Thirion - * 7 Yaroslav Halchenko - * 9 VirgileFritsch - * 6 Edouard Duchesnay - * 4 Mathieu Blondel - * 1 Ariel Rokem - * 1 Matthieu Brucher +* 262 Fabian Pedregosa +* 240 Gael Varoquaux +* 149 Alexandre Gramfort +* 116 Olivier Grisel +* 40 Vincent Michel +* 38 Ron Weiss +* 23 Matthieu Perrot +* 10 Bertrand Thirion +* 7 Yaroslav Halchenko +* 9 VirgileFritsch +* 6 Edouard Duchesnay +* 4 Mathieu Blondel +* 1 Ariel Rokem +* 1 Matthieu Brucher Version 0.4 =========== @@ -1368,13 +1369,13 @@ Authors The committer list for this release is the following (preceded by number of commits): - * 143 Fabian Pedregosa - * 35 Alexandre Gramfort - * 34 Olivier Grisel - * 11 Gael Varoquaux - * 5 Yaroslav Halchenko - * 2 Vincent Michel - * 1 Chris Filo Gorgolewski +* 143 Fabian Pedregosa +* 35 Alexandre Gramfort +* 34 Olivier Grisel +* 11 Gael Varoquaux +* 5 Yaroslav Halchenko +* 2 Vincent Michel +* 1 Chris Filo Gorgolewski Earlier versions diff --git a/doc/whats_new/v0.13.rst b/doc/whats_new/v0.13.rst index 00be322bf38fc..6c24d1c52b150 100644 --- a/doc/whats_new/v0.13.rst +++ b/doc/whats_new/v0.13.rst @@ -33,21 +33,22 @@ Changelog People ------ List of contributors for release 0.13.1 by number of commits. - * 16 `Lars Buitinck`_ - * 12 `Andreas Müller`_ - * 8 `Gael Varoquaux`_ - * 5 Robert Marchman - * 3 `Peter Prettenhofer`_ - * 2 Hrishikesh Huilgolkar - * 1 Bastiaan van den Berg - * 1 Diego Molla - * 1 `Gilles Louppe`_ - * 1 `Mathieu Blondel`_ - * 1 `Nelle Varoquaux`_ - * 1 Rafael Cunha de Almeida - * 1 Rolando Espinoza La fuente - * 1 `Vlad Niculae`_ - * 1 `Yaroslav Halchenko`_ + +* 16 `Lars Buitinck`_ +* 12 `Andreas Müller`_ +* 8 `Gael Varoquaux`_ +* 5 Robert Marchman +* 3 `Peter Prettenhofer`_ +* 2 Hrishikesh Huilgolkar +* 1 Bastiaan van den Berg +* 1 Diego Molla +* 1 `Gilles Louppe`_ +* 1 `Mathieu Blondel`_ +* 1 `Nelle Varoquaux`_ +* 1 Rafael Cunha de Almeida +* 1 Rolando Espinoza La fuente +* 1 `Vlad Niculae`_ +* 1 `Yaroslav Halchenko`_ .. _changes_0_13: @@ -323,69 +324,69 @@ People ------ List of contributors for release 0.13 by number of commits. 
- * 364 `Andreas Müller`_ - * 143 `Arnaud Joly`_ - * 137 `Peter Prettenhofer`_ - * 131 `Gael Varoquaux`_ - * 117 `Mathieu Blondel`_ - * 108 `Lars Buitinck`_ - * 106 Wei Li - * 101 `Olivier Grisel`_ - * 65 `Vlad Niculae`_ - * 54 `Gilles Louppe`_ - * 40 `Jaques Grobler`_ - * 38 `Alexandre Gramfort`_ - * 30 `Rob Zinkov`_ - * 19 Aymeric Masurelle - * 18 Andrew Winterman - * 17 `Fabian Pedregosa`_ - * 17 Nelle Varoquaux - * 16 `Christian Osendorfer`_ - * 14 `Daniel Nouri`_ - * 13 :user:`Virgile Fritsch ` - * 13 syhw - * 12 `Satrajit Ghosh`_ - * 10 Corey Lynch - * 10 Kyle Beauchamp - * 9 Brian Cheung - * 9 Immanuel Bayer - * 9 mr.Shu - * 8 Conrad Lee - * 8 `James Bergstra`_ - * 7 Tadej Janež - * 6 Brian Cajes - * 6 `Jake Vanderplas`_ - * 6 Michael - * 6 Noel Dawe - * 6 Tiago Nunes - * 6 cow - * 5 Anze - * 5 Shiqiao Du - * 4 Christian Jauvin - * 4 Jacques Kvam - * 4 Richard T. Guy - * 4 `Robert Layton`_ - * 3 Alexandre Abraham - * 3 Doug Coleman - * 3 Scott Dickerson - * 2 ApproximateIdentity - * 2 John Benediktsson - * 2 Mark Veronda - * 2 Matti Lyra - * 2 Mikhail Korobov - * 2 Xinfan Meng - * 1 Alejandro Weinstein - * 1 `Alexandre Passos`_ - * 1 Christoph Deil - * 1 Eugene Nizhibitsky - * 1 Kenneth C. Arnold - * 1 Luis Pedro Coelho - * 1 Miroslav Batchkarov - * 1 Pavel - * 1 Sebastian Berg - * 1 Shaun Jackman - * 1 Subhodeep Moitra - * 1 bob - * 1 dengemann - * 1 emanuele - * 1 x006 +* 364 `Andreas Müller`_ +* 143 `Arnaud Joly`_ +* 137 `Peter Prettenhofer`_ +* 131 `Gael Varoquaux`_ +* 117 `Mathieu Blondel`_ +* 108 `Lars Buitinck`_ +* 106 Wei Li +* 101 `Olivier Grisel`_ +* 65 `Vlad Niculae`_ +* 54 `Gilles Louppe`_ +* 40 `Jaques Grobler`_ +* 38 `Alexandre Gramfort`_ +* 30 `Rob Zinkov`_ +* 19 Aymeric Masurelle +* 18 Andrew Winterman +* 17 `Fabian Pedregosa`_ +* 17 Nelle Varoquaux +* 16 `Christian Osendorfer`_ +* 14 `Daniel Nouri`_ +* 13 :user:`Virgile Fritsch ` +* 13 syhw +* 12 `Satrajit Ghosh`_ +* 10 Corey Lynch +* 10 Kyle Beauchamp +* 9 Brian Cheung +* 9 Immanuel Bayer +* 9 mr.Shu +* 8 Conrad Lee +* 8 `James Bergstra`_ +* 7 Tadej Janež +* 6 Brian Cajes +* 6 `Jake Vanderplas`_ +* 6 Michael +* 6 Noel Dawe +* 6 Tiago Nunes +* 6 cow +* 5 Anze +* 5 Shiqiao Du +* 4 Christian Jauvin +* 4 Jacques Kvam +* 4 Richard T. Guy +* 4 `Robert Layton`_ +* 3 Alexandre Abraham +* 3 Doug Coleman +* 3 Scott Dickerson +* 2 ApproximateIdentity +* 2 John Benediktsson +* 2 Mark Veronda +* 2 Matti Lyra +* 2 Mikhail Korobov +* 2 Xinfan Meng +* 1 Alejandro Weinstein +* 1 `Alexandre Passos`_ +* 1 Christoph Deil +* 1 Eugene Nizhibitsky +* 1 Kenneth C. Arnold +* 1 Luis Pedro Coelho +* 1 Miroslav Batchkarov +* 1 Pavel +* 1 Sebastian Berg +* 1 Shaun Jackman +* 1 Subhodeep Moitra +* 1 bob +* 1 dengemann +* 1 emanuele +* 1 x006 diff --git a/doc/whats_new/v0.14.rst b/doc/whats_new/v0.14.rst index 4bd04ad180c4e..74ef162e20e5a 100644 --- a/doc/whats_new/v0.14.rst +++ b/doc/whats_new/v0.14.rst @@ -297,91 +297,91 @@ People ------ List of contributors for release 0.14 by number of commits. - * 277 Gilles Louppe - * 245 Lars Buitinck - * 187 Andreas Mueller - * 124 Arnaud Joly - * 112 Jaques Grobler - * 109 Gael Varoquaux - * 107 Olivier Grisel - * 102 Noel Dawe - * 99 Kemal Eren - * 79 Joel Nothman - * 75 Jake VanderPlas - * 73 Nelle Varoquaux - * 71 Vlad Niculae - * 65 Peter Prettenhofer - * 64 Alexandre Gramfort - * 54 Mathieu Blondel - * 38 Nicolas Trésegnie - * 35 eustache - * 27 Denis Engemann - * 25 Yann N. 
Dauphin - * 19 Justin Vincent - * 17 Robert Layton - * 15 Doug Coleman - * 14 Michael Eickenberg - * 13 Robert Marchman - * 11 Fabian Pedregosa - * 11 Philippe Gervais - * 10 Jim Holmström - * 10 Tadej Janež - * 10 syhw - * 9 Mikhail Korobov - * 9 Steven De Gryze - * 8 sergeyf - * 7 Ben Root - * 7 Hrishikesh Huilgolkar - * 6 Kyle Kastner - * 6 Martin Luessi - * 6 Rob Speer - * 5 Federico Vaggi - * 5 Raul Garreta - * 5 Rob Zinkov - * 4 Ken Geis - * 3 A. Flaxman - * 3 Denton Cockburn - * 3 Dougal Sutherland - * 3 Ian Ozsvald - * 3 Johannes Schönberger - * 3 Robert McGibbon - * 3 Roman Sinayev - * 3 Szabo Roland - * 2 Diego Molla - * 2 Imran Haque - * 2 Jochen Wersdörfer - * 2 Sergey Karayev - * 2 Yannick Schwartz - * 2 jamestwebber - * 1 Abhijeet Kolhe - * 1 Alexander Fabisch - * 1 Bastiaan van den Berg - * 1 Benjamin Peterson - * 1 Daniel Velkov - * 1 Fazlul Shahriar - * 1 Felix Brockherde - * 1 Félix-Antoine Fortin - * 1 Harikrishnan S - * 1 Jack Hale - * 1 JakeMick - * 1 James McDermott - * 1 John Benediktsson - * 1 John Zwinck - * 1 Joshua Vredevoogd - * 1 Justin Pati - * 1 Kevin Hughes - * 1 Kyle Kelley - * 1 Matthias Ekman - * 1 Miroslav Shubernetskiy - * 1 Naoki Orii - * 1 Norbert Crombach - * 1 Rafael Cunha de Almeida - * 1 Rolando Espinoza La fuente - * 1 Seamus Abshere - * 1 Sergey Feldman - * 1 Sergio Medina - * 1 Stefano Lattarini - * 1 Steve Koch - * 1 Sturla Molden - * 1 Thomas Jarosch - * 1 Yaroslav Halchenko +* 277 Gilles Louppe +* 245 Lars Buitinck +* 187 Andreas Mueller +* 124 Arnaud Joly +* 112 Jaques Grobler +* 109 Gael Varoquaux +* 107 Olivier Grisel +* 102 Noel Dawe +* 99 Kemal Eren +* 79 Joel Nothman +* 75 Jake VanderPlas +* 73 Nelle Varoquaux +* 71 Vlad Niculae +* 65 Peter Prettenhofer +* 64 Alexandre Gramfort +* 54 Mathieu Blondel +* 38 Nicolas Trésegnie +* 35 eustache +* 27 Denis Engemann +* 25 Yann N. Dauphin +* 19 Justin Vincent +* 17 Robert Layton +* 15 Doug Coleman +* 14 Michael Eickenberg +* 13 Robert Marchman +* 11 Fabian Pedregosa +* 11 Philippe Gervais +* 10 Jim Holmström +* 10 Tadej Janež +* 10 syhw +* 9 Mikhail Korobov +* 9 Steven De Gryze +* 8 sergeyf +* 7 Ben Root +* 7 Hrishikesh Huilgolkar +* 6 Kyle Kastner +* 6 Martin Luessi +* 6 Rob Speer +* 5 Federico Vaggi +* 5 Raul Garreta +* 5 Rob Zinkov +* 4 Ken Geis +* 3 A. Flaxman +* 3 Denton Cockburn +* 3 Dougal Sutherland +* 3 Ian Ozsvald +* 3 Johannes Schönberger +* 3 Robert McGibbon +* 3 Roman Sinayev +* 3 Szabo Roland +* 2 Diego Molla +* 2 Imran Haque +* 2 Jochen Wersdörfer +* 2 Sergey Karayev +* 2 Yannick Schwartz +* 2 jamestwebber +* 1 Abhijeet Kolhe +* 1 Alexander Fabisch +* 1 Bastiaan van den Berg +* 1 Benjamin Peterson +* 1 Daniel Velkov +* 1 Fazlul Shahriar +* 1 Felix Brockherde +* 1 Félix-Antoine Fortin +* 1 Harikrishnan S +* 1 Jack Hale +* 1 JakeMick +* 1 James McDermott +* 1 John Benediktsson +* 1 John Zwinck +* 1 Joshua Vredevoogd +* 1 Justin Pati +* 1 Kevin Hughes +* 1 Kyle Kelley +* 1 Matthias Ekman +* 1 Miroslav Shubernetskiy +* 1 Naoki Orii +* 1 Norbert Crombach +* 1 Rafael Cunha de Almeida +* 1 Rolando Espinoza La fuente +* 1 Seamus Abshere +* 1 Sergey Feldman +* 1 Sergio Medina +* 1 Stefano Lattarini +* 1 Steve Koch +* 1 Sturla Molden +* 1 Thomas Jarosch +* 1 Yaroslav Halchenko diff --git a/doc/whats_new/v0.20.rst b/doc/whats_new/v0.20.rst index 55c3aa5ef59e2..b295205bbbe57 100644 --- a/doc/whats_new/v0.20.rst +++ b/doc/whats_new/v0.20.rst @@ -53,7 +53,7 @@ The bundled version of joblib was upgraded from 0.13.0 to 0.13.2. restored from a pickle if ``sample_weight`` had been used. 
:issue:`13772` by :user:`Aditya Vyas `. - .. _changes_0_20_3: +.. _changes_0_20_3: Version 0.20.3 ============== diff --git a/sklearn/datasets/descr/breast_cancer.rst b/sklearn/datasets/descr/breast_cancer.rst index a532ef960737f..ceabd33e14ddc 100644 --- a/sklearn/datasets/descr/breast_cancer.rst +++ b/sklearn/datasets/descr/breast_cancer.rst @@ -5,77 +5,77 @@ Breast cancer wisconsin (diagnostic) dataset **Data Set Characteristics:** - :Number of Instances: 569 - - :Number of Attributes: 30 numeric, predictive attributes and the class - - :Attribute Information: - - radius (mean of distances from center to points on the perimeter) - - texture (standard deviation of gray-scale values) - - perimeter - - area - - smoothness (local variation in radius lengths) - - compactness (perimeter^2 / area - 1.0) - - concavity (severity of concave portions of the contour) - - concave points (number of concave portions of the contour) - - symmetry - - fractal dimension ("coastline approximation" - 1) - - The mean, standard error, and "worst" or largest (mean of the three - worst/largest values) of these features were computed for each image, - resulting in 30 features. For instance, field 0 is Mean Radius, field - 10 is Radius SE, field 20 is Worst Radius. - - - class: - - WDBC-Malignant - - WDBC-Benign - - :Summary Statistics: - - ===================================== ====== ====== - Min Max - ===================================== ====== ====== - radius (mean): 6.981 28.11 - texture (mean): 9.71 39.28 - perimeter (mean): 43.79 188.5 - area (mean): 143.5 2501.0 - smoothness (mean): 0.053 0.163 - compactness (mean): 0.019 0.345 - concavity (mean): 0.0 0.427 - concave points (mean): 0.0 0.201 - symmetry (mean): 0.106 0.304 - fractal dimension (mean): 0.05 0.097 - radius (standard error): 0.112 2.873 - texture (standard error): 0.36 4.885 - perimeter (standard error): 0.757 21.98 - area (standard error): 6.802 542.2 - smoothness (standard error): 0.002 0.031 - compactness (standard error): 0.002 0.135 - concavity (standard error): 0.0 0.396 - concave points (standard error): 0.0 0.053 - symmetry (standard error): 0.008 0.079 - fractal dimension (standard error): 0.001 0.03 - radius (worst): 7.93 36.04 - texture (worst): 12.02 49.54 - perimeter (worst): 50.41 251.2 - area (worst): 185.2 4254.0 - smoothness (worst): 0.071 0.223 - compactness (worst): 0.027 1.058 - concavity (worst): 0.0 1.252 - concave points (worst): 0.0 0.291 - symmetry (worst): 0.156 0.664 - fractal dimension (worst): 0.055 0.208 - ===================================== ====== ====== - - :Missing Attribute Values: None - - :Class Distribution: 212 - Malignant, 357 - Benign - - :Creator: Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian - - :Donor: Nick Street - - :Date: November, 1995 +:Number of Instances: 569 + +:Number of Attributes: 30 numeric, predictive attributes and the class + +:Attribute Information: + - radius (mean of distances from center to points on the perimeter) + - texture (standard deviation of gray-scale values) + - perimeter + - area + - smoothness (local variation in radius lengths) + - compactness (perimeter^2 / area - 1.0) + - concavity (severity of concave portions of the contour) + - concave points (number of concave portions of the contour) + - symmetry + - fractal dimension ("coastline approximation" - 1) + + The mean, standard error, and "worst" or largest (mean of the three + worst/largest values) of these features were computed for each image, + resulting in 30 features. 
For instance, field 0 is Mean Radius, field + 10 is Radius SE, field 20 is Worst Radius. + + - class: + - WDBC-Malignant + - WDBC-Benign + +:Summary Statistics: + +===================================== ====== ====== + Min Max +===================================== ====== ====== +radius (mean): 6.981 28.11 +texture (mean): 9.71 39.28 +perimeter (mean): 43.79 188.5 +area (mean): 143.5 2501.0 +smoothness (mean): 0.053 0.163 +compactness (mean): 0.019 0.345 +concavity (mean): 0.0 0.427 +concave points (mean): 0.0 0.201 +symmetry (mean): 0.106 0.304 +fractal dimension (mean): 0.05 0.097 +radius (standard error): 0.112 2.873 +texture (standard error): 0.36 4.885 +perimeter (standard error): 0.757 21.98 +area (standard error): 6.802 542.2 +smoothness (standard error): 0.002 0.031 +compactness (standard error): 0.002 0.135 +concavity (standard error): 0.0 0.396 +concave points (standard error): 0.0 0.053 +symmetry (standard error): 0.008 0.079 +fractal dimension (standard error): 0.001 0.03 +radius (worst): 7.93 36.04 +texture (worst): 12.02 49.54 +perimeter (worst): 50.41 251.2 +area (worst): 185.2 4254.0 +smoothness (worst): 0.071 0.223 +compactness (worst): 0.027 1.058 +concavity (worst): 0.0 1.252 +concave points (worst): 0.0 0.291 +symmetry (worst): 0.156 0.664 +fractal dimension (worst): 0.055 0.208 +===================================== ====== ====== + +:Missing Attribute Values: None + +:Class Distribution: 212 - Malignant, 357 - Benign + +:Creator: Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian + +:Donor: Nick Street + +:Date: November, 1995 This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets. https://goo.gl/U2Uwz2 @@ -108,15 +108,15 @@ cd math-prog/cpo-dataset/machine-learn/WDBC/ **References** |details-split| -- W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction - for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on +- W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction + for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905, pages 861-870, San Jose, CA, 1993. -- O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and - prognosis via linear programming. Operations Research, 43(4), pages 570-577, +- O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and + prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August 1995. - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques - to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) + to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 163-171. 
-|details-end|
\ No newline at end of file
+|details-end|
diff --git a/sklearn/datasets/descr/california_housing.rst b/sklearn/datasets/descr/california_housing.rst
index f5756533b2769..33ff111fef541 100644
--- a/sklearn/datasets/descr/california_housing.rst
+++ b/sklearn/datasets/descr/california_housing.rst
@@ -5,21 +5,21 @@ California Housing dataset
 
 **Data Set Characteristics:**
 
-    :Number of Instances: 20640
+:Number of Instances: 20640
 
-    :Number of Attributes: 8 numeric, predictive attributes and the target
+:Number of Attributes: 8 numeric, predictive attributes and the target
 
-    :Attribute Information:
-        - MedInc        median income in block group
-        - HouseAge      median house age in block group
-        - AveRooms      average number of rooms per household
-        - AveBedrms     average number of bedrooms per household
-        - Population    block group population
-        - AveOccup      average number of household members
-        - Latitude      block group latitude
-        - Longitude     block group longitude
+:Attribute Information:
+    - MedInc        median income in block group
+    - HouseAge      median house age in block group
+    - AveRooms      average number of rooms per household
+    - AveBedrms     average number of bedrooms per household
+    - Population    block group population
+    - AveOccup      average number of household members
+    - Latitude      block group latitude
+    - Longitude     block group longitude
 
-    :Missing Attribute Values: None
+:Missing Attribute Values: None
 
 This dataset was obtained from the StatLib repository.
 https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
diff --git a/sklearn/datasets/descr/covtype.rst b/sklearn/datasets/descr/covtype.rst
index 0090b8e4a6b7d..f4b752ade17a7 100644
--- a/sklearn/datasets/descr/covtype.rst
+++ b/sklearn/datasets/descr/covtype.rst
@@ -14,12 +14,12 @@ while others are discrete or continuous measurements.
 
 **Data Set Characteristics:**
 
-    ================= ============
-    Classes           7
-    Samples total     581012
-    Dimensionality    54
-    Features          int
-    ================= ============
+================= ============
+Classes           7
+Samples total     581012
+Dimensionality    54
+Features          int
+================= ============
 
 :func:`sklearn.datasets.fetch_covtype` will load the covertype dataset;
 it returns a dictionary-like 'Bunch' object
diff --git a/sklearn/datasets/descr/diabetes.rst b/sklearn/datasets/descr/diabetes.rst
index 173d9561bf511..b977c36cf29a0 100644
--- a/sklearn/datasets/descr/diabetes.rst
+++ b/sklearn/datasets/descr/diabetes.rst
@@ -10,23 +10,23 @@ quantitative measure of disease progression one year after baseline.
 
 **Data Set Characteristics:**
 
-    :Number of Instances: 442
-
-    :Number of Attributes: First 10 columns are numeric predictive values
-
-    :Target: Column 11 is a quantitative measure of disease progression one year after baseline
-
-    :Attribute Information:
-        - age     age in years
-        - sex
-        - bmi     body mass index
-        - bp      average blood pressure
-        - s1      tc, total serum cholesterol
-        - s2      ldl, low-density lipoproteins
-        - s3      hdl, high-density lipoproteins
-        - s4      tch, total cholesterol / HDL
-        - s5      ltg, possibly log of serum triglycerides level
-        - s6      glu, blood sugar level
+:Number of Instances: 442
+
+:Number of Attributes: First 10 columns are numeric predictive values
+
+:Target: Column 11 is a quantitative measure of disease progression one year after baseline
+
+:Attribute Information:
+    - age     age in years
+    - sex
+    - bmi     body mass index
+    - bp      average blood pressure
+    - s1      tc, total serum cholesterol
+    - s2      ldl, low-density lipoproteins
+    - s3      hdl, high-density lipoproteins
+    - s4      tch, total cholesterol / HDL
+    - s5      ltg, possibly log of serum triglycerides level
+    - s6      glu, blood sugar level
 
 Note: Each of these 10 feature variables have been mean centered and scaled by the
 standard deviation times the square root of `n_samples` (i.e. the sum of
 squares of each column totals 1).
diff --git a/sklearn/datasets/descr/digits.rst b/sklearn/datasets/descr/digits.rst
index 40d819e92b7ab..3b07233721d69 100644
--- a/sklearn/datasets/descr/digits.rst
+++ b/sklearn/datasets/descr/digits.rst
@@ -5,12 +5,12 @@ Optical recognition of handwritten digits dataset
 
 **Data Set Characteristics:**
 
-    :Number of Instances: 1797
-    :Number of Attributes: 64
-    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
-    :Missing Attribute Values: None
-    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
-    :Date: July; 1998
+:Number of Instances: 1797
+:Number of Attributes: 64
+:Attribute Information: 8x8 image of integer pixels in the range 0..16.
+:Missing Attribute Values: None
+:Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
+:Date: July; 1998
 
 This is a copy of the test set of the UCI ML hand-written digits datasets
 https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
@@ -47,4 +47,4 @@ L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
 - Claudio Gentile. A New Approximate Maximal Margin Classification
   Algorithm. NIPS. 2000.
 
-|details-end|
\ No newline at end of file
+|details-end|
diff --git a/sklearn/datasets/descr/iris.rst b/sklearn/datasets/descr/iris.rst
index 02236dcb1c19f..771c92faa9899 100644
--- a/sklearn/datasets/descr/iris.rst
+++ b/sklearn/datasets/descr/iris.rst
@@ -5,34 +5,34 @@ Iris plants dataset
 
 **Data Set Characteristics:**
 
-    :Number of Instances: 150 (50 in each of three classes)
-    :Number of Attributes: 4 numeric, predictive attributes and the class
-    :Attribute Information:
-        - sepal length in cm
-        - sepal width in cm
-        - petal length in cm
-        - petal width in cm
-        - class:
-                - Iris-Setosa
-                - Iris-Versicolour
-                - Iris-Virginica
-
-    :Summary Statistics:
+:Number of Instances: 150 (50 in each of three classes)
+:Number of Attributes: 4 numeric, predictive attributes and the class
+:Attribute Information:
+    - sepal length in cm
+    - sepal width in cm
+    - petal length in cm
+    - petal width in cm
+    - class:
+        - Iris-Setosa
+        - Iris-Versicolour
+        - Iris-Virginica
 
-    ============== ==== ==== ======= ===== ====================
-                    Min  Max   Mean    SD   Class Correlation
-    ============== ==== ==== ======= ===== ====================
-    sepal length:   4.3  7.9   5.84   0.83    0.7826
-    sepal width:    2.0  4.4   3.05   0.43   -0.4194
-    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
-    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
-    ============== ==== ==== ======= ===== ====================
+:Summary Statistics:
 
-    :Missing Attribute Values: None
-    :Class Distribution: 33.3% for each of 3 classes.
-    :Creator: R.A. Fisher
-    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
-    :Date: July, 1988
+============== ==== ==== ======= ===== ====================
+                Min  Max   Mean    SD   Class Correlation
+============== ==== ==== ======= ===== ====================
+sepal length:   4.3  7.9   5.84   0.83    0.7826
+sepal width:    2.0  4.4   3.05   0.43   -0.4194
+petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
+petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
+============== ==== ==== ======= ===== ====================
+
+:Missing Attribute Values: None
+:Class Distribution: 33.3% for each of 3 classes.
+:Creator: R.A. Fisher
+:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
+:Date: July, 1988
 
 The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
 from Fisher's paper. Note that it's the same as in R, but not as in the UCI
@@ -64,4 +64,4 @@ latter are NOT linearly separable from each other.
    conceptual clustering system finds 3 classes in the data.
 - Many, many more ...
 
-|details-end|
\ No newline at end of file
+|details-end|
diff --git a/sklearn/datasets/descr/kddcup99.rst b/sklearn/datasets/descr/kddcup99.rst
index d53a7c878dd17..fe8a0c8f4168c 100644
--- a/sklearn/datasets/descr/kddcup99.rst
+++ b/sklearn/datasets/descr/kddcup99.rst
@@ -30,50 +30,50 @@ We thus transform the KDD Data set into two different data sets: SA and SF.
 * http and smtp are two subsets of SF corresponding with third feature
   equal to 'http' (resp. to 'smtp').
 
-General KDD structure :
-
-    ================ ==========================================
-    Samples total    4898431
-    Dimensionality   41
-    Features         discrete (int) or continuous (float)
-    Targets          str, 'normal.' or name of the anomaly type
-    ================ ==========================================
-
-    SA structure :
-
-    ================ ==========================================
-    Samples total    976158
-    Dimensionality   41
-    Features         discrete (int) or continuous (float)
-    Targets          str, 'normal.' or name of the anomaly type
-    ================ ==========================================
-
-    SF structure :
-
-    ================ ==========================================
-    Samples total    699691
-    Dimensionality   4
-    Features         discrete (int) or continuous (float)
-    Targets          str, 'normal.' or name of the anomaly type
-    ================ ==========================================
-
-    http structure :
-
-    ================ ==========================================
-    Samples total    619052
-    Dimensionality   3
-    Features         discrete (int) or continuous (float)
-    Targets          str, 'normal.' or name of the anomaly type
-    ================ ==========================================
-
-    smtp structure :
-
-    ================ ==========================================
-    Samples total    95373
-    Dimensionality   3
-    Features         discrete (int) or continuous (float)
-    Targets          str, 'normal.' or name of the anomaly type
-    ================ ==========================================
+General KDD structure:
+
+================ ==========================================
+Samples total    4898431
+Dimensionality   41
+Features         discrete (int) or continuous (float)
+Targets          str, 'normal.' or name of the anomaly type
+================ ==========================================
+
+SA structure:
+
+================ ==========================================
+Samples total    976158
+Dimensionality   41
+Features         discrete (int) or continuous (float)
+Targets          str, 'normal.' or name of the anomaly type
+================ ==========================================
+
+SF structure:
+
+================ ==========================================
+Samples total    699691
+Dimensionality   4
+Features         discrete (int) or continuous (float)
+Targets          str, 'normal.' or name of the anomaly type
+================ ==========================================
+
+http structure:
+
+================ ==========================================
+Samples total    619052
+Dimensionality   3
+Features         discrete (int) or continuous (float)
+Targets          str, 'normal.' or name of the anomaly type
+================ ==========================================
+
+smtp structure:
+
+================ ==========================================
+Samples total    95373
+Dimensionality   3
+Features         discrete (int) or continuous (float)
+Targets          str, 'normal.' or name of the anomaly type
+================ ==========================================
 
 :func:`sklearn.datasets.fetch_kddcup99` will load the kddcup99 dataset; it
 returns a dictionary-like object with the feature matrix in the ``data`` member
diff --git a/sklearn/datasets/descr/lfw.rst b/sklearn/datasets/descr/lfw.rst
index 8105d7d6d633a..f7d80558be373 100644
--- a/sklearn/datasets/descr/lfw.rst
+++ b/sklearn/datasets/descr/lfw.rst
@@ -6,7 +6,7 @@ The Labeled Faces in the Wild face recognition dataset
 This dataset is a collection of JPEG pictures of famous people collected
 over the internet, all details are available on the official website:
 
-    http://vis-www.cs.umass.edu/lfw/
+http://vis-www.cs.umass.edu/lfw/
 
 Each picture is centered on a single face. The typical task is called
 Face Verification: given a pair of two pictures, a binary classifier
@@ -25,12 +25,12 @@ face detector from various online websites.
**Data Set Characteristics:** - ================= ======================= - Classes 5749 - Samples total 13233 - Dimensionality 5828 - Features real, between 0 and 255 - ================= ======================= +================= ======================= +Classes 5749 +Samples total 13233 +Dimensionality 5828 +Features real, between 0 and 255 +================= ======================= |details-start| **Usage** diff --git a/sklearn/datasets/descr/linnerud.rst b/sklearn/datasets/descr/linnerud.rst index 81c970bb6e3e6..108611a4722ad 100644 --- a/sklearn/datasets/descr/linnerud.rst +++ b/sklearn/datasets/descr/linnerud.rst @@ -5,9 +5,9 @@ Linnerrud dataset **Data Set Characteristics:** - :Number of Instances: 20 - :Number of Attributes: 3 - :Missing Attribute Values: None +:Number of Instances: 20 +:Number of Attributes: 3 +:Missing Attribute Values: None The Linnerud dataset is a multi-output regression dataset. It consists of three exercise (data) and three physiological (target) variables collected from @@ -25,4 +25,4 @@ twenty middle-aged men in a fitness club: * Tenenhaus, M. (1998). La regression PLS: theorie et pratique. Paris: Editions Technic. -|details-end| \ No newline at end of file +|details-end| diff --git a/sklearn/datasets/descr/olivetti_faces.rst b/sklearn/datasets/descr/olivetti_faces.rst index 4feadcc4b2fb1..060c866213e8e 100644 --- a/sklearn/datasets/descr/olivetti_faces.rst +++ b/sklearn/datasets/descr/olivetti_faces.rst @@ -3,7 +3,7 @@ The Olivetti faces dataset -------------------------- -`This dataset contains a set of face images`_ taken between April 1992 and +`This dataset contains a set of face images`_ taken between April 1992 and April 1994 at AT&T Laboratories Cambridge. The :func:`sklearn.datasets.fetch_olivetti_faces` function is the data fetching / caching function that downloads the data @@ -17,20 +17,20 @@ As described on the original website: subjects, the images were taken at different times, varying the lighting, facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All the images were taken against a dark - homogeneous background with the subjects in an upright, frontal position + homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement). **Data Set Characteristics:** - ================= ===================== - Classes 40 - Samples total 400 - Dimensionality 4096 - Features real, between 0 and 1 - ================= ===================== +================= ===================== +Classes 40 +Samples total 400 +Dimensionality 4096 +Features real, between 0 and 1 +================= ===================== -The image is quantized to 256 grey levels and stored as unsigned 8-bit -integers; the loader will convert these to floating point values on the +The image is quantized to 256 grey levels and stored as unsigned 8-bit +integers; the loader will convert these to floating point values on the interval [0, 1], which are easier to work with for many algorithms. The "target" for this database is an integer from 0 to 39 indicating the diff --git a/sklearn/datasets/descr/rcv1.rst b/sklearn/datasets/descr/rcv1.rst index afaadbfb45afc..7cf3730a17554 100644 --- a/sklearn/datasets/descr/rcv1.rst +++ b/sklearn/datasets/descr/rcv1.rst @@ -3,20 +3,20 @@ RCV1 dataset ------------ -Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually -categorized newswire stories made available by Reuters, Ltd. 
for research +Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually +categorized newswire stories made available by Reuters, Ltd. for research purposes. The dataset is extensively described in [1]_. **Data Set Characteristics:** - ============== ===================== - Classes 103 - Samples total 804414 - Dimensionality 47236 - Features real, between 0 and 1 - ============== ===================== +============== ===================== +Classes 103 +Samples total 804414 +Dimensionality 47236 +Features real, between 0 and 1 +============== ===================== -:func:`sklearn.datasets.fetch_rcv1` will load the following +:func:`sklearn.datasets.fetch_rcv1` will load the following version: RCV1-v2, vectors, full sets, topics multilabels:: >>> from sklearn.datasets import fetch_rcv1 @@ -28,32 +28,32 @@ It returns a dictionary-like object, with the following attributes: The feature matrix is a scipy CSR sparse matrix, with 804414 samples and 47236 features. Non-zero values contains cosine-normalized, log TF-IDF vectors. A nearly chronological split is proposed in [1]_: The first 23149 samples are -the training set. The last 781265 samples are the testing set. This follows -the official LYRL2004 chronological split. The array has 0.16% of non zero +the training set. The last 781265 samples are the testing set. This follows +the official LYRL2004 chronological split. The array has 0.16% of non zero values:: >>> rcv1.data.shape (804414, 47236) ``target``: -The target values are stored in a scipy CSR sparse matrix, with 804414 samples -and 103 categories. Each sample has a value of 1 in its categories, and 0 in +The target values are stored in a scipy CSR sparse matrix, with 804414 samples +and 103 categories. Each sample has a value of 1 in its categories, and 0 in others. The array has 3.15% of non zero values:: >>> rcv1.target.shape (804414, 103) ``sample_id``: -Each sample can be identified by its ID, ranging (with gaps) from 2286 +Each sample can be identified by its ID, ranging (with gaps) from 2286 to 810596:: >>> rcv1.sample_id[:3] array([2286, 2287, 2288], dtype=uint32) ``target_names``: -The target values are the topics of each sample. Each sample belongs to at -least one topic, and to up to 17 topics. There are 103 topics, each -represented by a string. Their corpus frequencies span five orders of +The target values are the topics of each sample. Each sample belongs to at +least one topic, and to up to 17 topics. There are 103 topics, each +represented by a string. Their corpus frequencies span five orders of magnitude, from 5 occurrences for 'GMIL', to 381327 for 'CCAT':: >>> rcv1.target_names[:3].tolist() # doctest: +SKIP @@ -67,6 +67,6 @@ The compressed size is about 656 MB. .. topic:: References - .. [1] Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004). - RCV1: A new benchmark collection for text categorization research. + .. [1] Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004). + RCV1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5, 361-397. diff --git a/sklearn/datasets/descr/twenty_newsgroups.rst b/sklearn/datasets/descr/twenty_newsgroups.rst index 669e158244134..d1a049869dd7f 100644 --- a/sklearn/datasets/descr/twenty_newsgroups.rst +++ b/sklearn/datasets/descr/twenty_newsgroups.rst @@ -20,12 +20,12 @@ extractor. 
 **Data Set Characteristics:**
 
-    ================= ==========
-    Classes           20
-    Samples total     18846
-    Dimensionality    1
-    Features          text
-    ================= ==========
+================= ==========
+Classes           20
+Samples total     18846
+Dimensionality    1
+Features          text
+================= ==========
 
 |details-start|
 **Usage**
diff --git a/sklearn/datasets/descr/wine_data.rst b/sklearn/datasets/descr/wine_data.rst
index e20efea9ba719..0325af6233c17 100644
--- a/sklearn/datasets/descr/wine_data.rst
+++ b/sklearn/datasets/descr/wine_data.rst
@@ -5,53 +5,52 @@ Wine recognition dataset
 
 **Data Set Characteristics:**
 
-    :Number of Instances: 178
-    :Number of Attributes: 13 numeric, predictive attributes and the class
-    :Attribute Information:
-        - Alcohol
-        - Malic acid
-        - Ash
-        - Alcalinity of ash
-        - Magnesium
-        - Total phenols
-        - Flavanoids
-        - Nonflavanoid phenols
-        - Proanthocyanins
-        - Color intensity
-        - Hue
-        - OD280/OD315 of diluted wines
-        - Proline
-
+:Number of Instances: 178
+:Number of Attributes: 13 numeric, predictive attributes and the class
+:Attribute Information:
+    - Alcohol
+    - Malic acid
+    - Ash
+    - Alcalinity of ash
+    - Magnesium
+    - Total phenols
+    - Flavanoids
+    - Nonflavanoid phenols
+    - Proanthocyanins
+    - Color intensity
+    - Hue
+    - OD280/OD315 of diluted wines
+    - Proline
 - class:
-            - class_0
-            - class_1
-            - class_2
-
-    :Summary Statistics:
-
-    ============================= ==== ===== ======= =====
-                                    Min   Max   Mean     SD
-    ============================= ==== ===== ======= =====
-    Alcohol:                      11.0  14.8    13.0   0.8
-    Malic Acid:                   0.74  5.80    2.34  1.12
-    Ash:                          1.36  3.23    2.36  0.27
-    Alcalinity of Ash:            10.6  30.0    19.5   3.3
-    Magnesium:                    70.0 162.0    99.7  14.3
-    Total Phenols:                0.98  3.88    2.29  0.63
-    Flavanoids:                   0.34  5.08    2.03  1.00
-    Nonflavanoid Phenols:         0.13  0.66    0.36  0.12
-    Proanthocyanins:              0.41  3.58    1.59  0.57
-    Colour Intensity:              1.3  13.0     5.1   2.3
-    Hue:                          0.48  1.71    0.96  0.23
-    OD280/OD315 of diluted wines: 1.27  4.00    2.61  0.71
-    Proline:                       278  1680     746   315
-    ============================= ==== ===== ======= =====
-
-    :Missing Attribute Values: None
-    :Class Distribution: class_0 (59), class_1 (71), class_2 (48)
-    :Creator: R.A. Fisher
-    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
-    :Date: July, 1988
+    - class_0
+    - class_1
+    - class_2
+
+:Summary Statistics:
+
+============================= ==== ===== ======= =====
+                                Min   Max   Mean     SD
+============================= ==== ===== ======= =====
+Alcohol:                      11.0  14.8    13.0   0.8
+Malic Acid:                   0.74  5.80    2.34  1.12
+Ash:                          1.36  3.23    2.36  0.27
+Alcalinity of Ash:            10.6  30.0    19.5   3.3
+Magnesium:                    70.0 162.0    99.7  14.3
+Total Phenols:                0.98  3.88    2.29  0.63
+Flavanoids:                   0.34  5.08    2.03  1.00
+Nonflavanoid Phenols:         0.13  0.66    0.36  0.12
+Proanthocyanins:              0.41  3.58    1.59  0.57
+Colour Intensity:              1.3  13.0     5.1   2.3
+Hue:                          0.48  1.71    0.96  0.23
+OD280/OD315 of diluted wines: 1.27  4.00    2.61  0.71
+Proline:                       278  1680     746   315
+============================= ==== ===== ======= =====
+
+:Missing Attribute Values: None
+:Class Distribution: class_0 (59), class_1 (71), class_2 (48)
+:Creator: R.A. Fisher
+:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
+:Date: July, 1988
 
 This is a copy of UCI ML Wine recognition datasets.
 https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data
@@ -61,10 +60,10 @@ region in Italy by three different cultivators. There are thirteen different
 measurements taken for different constituents found in the three types
 of wine.
 
-Original Owners: 
+Original Owners:
 
-Forina, M. et al, PARVUS - 
-An Extendible Package for Data Exploration, Classification and Correlation. 
+Forina, M. et al, PARVUS -
+An Extendible Package for Data Exploration, Classification and Correlation.
 Institute of Pharmaceutical and Food Analysis and Technologies,
 Via Brigata Salerno, 16147 Genoa, Italy.
 
@@ -72,28 +71,28 @@ Citation:
 
 Lichman, M. (2013). UCI Machine Learning Repository
 [https://archive.ics.uci.edu/ml]. Irvine, CA: University of California,
-School of Information and Computer Science. 
+School of Information and Computer Science.
 
 |details-start|
 **References**
 |details-split|
 
-(1) S. Aeberhard, D. Coomans and O. de Vel, 
-Comparison of Classifiers in High Dimensional Settings, 
-Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of 
-Mathematics and Statistics, James Cook University of North Queensland. 
-(Also submitted to Technometrics). 
-
-The data was used with many others for comparing various 
-classifiers. The classes are separable, though only RDA 
-has achieved 100% correct classification. 
-(RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data)) 
-(All results using the leave-one-out technique) 
-
-(2) S. Aeberhard, D. Coomans and O. de Vel, 
-"THE CLASSIFICATION PERFORMANCE OF RDA" 
-Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of 
-Mathematics and Statistics, James Cook University of North Queensland. 
+(1) S. Aeberhard, D. Coomans and O. de Vel,
+Comparison of Classifiers in High Dimensional Settings,
+Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of
+Mathematics and Statistics, James Cook University of North Queensland.
+(Also submitted to Technometrics).
+
+The data was used with many others for comparing various
+classifiers. The classes are separable, though only RDA
+has achieved 100% correct classification.
+(RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data))
+(All results using the leave-one-out technique)
+
+(2) S. Aeberhard, D. Coomans and O. de Vel,
+"THE CLASSIFICATION PERFORMANCE OF RDA"
+Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of
+Mathematics and Statistics, James Cook University of North Queensland.
 (Also submitted to Journal of Chemometrics).
 
-|details-end|
\ No newline at end of file
+|details-end|
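
The description files reformatted above are the ``DESCR`` text that the corresponding
loaders in :mod:`sklearn.datasets` attach to the ``Bunch`` objects they return. As a
minimal sketch (illustrative only, not part of the patch itself), the figures quoted in
the "Data Set Characteristics" sections can be checked against that loader API, here
for the wine and breast cancer datasets::

    >>> from sklearn.datasets import load_wine, load_breast_cancer
    >>> wine = load_wine()
    >>> wine.data.shape            # 178 instances, 13 numeric attributes
    (178, 13)
    >>> wine.target_names.tolist()
    ['class_0', 'class_1', 'class_2']
    >>> cancer = load_breast_cancer()
    >>> cancer.data.shape          # 569 instances, 30 numeric attributes
    (569, 30)
    >>> cancer.target_names.tolist()
    ['malignant', 'benign']

Each loader exposes the reStructuredText edited in this diff through the ``DESCR``
attribute of the returned ``Bunch``, so the cleaned-up tables above are exactly what
users see when they print, for example, ``load_wine().DESCR``.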