From 5a98b7cf1cd9ea7546cde07f6f1b0afbd604f35c Mon Sep 17 00:00:00 2001
From: Joel Nothman <joel.nothman@gmail.com>
Date: Tue, 8 Aug 2017 23:00:57 +1000
Subject: [PATCH] DOC a note on data leakage and pipeline

---
 doc/modules/pipeline.rst | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/doc/modules/pipeline.rst b/doc/modules/pipeline.rst
index 4356b3fe8d640..232b3ed72bbda 100644
--- a/doc/modules/pipeline.rst
+++ b/doc/modules/pipeline.rst
@@ -16,11 +16,16 @@ into one. This is useful as there is often a fixed sequence
 of steps in processing the data, for example feature selection, normalization
 and classification. :class:`Pipeline` serves two purposes here:
 
-    **Convenience**: You only have to call ``fit`` and ``predict`` once on your
+Convenience and encapsulation
+    You only have to call ``fit`` and ``predict`` once on your
     data to fit a whole sequence of estimators.
-
-    **Joint parameter selection**: You can :ref:`grid search <grid_search>`
+Joint parameter selection
+    You can :ref:`grid search <grid_search>`
     over parameters of all estimators in the pipeline at once.
+Safety
+    Pipelines help avoid leaking statistics from your test data into the
+    trained model in cross-validation, by ensuring that transformers and
+    predictors are fit on the same training samples only.
 
 All estimators in a pipeline, except the last one, must be transformers
 (i.e. must have a ``transform`` method).
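
The leakage that the Safety entry describes can be sketched as follows. This
is a minimal illustration and not part of the patch: the synthetic noise data,
the choice of ``SelectKBest`` and ``LogisticRegression``, and all variable
names are assumptions made for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (an assumption for illustration): 2000 pure-noise features,
# so no honest model should beat chance (about 0.5 accuracy) by much.
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 2000))
y = np.array([0, 1] * 25)  # balanced labels, unrelated to X

# Leaky: the selector is fit on every sample, so the labels of the future
# test folds influence which features are kept.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(LogisticRegression(), X_selected, y, cv=5).mean()

# Safe: the pipeline refits the selector on the training part of each fold,
# so test-fold samples never reach the transformer or the classifier.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression()),
])
safe_score = cross_val_score(pipe, X, y, cv=5).mean()

print("leaky CV accuracy: %.2f" % leaky_score)
print("pipeline CV accuracy: %.2f" % safe_score)
```

On noise data like this, the leaky estimate is typically well above chance
while the pipeline estimate stays close to it. Exact numbers depend on the
random seed and the library version.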