From 5a98b7cf1cd9ea7546cde07f6f1b0afbd604f35c Mon Sep 17 00:00:00 2001
From: Joel Nothman
Date: Tue, 8 Aug 2017 23:00:57 +1000
Subject: [PATCH] DOC a note on data leakage and pipeline

---
 doc/modules/pipeline.rst | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/doc/modules/pipeline.rst b/doc/modules/pipeline.rst
index 4356b3fe8d640..232b3ed72bbda 100644
--- a/doc/modules/pipeline.rst
+++ b/doc/modules/pipeline.rst
@@ -16,11 +16,16 @@ into one. This is useful as there is often a fixed sequence
 of steps in processing the data, for example feature selection, normalization
 and classification. :class:`Pipeline` serves two purposes here:
 
-    **Convenience**: You only have to call ``fit`` and ``predict`` once on your
+Convenience and encapsulation
+    You only have to call ``fit`` and ``predict`` once on your
     data to fit a whole sequence of estimators.
-
-    **Joint parameter selection**: You can :ref:`grid search `
+Joint parameter selection
+    You can :ref:`grid search `
     over parameters of all estimators in the pipeline at once.
+Safety
+    Pipelines help avoid leaking statistics from your test data into the
+    trained model in cross-validation, by ensuring that the same samples are
+    used to train the transformers and predictors.
 
 All estimators in a pipeline, except the last one, must be transformers
 (i.e. must have a ``transform`` method).
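
The "Safety" entry added by this patch can be illustrated with a minimal sketch. It is not part of the patch itself; the synthetic noise data, the choice of ``SelectKBest`` with ``k=20``, and all variable names are assumptions made purely for illustration. The labels are independent of the features, so an honest cross-validation score should sit near chance. Selecting features on the full dataset before cross-validating lets statistics from the held-out folds influence the model, whereas putting the selector inside a ``Pipeline`` re-fits it on the training portion of each fold only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (an assumption for this sketch): pure noise features
# and labels drawn independently of them.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 1000))
y = rng.randint(0, 2, size=100)

# Leaky variant: the selector is fit on ALL samples, including those
# that later serve as test folds during cross-validation.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_scores = cross_val_score(LogisticRegression(), X_selected, y, cv=5)

# Safe variant: the selector is a pipeline step, so within each fold it
# is fit on exactly the same training samples as the classifier.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression()),
])
safe_scores = cross_val_score(pipe, X, y, cv=5)

print("leaky mean accuracy:", leaky_scores.mean())
print("safe mean accuracy: ", safe_scores.mean())
```

On data like this the leaky estimate is expected to look noticeably better than the pipeline estimate, even though there is nothing to learn; the pipeline score is the one that reflects performance on unseen samples.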