From a18a705231ba48f52ab1dda54fbcfc559d8df350 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Fri, 6 Sep 2019 17:01:42 -0400 Subject: [PATCH 1/9] Added getting started guide --- doc/documentation.rst | 2 +- doc/getting_started.rst | 185 ++++++++++++++++++++++++++++++++++++++++ doc/index.rst | 1 + 3 files changed, 187 insertions(+), 1 deletion(-) create mode 100644 doc/getting_started.rst diff --git a/doc/documentation.rst b/doc/documentation.rst index a55fbe37258ae..48ec6d7cd390a 100644 --- a/doc/documentation.rst +++ b/doc/documentation.rst @@ -13,7 +13,7 @@ Documentation of scikit-learn |version|
-

Quick Start

+

Getting Started

A very short introduction into machine learning problems and how to solve them using scikit-learn. Presents basic concepts and conventions. diff --git a/doc/getting_started.rst b/doc/getting_started.rst new file mode 100644 index 0000000000000..cda0b515b8e83 --- /dev/null +++ b/doc/getting_started.rst @@ -0,0 +1,185 @@ +Getting Started +=============== + +The purpose of this guide is to illustrate some of the main features that +scikit-learn provides. It assumes a very basic working knowledge of machine +learning practices (model fitting, predicting, cross-validation, etc.). +Please refer to our :ref:`installation instructions +` for installing scikit-learn. + +Scikit-learn is an open source machine learning library that supports +supervised and unsupervised learning. It provides various tools for model +fitting, data preprocessing, model selection and evaluation, and many other +utilities. + +Fitting and predicting: estimator basics +---------------------------------------- + +Scikit-learn provides dozens of built-in machine learning algorithms and +models, called :term:`estimators`. Each estimator can be fitted to some data +using its :term:`fit` method. + +Here is a simple example where we fit a +:class:`~sklearn.ensemble.RandomForestClassifier` to some very basic data:: + + >>> from sklearn.ensemble import RandomForestClassifier + >>> clf = RandomForestClassifier(random_state=0) + >>> X = [[1, 2, 3], [11, 12, 13]] # 2 samples, 3 features + >>> y = [0, 1] # classes of each sample + >>> clf.fit(X, y) + RandomForestClassifier(random_state=0) + +The :term:`fit` method typically accepts 2 inputs: + +- The samples matrix (or design matrix) :term:`X`. The size of ``X`` + is ``(n_samples, n_features)``, which means that samples are represented + as rows and features are represented as columns. +- The target values :term:`y` which are real number for regression tasks, or + integer for classification (they can also be strings). For unsupervized + learning tasks, ``y`` need not be specified. ``y`` is usually 1d array where + the ith entry corresponds to the target of the ith sample (row) of ``X``. + +Both ``X`` and ``y`` are usually expected to be numpy arrays, though other +formats are also sometimes accepted (e.g. sparse matrices and the more +general :term:`array-like`). + +Once the estimator is fitted, it can be used for predicting target values of +new data. You don't need to re-train the estimator:: + + >>> clf.predict(X) # predict classes of the training data + array([0, 1]) + >>> clf.predict([[4, 5, 6], [14, 15, 16]]) # predict classes of new data + array([0, 1]) + +Transformers and Pipelines +-------------------------- + +Machine learning models are often composed of different parts. A typical +model consists of a pre-processing step that transforms the data, and a +final predictor that predicts target values. + +In scikit-learn, pre-processors and transformers follow the same API as the +estimators objects (they actually all inherit from the same +``BaseEstimator`` class). These objects don't have a :term:`predict` method +but rather a :term:`transform` method that outputs a newly transformed +sample matrix ``X``:: + + >>> from sklearn.preprocessing import StandardScaler + >>> X = [[0, 15], + ... [1, -10]] + >>> StandardScaler().fit(X).transform(X) + array([[-1., 1.], + [ 1., -1.]]) + +Transformers and estimators (predictors) can be combined together into a single +unifying object: a :class:`~sklearn.pipeline.Pipeline`. 
+ +In the following example, we :ref:`load the Iris dataset `, split it +into train and test sets, and compute its accuracy score on the test data. The +pipeline offers the same API as a regular estimator: it can be fitted and +used for predictions with ``fit`` and ``predict``:: + + >>> from sklearn.preprocessing import StandardScaler + >>> from sklearn.linear_model import LogisticRegression + >>> from sklearn.pipeline import make_pipeline + >>> from sklearn.datasets import load_iris + >>> from sklearn.model_selection import train_test_split + >>> from sklearn.metrics import accuracy_score + ... + >>> preprocessor = StandardScaler() + >>> predictor = LogisticRegression(random_state=0) + >>> pipe = make_pipeline(preprocessor, predictor) + ... + >>> X, y = load_iris(return_X_y=True) + >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0) + >>> pipe.fit(X_train, y_train) # fit the preprocessor first, then the predictor + Pipeline(steps=[('standardscaler', StandardScaler()), + ('logisticregression', LogisticRegression(random_state=0))]) + + >>> accuracy_score(pipe.predict(X_test), y_test) + 0.97... + +Model evaluation +---------------- + +Once a model has been fitted, it is natural to evaluate its performance on +unseen data. We have just seen the +:func:`~sklearn.model_selection.train_test_split` helper, but scikit-learn +provides many other tools for model evaluation, in particular for +:ref:`cross-validation `. + +We here briefly show how to perform a 3-folds cross-validation using the +:func:`~sklearn.model_selection.cross_validate` helper. Note that it is also +possible to manually iterate over the folds, and to use different data +splitting strategies. Please refer to our :ref:`User Guide +` for more details. + + >>> from sklearn.datasets import make_regression + >>> from sklearn.linear_model import LinearRegression + >>> from sklearn.model_selection import KFold + >>> from sklearn.model_selection import cross_validate + ... + >>> X, y = make_regression(n_samples=1000, random_state=0) + >>> lr = LinearRegression() + >>> cv = KFold(n_splits=3) + ... + >>> result = cross_validate(lr, X, y, cv=cv) + >>> result['test_score'] + array([1., 1., 1.]) + +Automatic parameter searches +---------------------------- + +All estimators have parameters that can be tuned for the predictions to be as +good as possible. For example a +:class:`~sklearn.ensemble.RandomForestRegressor` has a ``n_estimators`` +parameters that determines the number of trees in the forest, and a +``max_depth`` parameter that determines the maximum depth of each tree. +Quite often, it is not clear what the exact values of these parameters should +be since they depend on the data at hand. + +Scikit-learn provides tools to automatically search the best parameter +combinations (via cross-validation). In the following example, we randomly +search over the parameter state of a random forest with a +:class:`~sklearn.model_selection.RandomizedSearchCV` object. When the search +is over, the :class:`~sklearn.model_selection.RandomizedSearchCV` behaves as +a :class:`~sklearn.ensemble.RandomForestRegressor` that has been fitted with +the best set of parameters. Read more in the :ref:`User Guide +`:: + + >>> from sklearn.datasets.california_housing import fetch_california_housing + >>> from sklearn.ensemble import RandomForestRegressor + >>> from sklearn.model_selection import RandomizedSearchCV + >>> from sklearn.model_selection import train_test_split + >>> from scipy.stats import randint + ... 
+ >>> X, y = fetch_california_housing(return_X_y=True) + >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0) + + >>> forest = RandomForestRegressor(random_state=0) + >>> param_distributions = {'n_estimators': randint(1, 5), + ... 'max_depth': randint(5, 10)} + >>> search = RandomizedSearchCV(forest, + ... n_iter=5, + ... param_distributions=param_distributions, + ... random_state=0) + ... + >>> search.fit(X_train, y_train) + RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5, + param_distributions={'max_depth': ..., + 'n_estimators': ...}, + random_state=0) + + >>> search.best_params_ + {'max_depth': 9, 'n_estimators': 4} + >>> search.score(X_test, y_test) + 0.73... + + +Next steps +---------- + +This guide should give you an overview of some of the main features of the +library, but there is much to scikit-learn! Please refer to our +:ref:`user_guide` for details on all the tools that we provide. You can also +find an exhaustive list of the public API in the :ref:`api_ref`. diff --git a/doc/index.rst b/doc/index.rst index 70115ffe8a743..20dc6cb3bc885 100644 --- a/doc/index.rst +++ b/doc/index.rst @@ -349,6 +349,7 @@ preface tutorial/index + getting_started user_guide glossary auto_examples/index From dc4360e023429e0d9f370d18b50d9e6f6afc55ab Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Mon, 9 Sep 2019 12:42:13 -0400 Subject: [PATCH 2/9] addressed comments --- doc/getting_started.rst | 112 ++++++++++++++++++++++------------------ 1 file changed, 61 insertions(+), 51 deletions(-) diff --git a/doc/getting_started.rst b/doc/getting_started.rst index cda0b515b8e83..0cdc32b690efd 100644 --- a/doc/getting_started.rst +++ b/doc/getting_started.rst @@ -2,20 +2,20 @@ Getting Started =============== The purpose of this guide is to illustrate some of the main features that -scikit-learn provides. It assumes a very basic working knowledge of machine -learning practices (model fitting, predicting, cross-validation, etc.). -Please refer to our :ref:`installation instructions -` for installing scikit-learn. +``scikit-learn`` provides. It assumes a very basic working knowledge of +machine learning practices (model fitting, predicting, cross-validation, +etc.). Please refer to our :ref:`installation instructions +` for installing ``scikit-learn``. -Scikit-learn is an open source machine learning library that supports -supervised and unsupervised learning. It provides various tools for model -fitting, data preprocessing, model selection and evaluation, and many other -utilities. +``Scikit-learn`` is an open source machine learning library that supports +supervised and unsupervised learning. It also provides various tools for +model fitting, data preprocessing, model selection and evaluation, and many +other utilities. Fitting and predicting: estimator basics ---------------------------------------- -Scikit-learn provides dozens of built-in machine learning algorithms and +``Scikit-learn`` provides dozens of built-in machine learning algorithms and models, called :term:`estimators`. Each estimator can be fitted to some data using its :term:`fit` method. @@ -29,19 +29,20 @@ Here is a simple example where we fit a >>> clf.fit(X, y) RandomForestClassifier(random_state=0) -The :term:`fit` method typically accepts 2 inputs: +The :term:`fit` method generally accepts 2 inputs: - The samples matrix (or design matrix) :term:`X`. 
The size of ``X`` - is ``(n_samples, n_features)``, which means that samples are represented - as rows and features are represented as columns. -- The target values :term:`y` which are real number for regression tasks, or - integer for classification (they can also be strings). For unsupervized - learning tasks, ``y`` need not be specified. ``y`` is usually 1d array where - the ith entry corresponds to the target of the ith sample (row) of ``X``. - -Both ``X`` and ``y`` are usually expected to be numpy arrays, though other -formats are also sometimes accepted (e.g. sparse matrices and the more -general :term:`array-like`). + is typically ``(n_samples, n_features)``, which means that samples are + represented as rows and features are represented as columns. +- The target values :term:`y` which are real numbers for regression tasks, or + integers for classification (they can also be strings). For unsupervized + learning tasks, ``y`` does not need to be specified. ``y`` is usually 1d + array where the ``i``th entry corresponds to the target of the ``i``th + sample (row) of ``X``. + +Both ``X`` and ``y`` are usually expected to be numpy arrays or equivalent +:term:`array-like`, data types, though some estimators work with other +formats such as sparse matrices. Once the estimator is fitted, it can be used for predicting target values of new data. You don't need to re-train the estimator:: @@ -54,12 +55,12 @@ new data. You don't need to re-train the estimator:: Transformers and Pipelines -------------------------- -Machine learning models are often composed of different parts. A typical -model consists of a pre-processing step that transforms the data, and a +Machine learning worflows are often composed of different parts. A typical +pipeline consists of a pre-processing step that transforms the data, and a final predictor that predicts target values. -In scikit-learn, pre-processors and transformers follow the same API as the -estimators objects (they actually all inherit from the same +In ``scikit-learn``, pre-processors and transformers follow the same API as +the estimator objects (they actually all inherit from the same ``BaseEstimator`` class). These objects don't have a :term:`predict` method but rather a :term:`transform` method that outputs a newly transformed sample matrix ``X``:: @@ -77,7 +78,7 @@ unifying object: a :class:`~sklearn.pipeline.Pipeline`. In the following example, we :ref:`load the Iris dataset `, split it into train and test sets, and compute its accuracy score on the test data. The pipeline offers the same API as a regular estimator: it can be fitted and -used for predictions with ``fit`` and ``predict``:: +used for prediction with ``fit`` and ``predict``:: >>> from sklearn.preprocessing import StandardScaler >>> from sklearn.linear_model import LogisticRegression @@ -86,9 +87,10 @@ used for predictions with ``fit`` and ``predict``:: >>> from sklearn.model_selection import train_test_split >>> from sklearn.metrics import accuracy_score ... - >>> preprocessor = StandardScaler() - >>> predictor = LogisticRegression(random_state=0) - >>> pipe = make_pipeline(preprocessor, predictor) + >>> pipe = make_pipeline( + ... StandardScaler(), + ... LogisticRegression(random_state=0) + ... ) ... 
>>> X, y = load_iris(return_X_y=True) >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0) @@ -102,45 +104,45 @@ used for predictions with ``fit`` and ``predict``:: Model evaluation ---------------- -Once a model has been fitted, it is natural to evaluate its performance on -unseen data. We have just seen the -:func:`~sklearn.model_selection.train_test_split` helper, but scikit-learn -provides many other tools for model evaluation, in particular for -:ref:`cross-validation `. +Fitting a model to some data does not entail that it will predict well on +unseen data. This needs to be directly evaluated. We have just seen the +:func:`~sklearn.model_selection.train_test_split` helper that splits a +dataset into train and test sets, but ``scikit-learn`` provides many other +tools for model evaluation, in particular for :ref:`cross-validation +`. -We here briefly show how to perform a 3-folds cross-validation using the -:func:`~sklearn.model_selection.cross_validate` helper. Note that it is also -possible to manually iterate over the folds, and to use different data -splitting strategies. Please refer to our :ref:`User Guide -` for more details. +We here briefly show how to perform a 5-fold cross-validation procedure, +using the :func:`~sklearn.model_selection.cross_validate` helper. Note that +it is also possible to manually iterate over the folds, and to use different +data splitting strategies. Please refer to our :ref:`User Guide +` for more details:: >>> from sklearn.datasets import make_regression >>> from sklearn.linear_model import LinearRegression - >>> from sklearn.model_selection import KFold >>> from sklearn.model_selection import cross_validate ... >>> X, y = make_regression(n_samples=1000, random_state=0) >>> lr = LinearRegression() - >>> cv = KFold(n_splits=3) ... - >>> result = cross_validate(lr, X, y, cv=cv) + >>> result = cross_validate(lr, X, y) # defaults to 5-fold CV >>> result['test_score'] - array([1., 1., 1.]) + array([1., 1., 1., 1., 1.]) Automatic parameter searches ---------------------------- -All estimators have parameters that can be tuned for the predictions to be as -good as possible. For example a +All estimators have parameters (often called hyper-parameters in the +literature) that can be tuned. The generalization power of an estimator +often critically depends on a few parameters. For example a :class:`~sklearn.ensemble.RandomForestRegressor` has a ``n_estimators`` parameters that determines the number of trees in the forest, and a ``max_depth`` parameter that determines the maximum depth of each tree. -Quite often, it is not clear what the exact values of these parameters should -be since they depend on the data at hand. +Quite often, it is not clear what the exact values of these parameters +should be since they depend on the data at hand. -Scikit-learn provides tools to automatically search the best parameter +``Scikit-learn`` provides tools to automatically find the best parameter combinations (via cross-validation). In the following example, we randomly -search over the parameter state of a random forest with a +search over the parameter space of a random forest with a :class:`~sklearn.model_selection.RandomizedSearchCV` object. 
When the search is over, the :class:`~sklearn.model_selection.RandomizedSearchCV` behaves as a :class:`~sklearn.ensemble.RandomForestRegressor` that has been fitted with @@ -180,6 +182,14 @@ Next steps ---------- This guide should give you an overview of some of the main features of the -library, but there is much to scikit-learn! Please refer to our -:ref:`user_guide` for details on all the tools that we provide. You can also -find an exhaustive list of the public API in the :ref:`api_ref`. +library, but there is much more to ``scikit-learn``! + +Please refer to our :ref:`user_guide` for details on all the tools that we +provide. You can also find an exhaustive list of the public API in the +:ref:`api_ref`. + +You can also look at our numerous :ref:`examples ` that +illustrate the use of ``scikit-learn`` in many different contexts. + +The :ref:`tutorials ` also contain additional learning +resources. From e0d2f468af688def9286d9850b83f56b1376dcaa Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Mon, 9 Sep 2019 12:51:35 -0400 Subject: [PATCH 3/9] Added comments to code snippets --- doc/getting_started.rst | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/doc/getting_started.rst b/doc/getting_started.rst index 0cdc32b690efd..3799f2cdac06b 100644 --- a/doc/getting_started.rst +++ b/doc/getting_started.rst @@ -87,17 +87,20 @@ used for prediction with ``fit`` and ``predict``:: >>> from sklearn.model_selection import train_test_split >>> from sklearn.metrics import accuracy_score ... + >>> # create a pipeline object >>> pipe = make_pipeline( ... StandardScaler(), ... LogisticRegression(random_state=0) ... ) ... + >>> # Load the iris dataset and split it into train and test sets >>> X, y = load_iris(return_X_y=True) >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0) - >>> pipe.fit(X_train, y_train) # fit the preprocessor first, then the predictor + ... + >>> # Fit the whole pipeline and use it for predictions + >>> pipe.fit(X_train, y_train) Pipeline(steps=[('standardscaler', StandardScaler()), ('logisticregression', LogisticRegression(random_state=0))]) - >>> accuracy_score(pipe.predict(X_test), y_test) 0.97... @@ -157,23 +160,26 @@ the best set of parameters. Read more in the :ref:`User Guide ... >>> X, y = fetch_california_housing(return_X_y=True) >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0) - - >>> forest = RandomForestRegressor(random_state=0) + ... + >>> # define the parameter space that will be searched over >>> param_distributions = {'n_estimators': randint(1, 5), ... 'max_depth': randint(5, 10)} - >>> search = RandomizedSearchCV(forest, + ... + >>> # now create a searchCV object and fit it to the data + >>> search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), ... n_iter=5, ... param_distributions=param_distributions, ... random_state=0) - ... >>> search.fit(X_train, y_train) RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5, param_distributions={'max_depth': ..., 'n_estimators': ...}, random_state=0) - >>> search.best_params_ {'max_depth': 9, 'n_estimators': 4} + + >>> # The search object now acts like a normal random forest estimator + >>> # with max_depth=9 and n_estimators=4 >>> search.score(X_test, y_test) 0.73... 
From 932322b175ce3e8b3f5d9f44f411bd346154cdb5 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Tue, 10 Sep 2019 09:02:07 -0400 Subject: [PATCH 4/9] nits --- doc/getting_started.rst | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/doc/getting_started.rst b/doc/getting_started.rst index 3799f2cdac06b..23d7c38142a3c 100644 --- a/doc/getting_started.rst +++ b/doc/getting_started.rst @@ -93,14 +93,15 @@ used for prediction with ``fit`` and ``predict``:: ... LogisticRegression(random_state=0) ... ) ... - >>> # Load the iris dataset and split it into train and test sets + >>> # load the iris dataset and split it into train and test sets >>> X, y = load_iris(return_X_y=True) >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0) ... - >>> # Fit the whole pipeline and use it for predictions + >>> # fit the whole pipeline >>> pipe.fit(X_train, y_train) Pipeline(steps=[('standardscaler', StandardScaler()), ('logisticregression', LogisticRegression(random_state=0))]) + >>> # we can now use it like any other estimator >>> accuracy_score(pipe.predict(X_test), y_test) 0.97... @@ -178,7 +179,7 @@ the best set of parameters. Read more in the :ref:`User Guide >>> search.best_params_ {'max_depth': 9, 'n_estimators': 4} - >>> # The search object now acts like a normal random forest estimator + >>> # the search object now acts like a normal random forest estimator >>> # with max_depth=9 and n_estimators=4 >>> search.score(X_test, y_test) 0.73... From a0359a291f4adbacfca735113ec6c5266794ac24 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Tue, 10 Sep 2019 09:37:06 -0400 Subject: [PATCH 5/9] Added note about always searching over a pipeline --- doc/getting_started.rst | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/doc/getting_started.rst b/doc/getting_started.rst index 23d7c38142a3c..7383e12ec94f0 100644 --- a/doc/getting_started.rst +++ b/doc/getting_started.rst @@ -184,6 +184,21 @@ the best set of parameters. Read more in the :ref:`User Guide >>> search.score(X_test, y_test) 0.73... +.. note:: + + In practice, you almost always want to :ref:`search over a pipeline + `, instead of a single estimator. One of the main + reason is that if you apply a pre-processing step to the whole dataset + without using a pipeline, and then perform any kind of cross-validation, + you would be breaking the fundamental assumption of independence between + training and testing data. Indeed, since you pre-processed the data + using the whole dataset, some information about the test sets are + available to the train sets. This will lead to over-estimating the + generaliztion power of the estimator. + + Using a pipeline for cross-validation and searching will keep you from + this common pitfall. + Next steps ---------- From 44cf8a7019fbfbc4ca5715ed3ddb9fca7878a1d3 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Tue, 10 Sep 2019 09:38:39 -0400 Subject: [PATCH 6/9] typo --- doc/getting_started.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/getting_started.rst b/doc/getting_started.rst index 7383e12ec94f0..438861eb5bc18 100644 --- a/doc/getting_started.rst +++ b/doc/getting_started.rst @@ -194,7 +194,7 @@ the best set of parameters. Read more in the :ref:`User Guide training and testing data. Indeed, since you pre-processed the data using the whole dataset, some information about the test sets are available to the train sets. This will lead to over-estimating the - generaliztion power of the estimator. + generalization power of the estimator. 
Using a pipeline for cross-validation and searching will keep you from this common pitfall. From 1fc48ef25480937600df7e8e1f1d9f16a18158dc Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Tue, 10 Sep 2019 13:30:11 -0400 Subject: [PATCH 7/9] minimal adjustments --- doc/getting_started.rst | 53 +++++++++++++++++++++++++---------------- 1 file changed, 33 insertions(+), 20 deletions(-) diff --git a/doc/getting_started.rst b/doc/getting_started.rst index 438861eb5bc18..3ab662aa12cad 100644 --- a/doc/getting_started.rst +++ b/doc/getting_started.rst @@ -24,7 +24,8 @@ Here is a simple example where we fit a >>> from sklearn.ensemble import RandomForestClassifier >>> clf = RandomForestClassifier(random_state=0) - >>> X = [[1, 2, 3], [11, 12, 13]] # 2 samples, 3 features + >>> X = [[ 1, 2, 3], # 2 samples, 3 features + ... [11, 12, 13]] >>> y = [0, 1] # classes of each sample >>> clf.fit(X, y) RandomForestClassifier(random_state=0) @@ -37,7 +38,7 @@ The :term:`fit` method generally accepts 2 inputs: - The target values :term:`y` which are real numbers for regression tasks, or integers for classification (they can also be strings). For unsupervized learning tasks, ``y`` does not need to be specified. ``y`` is usually 1d - array where the ``i``th entry corresponds to the target of the ``i``th + array where the ``i`` th entry corresponds to the target of the ``i`` th sample (row) of ``X``. Both ``X`` and ``y`` are usually expected to be numpy arrays or equivalent @@ -52,18 +53,18 @@ new data. You don't need to re-train the estimator:: >>> clf.predict([[4, 5, 6], [14, 15, 16]]) # predict classes of new data array([0, 1]) -Transformers and Pipelines -------------------------- +Transformers and pre-processors +------------------------------- Machine learning worflows are often composed of different parts. A typical -pipeline consists of a pre-processing step that transforms the data, and a +pipeline consists of a pre-processing step that transforms or impute the data, and a final predictor that predicts target values. In ``scikit-learn``, pre-processors and transformers follow the same API as the estimator objects (they actually all inherit from the same -``BaseEstimator`` class). These objects don't have a :term:`predict` method -but rather a :term:`transform` method that outputs a newly transformed -sample matrix ``X``:: +``BaseEstimator`` class). The transformer objects don't have a +:term:`predict` method but rather a :term:`transform` method that outputs a +newly transformed sample matrix ``X``:: >>> from sklearn.preprocessing import StandardScaler >>> X = [[0, 15], ... [1, -10]] >>> StandardScaler().fit(X).transform(X) array([[-1., 1.], [ 1., -1.]]) -Transformers and estimators (predictors) can be combined together into a single -unifying object: a :class:`~sklearn.pipeline.Pipeline`. +Sometimes, you want to apply different transformations to different features: +the :ref:`ColumnTransformer` is designed for these +use-cases. + +Pipelines: chaining pre-processors and estimators +-------------------------------------------------- + +Transformers and estimators (predictors) can be combined together into a +single unifying object: a :class:`~sklearn.pipeline.Pipeline`. The pipeline +offers the same API as a regular estimator: it can be fitted and used for +prediction with ``fit`` and ``predict``. As we will see later, using a +pipeline will also prevent you from disclosing some testing data in your +training data.
In the following example, we :ref:`load the Iris dataset `, split it -into train and test sets, and compute its accuracy score on the test data. The -pipeline offers the same API as a regular estimator: it can be fitted and -used for prediction with ``fit`` and ``predict``:: +into train and test sets, and compute the accuracy score of a pipeline on +the test data:: >>> from sklearn.preprocessing import StandardScaler >>> from sklearn.linear_model import LogisticRegression @@ -117,9 +128,9 @@ tools for model evaluation, in particular for :ref:`cross-validation We here briefly show how to perform a 5-fold cross-validation procedure, using the :func:`~sklearn.model_selection.cross_validate` helper. Note that -it is also possible to manually iterate over the folds, and to use different -data splitting strategies. Please refer to our :ref:`User Guide -` for more details:: +it is also possible to manually iterate over the folds, use different +data splitting strategies, and use custom scoring functions. Please refer to +our :ref:`User Guide ` for more details:: >>> from sklearn.datasets import make_regression >>> from sklearn.linear_model import LinearRegression @@ -129,7 +140,7 @@ data splitting strategies. Please refer to our :ref:`User Guide >>> lr = LinearRegression() ... >>> result = cross_validate(lr, X, y) # defaults to 5-fold CV - >>> result['test_score'] + >>> result['test_score'] # r_squared score is high because dataset is easy array([1., 1., 1., 1., 1.]) Automatic parameter searches @@ -203,8 +214,10 @@ the best set of parameters. Read more in the :ref:`User Guide Next steps ---------- -This guide should give you an overview of some of the main features of the -library, but there is much more to ``scikit-learn``! +We have briefly covered estimator fitting and predicting, pre-processing +steps, pipelines, cross-validation tools and automatic hyper-parameter +searches. This guide should give you an overview of some of the main +features of the library, but there is much more to ``scikit-learn``! Please refer to our :ref:`user_guide` for details on all the tools that we provide. You can also find an exhaustive list of the public API in the From 8f42aef0d726621e419cebda81b7c79223aa3fda Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Tue, 10 Sep 2019 15:06:50 -0400 Subject: [PATCH 8/9] Addressed comments --- doc/getting_started.rst | 27 ++++++++++++++------------- 1 file changed, 14 insertions(+), 13 deletions(-) diff --git a/doc/getting_started.rst b/doc/getting_started.rst index 3ab662aa12cad..701d3d62a1757 100644 --- a/doc/getting_started.rst +++ b/doc/getting_started.rst @@ -36,13 +36,13 @@ The :term:`fit` method generally accepts 2 inputs: is typically ``(n_samples, n_features)``, which means that samples are represented as rows and features are represented as columns. - The target values :term:`y` which are real numbers for regression tasks, or - integers for classification (they can also be strings). For unsupervized - learning tasks, ``y`` does not need to be specified. ``y`` is usually 1d - array where the ``i`` th entry corresponds to the target of the ``i`` th - sample (row) of ``X``. + integers for classification (or any other discrete set of values). For + unsupervized learning tasks, ``y`` does not need to be specified. ``y`` is + usually 1d array where the ``i`` th entry corresponds to the target of the + ``i`` th sample (row) of ``X``. 
Both ``X`` and ``y`` are usually expected to be numpy arrays or equivalent -:term:`array-like`, data types, though some estimators work with other +:term:`array-like` data types, though some estimators work with other formats such as sparse matrices. Once the estimator is fitted, it can be used for predicting target values of @@ -57,7 +57,7 @@ Transformers and pre-processors ------------------------------- Machine learning worflows are often composed of different parts. A typical -pipeline consists of a pre-processing step that transforms or impute the +pipeline consists of a pre-processing step that transforms or imputes the data, and a final predictor that predicts target values. In ``scikit-learn``, pre-processors and transformers follow the same API as @@ -84,8 +84,8 @@ Transformers and estimators (predictors) can be combined together into a single unifying object: a :class:`~sklearn.pipeline.Pipeline`. The pipeline offers the same API as a regular estimator: it can be fitted and used for prediction with ``fit`` and ``predict``. As we will see later, using a -pipeline will also prevent you from disclosing some testing data in your -training data. +pipeline will also help prevent data leakage, i.e. disclosing some +testing data in your training data. In the following example, we :ref:`load the Iris dataset `, split it into train and test sets, and compute the accuracy score of a pipeline on @@ -150,7 +150,7 @@ All estimators have parameters (often called hyper-parameters in the literature) that can be tuned. The generalization power of an estimator often critically depends on a few parameters. For example a :class:`~sklearn.ensemble.RandomForestRegressor` has a ``n_estimators`` -parameters that determines the number of trees in the forest, and a +parameter that determines the number of trees in the forest, and a ``max_depth`` parameter that determines the maximum depth of each tree. Quite often, it is not clear what the exact values of these parameters should be since they depend on the data at hand. @@ -199,16 +199,17 @@ the best set of parameters.
Read more in the :ref:`User Guide training and testing data. Indeed, since you pre-processed the data using the whole dataset, some information about the test sets are available to the train sets. This will lead to over-estimating the - generalization power of the estimator (you can read more in the `kaggle - post `_. + generalization power of the estimator (you can read more in this `kaggle + post `_). Using a pipeline for cross-validation and searching will largely keep you from this common pitfall.
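Editor's note (not part of the patch series above): the note added in [PATCH 5/9] recommends searching over a pipeline rather than over a single estimator, but the guide itself never shows what such a search looks like. Below is a minimal sketch, assuming the same iris data and ``StandardScaler`` + ``LogisticRegression`` pipeline used earlier in the guide; the grid values are purely illustrative, and pipeline-step parameters are addressed with the ``<step name>__<parameter>`` convention of :class:`~sklearn.pipeline.Pipeline`::

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # chain the scaler and the classifier: inside each cross-validation split,
    # the scaler is fitted on the training folds only, so no information from
    # the held-out fold leaks into the pre-processing step
    pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))

    # make_pipeline names each step after its lowercased class name
    param_grid = {'logisticregression__C': [0.1, 1.0, 10.0]}  # illustrative values

    search = GridSearchCV(pipe, param_grid=param_grid)  # 5-fold CV by default
    search.fit(X_train, y_train)
    print(search.best_params_)           # best parameter combination found
    print(search.score(X_test, y_test))  # scored on data unseen during the search

The :class:`~sklearn.model_selection.RandomizedSearchCV` object used in the guide accepts a pipeline in exactly the same way, with ``param_distributions`` keys following the same ``<step name>__<parameter>`` naming convention.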