[MRG] DOC New Getting Started guide #14920

Merged
merged 12 commits into from
Sep 18, 2019
2 changes: 1 addition & 1 deletion doc/documentation.rst
@@ -13,7 +13,7 @@ Documentation of scikit-learn |version|
<!-- row -->
<div class="row-fluid">
<div class="span4 box">
<h2><a href="tutorial/basic/tutorial.html">Quick Start</a></h2>
<h2><a href="getting_started.html">Getting Started</a></h2>
<blockquote>A very short introduction into machine learning
problems and how to solve them using scikit-learn.
Presents basic concepts and conventions.
231 changes: 231 additions & 0 deletions doc/getting_started.rst
@@ -0,0 +1,231 @@
Getting Started
===============

The purpose of this guide is to illustrate some of the main features that
``scikit-learn`` provides. It assumes a very basic working knowledge of
machine learning practices (model fitting, predicting, cross-validation,
etc.). Please refer to our :ref:`installation instructions
<installation-instructions>` for installing ``scikit-learn``.

``Scikit-learn`` is an open source machine learning library that supports
supervised and unsupervised learning. It also provides various tools for
model fitting, data preprocessing, model selection and evaluation, and many
other utilities.

Fitting and predicting: estimator basics
----------------------------------------

``Scikit-learn`` provides dozens of built-in machine learning algorithms and
models, called :term:`estimators`. Each estimator can be fitted to some data
using its :term:`fit` method.
Member

Maybe it's not a bad idea to mention that it's the equivalent of the train method in some other libraries.

Member Author

isn't fit the most common term though? I mean, scikit-learn pretty much set the standard ^^


Here is a simple example where we fit a
:class:`~sklearn.ensemble.RandomForestClassifier` to some very basic data::

  >>> from sklearn.ensemble import RandomForestClassifier
  >>> clf = RandomForestClassifier(random_state=0)
  >>> X = [[ 1,  2,  3],  # 2 samples, 3 features
  ...      [11, 12, 13]]
  >>> y = [0, 1]  # classes of each sample
  >>> clf.fit(X, y)
  RandomForestClassifier(random_state=0)

The :term:`fit` method generally accepts 2 inputs:

- The samples matrix (or design matrix) :term:`X`. The size of ``X``
  is typically ``(n_samples, n_features)``, which means that samples are
  represented as rows and features are represented as columns.
- The target values :term:`y` which are real numbers for regression tasks, or
  integers for classification (or any other discrete set of values). For
  unsupervised learning tasks, ``y`` does not need to be specified. ``y`` is
  usually a 1d array where the ``i`` th entry corresponds to the target of the
  ``i`` th sample (row) of ``X``.

Both ``X`` and ``y`` are usually expected to be numpy arrays or equivalent
:term:`array-like` data types, though some estimators work with other
formats such as sparse matrices.
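
As a minimal sketch of the input conventions described above, the same toy data can be stored as NumPy arrays. The variable names ``n_samples`` and ``n_features`` are only illustrative, and the data is the same made-up example used earlier:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Same toy data as in the example above, stored as NumPy arrays.
X = np.array([[1, 2, 3],
              [11, 12, 13]])
y = np.array([0, 1])

# Rows of X are samples, columns are features.
n_samples, n_features = X.shape

# y holds one target per sample (row) of X.
assert y.shape == (n_samples,)

clf = RandomForestClassifier(random_state=0).fit(X, y)

# After fitting, the classifier remembers the distinct class labels seen in y.
classes = clf.classes_
```

Plain nested lists work too, as in the example above; estimators convert array-like inputs internally.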

Once the estimator is fitted, it can be used for predicting target values of
new data. You don't need to re-train the estimator::

  >>> clf.predict(X)  # predict classes of the training data
  array([0, 1])
  >>> clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data
  array([0, 1])

Transformers and pre-processors
-------------------------------

Machine learning workflows are often composed of different parts. A typical
pipeline consists of a pre-processing step that transforms or imputes the
data, and a final predictor that predicts target values.

In ``scikit-learn``, pre-processors and transformers follow the same API as
the estimator objects (they actually all inherit from the same
``BaseEstimator`` class). The transformer objects don't have a
:term:`predict` method but rather a :term:`transform` method that outputs a
newly transformed sample matrix ``X``::

  >>> from sklearn.preprocessing import StandardScaler
  >>> X = [[0, 15],
  ...      [1, -10]]
  >>> StandardScaler().fit(X).transform(X)
  array([[-1.,  1.],
         [ 1., -1.]])
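
To make the split between ``fit`` and ``transform`` more concrete, here is a small sketch that keeps the fitted scaler around and applies it to a new sample. The sample ``X_new`` is hypothetical and was chosen to equal the column means of the training data:

```python
from sklearn.preprocessing import StandardScaler

X_train = [[0, 15],
           [1, -10]]

# fit() learns the per-feature mean and standard deviation ...
scaler = StandardScaler().fit(X_train)
means = scaler.mean_    # per-column means of X_train
scales = scaler.scale_  # per-column standard deviations of X_train

# ... and transform() reuses them on data the scaler has not seen.
X_new = [[0.5, 2.5]]    # hypothetical new sample, equal to the column means
X_new_scaled = scaler.transform(X_new)
```

Because ``X_new`` sits exactly at the learned means, it should be mapped to zeros.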

Sometimes, you want to apply different transformations to different features:
the :ref:`ColumnTransformer<column_transformer>` is designed for these
use-cases.
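
As a rough illustration of that use-case, the sketch below standardizes the first column and min-max scales the second. The data and the step names ``"standardize"`` and ``"minmax"`` are made up for the example:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data with two numeric features on different scales.
X = np.array([[0.0, 10.0],
              [1.0, 20.0],
              [2.0, 30.0]])

# Each entry is (name, transformer, columns the transformer applies to).
ct = ColumnTransformer([
    ("standardize", StandardScaler(), [0]),  # zero mean, unit variance
    ("minmax", MinMaxScaler(), [1]),         # rescale to the [0, 1] range
])

# Output columns follow the order of the transformers listed above.
X_t = ct.fit_transform(X)
```

The ``ColumnTransformer`` is itself a transformer, so it can be used anywhere a single pre-processor could.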

Pipelines: chaining pre-processors and estimators
--------------------------------------------------

Transformers and estimators (predictors) can be combined together into a
single unifying object: a :class:`~sklearn.pipeline.Pipeline`. The pipeline
offers the same API as a regular estimator: it can be fitted and used for
prediction with ``fit`` and ``predict``. As we will see later, using a
pipeline will also protect you from data leakage, i.e. disclosing some
testing data in your training data.

In the following example, we :ref:`load the Iris dataset <datasets>`, split it
into train and test sets, and compute the accuracy score of a pipeline on
the test data::

  >>> from sklearn.preprocessing import StandardScaler
  >>> from sklearn.linear_model import LogisticRegression
  >>> from sklearn.pipeline import make_pipeline
  >>> from sklearn.datasets import load_iris
  >>> from sklearn.model_selection import train_test_split
  >>> from sklearn.metrics import accuracy_score
  ...
Member

it's interesting that the doctest tests pass here. I remember they really needed a space after ...

  >>> # create a pipeline object
  >>> pipe = make_pipeline(
  ...     StandardScaler(),
  ...     LogisticRegression(random_state=0)
  ... )
  ...
  >>> # load the iris dataset and split it into train and test sets
  >>> X, y = load_iris(return_X_y=True)
  >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
  ...
  >>> # fit the whole pipeline
  >>> pipe.fit(X_train, y_train)
  Pipeline(steps=[('standardscaler', StandardScaler()),
                  ('logisticregression', LogisticRegression(random_state=0))])
  >>> # we can now use it like any other estimator
  >>> # (accuracy_score expects the true labels first, then the predictions)
  >>> accuracy_score(y_test, pipe.predict(X_test))
  0.97...

Model evaluation
----------------

Fitting a model to some data does not entail that it will predict well on
unseen data. This needs to be directly evaluated. We have just seen the
:func:`~sklearn.model_selection.train_test_split` helper that splits a
dataset into train and test sets, but ``scikit-learn`` provides many other
tools for model evaluation, in particular for :ref:`cross-validation
<cross_validation>`.

Here we briefly show how to perform a 5-fold cross-validation procedure,
using the :func:`~sklearn.model_selection.cross_validate` helper. Note that
it is also possible to manually iterate over the folds, use different
data splitting strategies, and use custom scoring functions. Please refer to
our :ref:`User Guide <cross_validation>` for more details::

  >>> from sklearn.datasets import make_regression
  >>> from sklearn.linear_model import LinearRegression
  >>> from sklearn.model_selection import cross_validate
  ...
  >>> X, y = make_regression(n_samples=1000, random_state=0)
  >>> lr = LinearRegression()
  ...
  >>> result = cross_validate(lr, X, y)  # defaults to 5-fold CV
  >>> result['test_score']  # r_squared score is high because dataset is easy
  array([1., 1., 1., 1., 1.])
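
The options mentioned above (a different splitting strategy, a non-default score, and manual iteration over the folds) could look roughly like the following sketch. The choice of 3 shuffled folds and of the ``"neg_mean_absolute_error"`` scorer is arbitrary and only meant for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

X, y = make_regression(n_samples=1000, random_state=0)
lr = LinearRegression()

# An explicit splitter instead of the default: 3 shuffled folds.
cv = KFold(n_splits=3, shuffle=True, random_state=0)

# Pass a scorer name to evaluate a metric other than the default R^2.
result = cross_validate(lr, X, y, cv=cv, scoring="neg_mean_absolute_error")
scores = result["test_score"]  # one negated error per fold

# Manual iteration over the same folds, refitting the model each time.
manual_scores = []
for train_idx, test_idx in cv.split(X):
    lr.fit(X[train_idx], y[train_idx])
    manual_scores.append(lr.score(X[test_idx], y[test_idx]))  # R^2 per fold
```

Scorers whose name starts with ``neg_`` return negated errors, so that higher values are always better.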

Automatic parameter searches
Member

Could even call this Hyper parameter tuning

Member

lol, I like your subtle thumb down, what would you call it instead @jnothman ?

Member

I gave a thumbs up to the "aka hyperparameters" suggestion below.

Member

There is various terminology used in scikit-learn that is not ideal in hindsight (as language changes), but for consistency it's best to call these "parameters". I'd also consider calling it "Model selection"

Member Author

"Automatic parameter search and tuning" ?

----------------------------

All estimators have parameters (often called hyper-parameters in the
literature) that can be tuned. The generalization power of an estimator
often critically depends on a few parameters. For example a
:class:`~sklearn.ensemble.RandomForestRegressor` has a ``n_estimators``
parameter that determines the number of trees in the forest, and a
``max_depth`` parameter that determines the maximum depth of each tree.
Quite often, it is not clear what the exact values of these parameters
should be since they depend on the data at hand.
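
As a small sketch of what "parameters" means in code, the snippet below sets the two parameters named above at construction time and then reads and changes them through ``get_params`` and ``set_params``. The specific values are arbitrary:

```python
from sklearn.ensemble import RandomForestRegressor

# Parameters are passed to the constructor ...
reg = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=0)

# ... can be inspected as a dict ...
params = reg.get_params()
n_estimators = params["n_estimators"]

# ... and can be changed later, before (re-)fitting.
reg.set_params(max_depth=5)
```

The parameter search tools rely on this same ``get_params`` / ``set_params`` interface to try out different combinations.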

``Scikit-learn`` provides tools to automatically find the best parameter
combinations (via cross-validation). In the following example, we randomly
search over the parameter space of a random forest with a
:class:`~sklearn.model_selection.RandomizedSearchCV` object. When the search
is over, the :class:`~sklearn.model_selection.RandomizedSearchCV` behaves as
a :class:`~sklearn.ensemble.RandomForestRegressor` that has been fitted with
Member

we should use a pipeline instead of a single model, to illustrate how to treat the whole pipeline as an estimator, as well as how to pass grid search parameters of a pipeline.

Member Author

Ugh, I agree but I'm a bit concerned that the estimator__param logic is a bit too much/scary for such an introduction.

Would you be OK with adding a note that basically says that in practice you always want to use a pipeline, if only because it prevents data leaks?

Member

I don't think it's scary. We may have to add a paragraph or two to explain how it works, but ideally at the end of this tutorial, the user has a code snippet they can copy/paste and use. I know we want to discourage people from copy pasting random code, but that's what they do. Also, if it's too scary for people to pass the pipeline parameters to a gridsearch, then we need to change the API, since that's not gonna happen (any time soon at least) I think we really should have it in the first tutorial.

We also have a specific example for the combination of grid search and pipeline, we can have the code snippet here, with a bit of explanation, and then link to that example.

Member Author

@NicolasHug Sep 10, 2019

I'll try to come up with something, but I really don't think we should consider the getting-started guide as a tutorial.

The way I see it, the purpose of this guide is to showcase the main features of scikit-learn, ideally with as little cognitive overload as possible. IMO grid searching a pipeline adds a significant cognitive overload when the point is simply to showcase grid search.

Member Author

I made this

>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.model_selection import RandomizedSearchCV
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.model_selection import train_test_split
>>> from scipy.stats import randint
...
>>> X, y = fetch_california_housing(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
...
>>> # create a pipeline
>>> pipe = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=0))
...
>>> # define the parameter space that will be searched over. We here only
>>> # search over the parameters of the random forest.
>>> param_distributions = {'randomforestregressor__n_estimators': randint(1, 5),
...                        'randomforestregressor__max_depth': randint(5, 10)}
...
>>> # now create a searchCV object and fit it to the data
>>> search = RandomizedSearchCV(estimator=pipe,
...                             n_iter=5,
...                             param_distributions=param_distributions,
...                             random_state=0)
>>> search.fit(X_train, y_train)
RandomizedSearchCV(estimator=Pipeline(steps=[('standardscaler',
                                              StandardScaler()),
                                             ('randomforestregressor',
                                              RandomForestRegressor(random_state=0))]),
                   n_iter=5,
                   param_distributions={'randomforestregressor__max_depth': ...,
                                        'randomforestregressor__n_estimators': ...},
                   random_state=0)
>>> search.best_params_
{'randomforestregressor__max_depth': 9, 'randomforestregressor__n_estimators': 4}

>>> # the search object now acts like a normal pipeline / estimator
>>> # with max_depth=9 and n_estimators=4
>>> search.score(X_test, y_test)
0.73...

I don't really like it because I think it's way too much code for illustrating the grid searching (the current example is already too big IMO), and it's also a bad example because you don't care about scaling when using a RF.

I've tried to think of real-case scenarios e.g. using an imputer, but that requires much more code, and that is irrelevant w.r.t the point of the example which is to illustrate grid search (in its simplest form).

I'm still open to the idea, but unless we can have a simple example, I would be -1. We need simple examples in this guide else users will be discouraged right from the start.

Member Author

related: I added a note in
a0359a2

the best set of parameters. Read more in the :ref:`User Guide
<grid_search>`::

  >>> from sklearn.datasets import fetch_california_housing
  >>> from sklearn.ensemble import RandomForestRegressor
  >>> from sklearn.model_selection import RandomizedSearchCV
  >>> from sklearn.model_selection import train_test_split
  >>> from scipy.stats import randint
  ...
  >>> X, y = fetch_california_housing(return_X_y=True)
  >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
  ...
  >>> # define the parameter space that will be searched over
  >>> param_distributions = {'n_estimators': randint(1, 5),
  ...                        'max_depth': randint(5, 10)}
  ...
  >>> # now create a searchCV object and fit it to the data
  >>> search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
  ...                             n_iter=5,
  ...                             param_distributions=param_distributions,
  ...                             random_state=0)
  >>> search.fit(X_train, y_train)
  RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5,
                     param_distributions={'max_depth': ...,
                                          'n_estimators': ...},
                     random_state=0)
  >>> search.best_params_
  {'max_depth': 9, 'n_estimators': 4}

  >>> # the search object now acts like a normal random forest estimator
  >>> # with max_depth=9 and n_estimators=4
  >>> search.score(X_test, y_test)
  0.73...

.. note::

    In practice, you almost always want to :ref:`search over a pipeline
    <composite_grid_search>`, instead of a single estimator. One of the main
    reasons is that if you apply a pre-processing step to the whole dataset
    without using a pipeline, and then perform any kind of cross-validation,
    you would be breaking the fundamental assumption of independence between
    training and testing data. Indeed, since you pre-processed the data
    using the whole dataset, some information about the test sets is
    available to the train sets. This will lead to over-estimating the
    generalization power of the estimator (you can read more in this `Kaggle
    post <https://www.kaggle.com/alexisbcook/data-leakage>`_).

    Using a pipeline for cross-validation and searching will largely keep
    you from this common pitfall.
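
A minimal sketch of the leakage-free setup described in the note, assuming the Iris data and the scaler plus logistic regression pipeline used earlier in this guide:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The scaler is part of the estimator, so for each fold it is re-fitted
# on the training portion only and merely applied to the held-out portion.
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))

result = cross_validate(pipe, X, y)  # defaults to 5-fold CV
scores = result["test_score"]        # one accuracy value per fold
```

Scaling ``X`` once up front and then cross-validating the bare classifier would instead let statistics from the held-out folds influence the training data.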


Next steps
----------

We have briefly covered estimator fitting and predicting, pre-processing
steps, pipelines, cross-validation tools and automatic hyper-parameter
searches. This guide should give you an overview of some of the main
features of the library, but there is much more to ``scikit-learn``!

Please refer to our :ref:`user_guide` for details on all the tools that we
provide. You can also find an exhaustive list of the public API in the
:ref:`api_ref`.

You can also look at our numerous :ref:`examples <general_examples>` that
illustrate the use of ``scikit-learn`` in many different contexts.

The :ref:`tutorials <tutorial_menu>` also contain additional learning
resources.
1 change: 1 addition & 0 deletions doc/index.rst
@@ -349,6 +349,7 @@

preface
tutorial/index
getting_started
user_guide
glossary
auto_examples/index