[MRG] DOC New Getting Started guide #14920
@@ -0,0 +1,231 @@

Getting Started
===============

The purpose of this guide is to illustrate some of the main features that
``scikit-learn`` provides. It assumes a very basic working knowledge of
machine learning practices (model fitting, predicting, cross-validation,
etc.). Please refer to our :ref:`installation instructions
<installation-instructions>` for installing ``scikit-learn``.

``Scikit-learn`` is an open source machine learning library that supports
supervised and unsupervised learning. It also provides various tools for
model fitting, data preprocessing, model selection and evaluation, and many
other utilities.

Fitting and predicting: estimator basics
----------------------------------------

``Scikit-learn`` provides dozens of built-in machine learning algorithms and
models, called :term:`estimators`. Each estimator can be fitted to some data
using its :term:`fit` method.

Here is a simple example where we fit a
:class:`~sklearn.ensemble.RandomForestClassifier` to some very basic data::

    >>> from sklearn.ensemble import RandomForestClassifier
    >>> clf = RandomForestClassifier(random_state=0)
    >>> X = [[ 1,  2,  3],  # 2 samples, 3 features
    ...      [11, 12, 13]]
    >>> y = [0, 1]  # classes of each sample
    >>> clf.fit(X, y)
    RandomForestClassifier(random_state=0)

The :term:`fit` method generally accepts 2 inputs:

- The samples matrix (or design matrix) :term:`X`. The size of ``X``
  is typically ``(n_samples, n_features)``, which means that samples are
  represented as rows and features are represented as columns.
- The target values :term:`y`, which are real numbers for regression tasks, or
  integers for classification (or any other discrete set of values). For
  unsupervised learning tasks, ``y`` does not need to be specified. ``y`` is
  usually a 1d array where the ``i``-th entry corresponds to the target of the
  ``i``-th sample (row) of ``X``.

Both ``X`` and ``y`` are usually expected to be numpy arrays or equivalent
:term:`array-like` data types, though some estimators work with other
formats such as sparse matrices.
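
For instance, here is a minimal sketch (reusing the ``clf`` and ``y`` defined
above; the variable name ``X_array`` is just for illustration) showing that a
numpy array and a scipy sparse matrix are both accepted::

    >>> import numpy as np
    >>> from scipy.sparse import csr_matrix
    >>> X_array = np.array([[ 1,  2,  3],  # same data as above, as a numpy array
    ...                     [11, 12, 13]])
    >>> clf.fit(X_array, y)  # works exactly like a list of lists
    RandomForestClassifier(random_state=0)
    >>> clf.fit(csr_matrix(X_array), y)  # a sparse matrix is also accepted
    RandomForestClassifier(random_state=0)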

Once the estimator is fitted, it can be used for predicting target values of
new data. You don't need to re-train the estimator::

    >>> clf.predict(X)  # predict classes of the training data
    array([0, 1])
    >>> clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data
    array([0, 1])

Transformers and pre-processors
-------------------------------

Machine learning workflows are often composed of different parts. A typical
pipeline consists of a pre-processing step that transforms or imputes the
data, and a final predictor that predicts target values.

In ``scikit-learn``, pre-processors and transformers follow the same API as
the estimator objects (they actually all inherit from the same
``BaseEstimator`` class). The transformer objects don't have a
:term:`predict` method but rather a :term:`transform` method that outputs a
newly transformed sample matrix ``X``::

    >>> from sklearn.preprocessing import StandardScaler
    >>> X = [[0, 15],
    ...      [1, -10]]
    >>> StandardScaler().fit(X).transform(X)
    array([[-1.,  1.],
           [ 1., -1.]])
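
As an aside, every transformer also provides a ``fit_transform`` method, a
shorthand for fitting and transforming on the same data. A minimal sketch,
reusing the ``X`` defined above::

    >>> # equivalent to StandardScaler().fit(X).transform(X)
    >>> StandardScaler().fit_transform(X)
    array([[-1.,  1.],
           [ 1., -1.]])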

Sometimes, you want to apply different transformations to different features:
the :ref:`ColumnTransformer<column_transformer>` is designed for these
use-cases.
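
For instance, here is a minimal sketch of how such a transformer could be set
up (the column indices and the choice of ``OneHotEncoder`` are illustrative
assumptions, not taken from the guide above)::

    >>> from sklearn.compose import make_column_transformer
    >>> from sklearn.preprocessing import OneHotEncoder, StandardScaler
    ...
    >>> # scale the first (numeric) column and one-hot encode the second
    >>> # (categorical) one; remaining columns are dropped by default
    >>> ct = make_column_transformer(
    ...     (StandardScaler(), [0]),
    ...     (OneHotEncoder(), [1]))

Like any transformer, ``ct`` can then be fitted with ``fit`` and applied with
``transform``.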

Pipelines: chaining pre-processors and estimators
-------------------------------------------------

Transformers and estimators (predictors) can be combined together into a
single unifying object: a :class:`~sklearn.pipeline.Pipeline`. The pipeline
offers the same API as a regular estimator: it can be fitted and used for
prediction with ``fit`` and ``predict``. As we will see later, using a
pipeline will also protect you from data leakage, i.e. disclosing some
testing data in your training data.

In the following example, we :ref:`load the Iris dataset <datasets>`, split it
into train and test sets, and compute the accuracy score of a pipeline on
the test data::

    >>> from sklearn.preprocessing import StandardScaler
    >>> from sklearn.linear_model import LogisticRegression
    >>> from sklearn.pipeline import make_pipeline
    >>> from sklearn.datasets import load_iris
    >>> from sklearn.model_selection import train_test_split
    >>> from sklearn.metrics import accuracy_score
    ...
    >>> # create a pipeline object
    >>> pipe = make_pipeline(
    ...     StandardScaler(),
    ...     LogisticRegression(random_state=0)
    ... )
    ...
    >>> # load the iris dataset and split it into train and test sets
    >>> X, y = load_iris(return_X_y=True)
    >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    ...
    >>> # fit the whole pipeline
    >>> pipe.fit(X_train, y_train)
    Pipeline(steps=[('standardscaler', StandardScaler()),
                    ('logisticregression', LogisticRegression(random_state=0))])
    >>> # we can now use it like any other estimator
    >>> accuracy_score(pipe.predict(X_test), y_test)
    0.97...

Review discussion (on the doctest above):

* it's interesting that the doctest tests pass here. I remember they really
  needed a space after

Model evaluation
----------------

Fitting a model to some data does not entail that it will predict well on
unseen data. This needs to be directly evaluated. We have just seen the
:func:`~sklearn.model_selection.train_test_split` helper that splits a
dataset into train and test sets, but ``scikit-learn`` provides many other
tools for model evaluation, in particular for :ref:`cross-validation
<cross_validation>`.

Here, we briefly show how to perform a 5-fold cross-validation procedure,
using the :func:`~sklearn.model_selection.cross_validate` helper. Note that
it is also possible to manually iterate over the folds, use different
data splitting strategies, and use custom scoring functions. Please refer to
our :ref:`User Guide <cross_validation>` for more details::

    >>> from sklearn.datasets import make_regression
    >>> from sklearn.linear_model import LinearRegression
    >>> from sklearn.model_selection import cross_validate
    ...
    >>> X, y = make_regression(n_samples=1000, random_state=0)
    >>> lr = LinearRegression()
    ...
    >>> result = cross_validate(lr, X, y)  # defaults to 5-fold CV
    >>> result['test_score']  # r_squared score is high because dataset is easy
    array([1., 1., 1., 1., 1.])
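
As a minimal sketch of those variants (the splitting strategy and metric here
are illustrative choices), the same helper also accepts an explicit splitter
and a different scoring function, reusing the ``lr``, ``X`` and ``y`` defined
above::

    >>> from sklearn.model_selection import KFold
    ...
    >>> # illustrative choices: an explicit 3-fold splitter and MAE scoring
    >>> result = cross_validate(lr, X, y,
    ...                         cv=KFold(n_splits=3, shuffle=True, random_state=0),
    ...                         scoring='neg_mean_absolute_error')
    >>> sorted(result.keys())
    ['fit_time', 'score_time', 'test_score']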

Automatic parameter searches
----------------------------

Review discussion (on this section's title):

* Could even call this
* lol, I like your subtle thumb down, what would you call it instead
  @jnothman ?
* I gave a thumbs up to the "aka hyperparameters" suggestion below.
* There is various terminology used in scikit-learn that is not ideal in
  hindsight (as language changes), but for consistency it's best to call
  these "parameters". I'd also consider calling it "Model selection".
* "Automatic parameter search and tuning"?

All estimators have parameters (often called hyper-parameters in the
literature) that can be tuned. The generalization power of an estimator
often critically depends on a few parameters. For example, a
:class:`~sklearn.ensemble.RandomForestRegressor` has an ``n_estimators``
parameter that determines the number of trees in the forest, and a
``max_depth`` parameter that determines the maximum depth of each tree.
Quite often, it is not clear what the exact values of these parameters
should be since they depend on the data at hand.
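
As a quick aside, such parameters are passed to the estimator's constructor
and can be inspected with ``get_params``. A minimal sketch (the values chosen
here are arbitrary, purely for illustration)::

    >>> from sklearn.ensemble import RandomForestRegressor
    ...
    >>> # arbitrary illustrative values for the two parameters
    >>> rf = RandomForestRegressor(n_estimators=50, max_depth=5)
    >>> rf.get_params()['max_depth']
    5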

``Scikit-learn`` provides tools to automatically find the best parameter
combinations (via cross-validation). In the following example, we randomly
search over the parameter space of a random forest with a
:class:`~sklearn.model_selection.RandomizedSearchCV` object. When the search
is over, the :class:`~sklearn.model_selection.RandomizedSearchCV` behaves as
a :class:`~sklearn.ensemble.RandomForestRegressor` that has been fitted with
the best set of parameters. Read more in the :ref:`User Guide
<grid_search>`::

    >>> from sklearn.datasets import fetch_california_housing
    >>> from sklearn.ensemble import RandomForestRegressor
    >>> from sklearn.model_selection import RandomizedSearchCV
    >>> from sklearn.model_selection import train_test_split
    >>> from scipy.stats import randint
    ...
    >>> X, y = fetch_california_housing(return_X_y=True)
    >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    ...
    >>> # define the parameter space that will be searched over
    >>> param_distributions = {'n_estimators': randint(1, 5),
    ...                        'max_depth': randint(5, 10)}
    ...
    >>> # now create a searchCV object and fit it to the data
    >>> search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
    ...                             n_iter=5,
    ...                             param_distributions=param_distributions,
    ...                             random_state=0)
    >>> search.fit(X_train, y_train)
    RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5,
                       param_distributions={'max_depth': ...,
                                            'n_estimators': ...},
                       random_state=0)
    >>> search.best_params_
    {'max_depth': 9, 'n_estimators': 4}

    >>> # the search object now acts like a normal random forest estimator
    >>> # with max_depth=9 and n_estimators=4
    >>> search.score(X_test, y_test)
    0.73...

Review discussion (on this example):

* we should use a pipeline instead of a single model, to illustrate how to
  treat the whole pipeline as an estimator, as well as how to pass grid
  search parameters of a pipeline.

* Ugh, I agree but I'm a bit concerned that the
  Would you be OK to add a note that basically says in practice you always
  want to use a pipeline, let alone for the fact that it prevents data leaks?

* I don't think it's scary. We may have to add a paragraph or two to explain
  how it works, but ideally at the end of this tutorial, the user has a code
  snippet they can copy/paste and use. I know we want to discourage people
  from copy pasting random code, but that's what they do. Also, if it's too
  scary for people to pass the pipeline parameters to a gridsearch, then we
  need to change the API, since that's not gonna happen (any time soon at
  least). I think we really should have it in the first tutorial. We also
  have a specific example for the combination of grid search and pipeline;
  we can have the code snippet here, with a bit of explanation, and then
  link to that example.

* I'll try to come up with something, but I really don't think we should
  consider the getting-started guide as a tutorial. The way I see it, the
  purpose of this guide is to showcase the main features of scikit-learn,
  ideally with as little cognitive overload as possible. IMO grid searching
  a pipeline adds a significant cognitive overload when the point is simply
  to showcase grid search.

* I made this::

      >>> from sklearn.datasets import fetch_california_housing
      >>> from sklearn.ensemble import RandomForestRegressor
      >>> from sklearn.model_selection import RandomizedSearchCV
      >>> from sklearn.pipeline import make_pipeline
      >>> from sklearn.preprocessing import StandardScaler
      >>> from sklearn.model_selection import train_test_split
      >>> from scipy.stats import randint
      ...
      >>> X, y = fetch_california_housing(return_X_y=True)
      >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
      ...
      >>> # create a pipeline
      >>> pipe = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=0))
      ...
      >>> # define the parameter space that will be searched over. We here only
      >>> # search over the parameters of the random forest.
      >>> param_distributions = {'randomforestregressor__n_estimators': randint(1, 5),
      ...                        'randomforestregressor__max_depth': randint(5, 10)}
      ...
      >>> # now create a searchCV object and fit it to the data
      >>> search = RandomizedSearchCV(estimator=pipe,
      ...                             n_iter=5,
      ...                             param_distributions=param_distributions,
      ...                             random_state=0)
      >>> search.fit(X_train, y_train)
      RandomizedSearchCV(estimator=Pipeline(steps=[('standardscaler',
                                                    StandardScaler()),
                                                   ('randomforestregressor',
                                                    RandomForestRegressor(random_state=0))]),
                         n_iter=5,
                         param_distributions={'randomforestregressor__max_depth': ...,
                                              'randomforestregressor__n_estimators': ...},
                         random_state=0)
      >>> search.best_params_
      {'randomforestregressor__max_depth': 9, 'randomforestregressor__n_estimators': 4}
      >>> # the search object now acts like a normal pipeline / estimator
      >>> # with max_depth=9 and n_estimators=4
      >>> search.score(X_test, y_test)
      0.73...

  I don't really like it because I think it's way too much code for
  illustrating the grid searching (the current example is already too big
  IMO), and it's also a bad example because you don't care about scaling
  when using a RF. I've tried to think of real-case scenarios e.g. using an
  imputer, but that requires much more code, and that is irrelevant w.r.t.
  the point of the example, which is to illustrate grid search (in its
  simplest form). I'm still open to the idea, but unless we can have a
  simple example, I would be -1. We need simple examples in this guide else
  users will be discouraged right from the start.

* related: I added a note in

.. note::

    In practice, you almost always want to :ref:`search over a pipeline
    <composite_grid_search>`, instead of a single estimator. One of the main
    reasons is that if you apply a pre-processing step to the whole dataset
    without using a pipeline, and then perform any kind of cross-validation,
    you would be breaking the fundamental assumption of independence between
    training and testing data. Indeed, since you pre-processed the data
    using the whole dataset, some information about the test sets is
    available to the train sets. This will lead to over-estimating the
    generalization power of the estimator (you can read more in this `Kaggle
    post <https://www.kaggle.com/alexisbcook/data-leakage>`_).

    Using a pipeline for cross-validation and searching will largely keep
    you from this common pitfall.
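
To make the note above concrete, here is a minimal sketch of the naming
convention involved when searching over a pipeline (the pipeline and the
parameter chosen here are illustrative): each parameter of a step is
addressed as ``<step name>__<parameter name>``::

    >>> from sklearn.pipeline import make_pipeline
    >>> from sklearn.preprocessing import StandardScaler
    >>> from sklearn.linear_model import LogisticRegression
    ...
    >>> pipe = make_pipeline(StandardScaler(), LogisticRegression())
    >>> # make_pipeline names each step after its class, lowercased, so the
    >>> # regularization parameter C of the last step is exposed as:
    >>> 'logisticregression__C' in pipe.get_params()
    True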

Next steps
----------

We have briefly covered estimator fitting and predicting, pre-processing
steps, pipelines, cross-validation tools and automatic hyper-parameter
searches. This guide should give you an overview of some of the main
features of the library, but there is much more to ``scikit-learn``!

Please refer to our :ref:`user_guide` for details on all the tools that we
provide. You can also find an exhaustive list of the public API in the
:ref:`api_ref`.

You can also look at our numerous :ref:`examples <general_examples>` that
illustrate the use of ``scikit-learn`` in many different contexts.

The :ref:`tutorials <tutorial_menu>` also contain additional learning
resources.

@@ -349,6 +349,7 @@

   preface
   tutorial/index
   getting_started
   user_guide
   glossary
   auto_examples/index

Review discussion:

* Maybe it's not a bad idea to mention that it's the equivalent of the
  ``train`` method in some other libraries.
* isn't ``fit`` the most common term though? I mean, scikit-learn pretty
  much set the standard ^^