Commit 57cb9e8

NicolasHug authored and jnothman committed
[MRG] DOC New Getting Started guide (#14920)

1 parent bab5926 commit 57cb9e8

File tree

3 files changed: +233 -1 lines changed

doc/documentation.rst

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ Documentation of scikit-learn |version|
   <!-- row -->
   <div class="row-fluid">
   <div class="span4 box">
-  <h2><a href="tutorial/basic/tutorial.html">Quick Start</a></h2>
+  <h2><a href="getting_started.html">Getting Started</a></h2>
   <blockquote>A very short introduction into machine learning
   problems and how to solve them using scikit-learn.
   Presents basic concepts and conventions.

doc/getting_started.rst

Lines changed: 231 additions & 0 deletions
@@ -0,0 +1,231 @@
Getting Started
===============

The purpose of this guide is to illustrate some of the main features that
``scikit-learn`` provides. It assumes a very basic working knowledge of
machine learning practices (model fitting, predicting, cross-validation,
etc.). Please refer to our :ref:`installation instructions
<installation-instructions>` for installing ``scikit-learn``.

``Scikit-learn`` is an open source machine learning library that supports
supervised and unsupervised learning. It also provides various tools for
model fitting, data preprocessing, model selection and evaluation, and many
other utilities.

Fitting and predicting: estimator basics
----------------------------------------

``Scikit-learn`` provides dozens of built-in machine learning algorithms and
models, called :term:`estimators`. Each estimator can be fitted to some data
using its :term:`fit` method.

Here is a simple example where we fit a
:class:`~sklearn.ensemble.RandomForestClassifier` to some very basic data::

  >>> from sklearn.ensemble import RandomForestClassifier
  >>> clf = RandomForestClassifier(random_state=0)
  >>> X = [[ 1,  2,  3],  # 2 samples, 3 features
  ...      [11, 12, 13]]
  >>> y = [0, 1]  # classes of each sample
  >>> clf.fit(X, y)
  RandomForestClassifier(random_state=0)

The :term:`fit` method generally accepts 2 inputs:

- The samples matrix (or design matrix) :term:`X`. The size of ``X``
  is typically ``(n_samples, n_features)``, which means that samples are
  represented as rows and features are represented as columns.
- The target values :term:`y`, which are real numbers for regression tasks, or
  integers for classification (or any other discrete set of values). For
  unsupervised learning tasks, ``y`` does not need to be specified. ``y`` is
  usually a 1d array where the ``i`` th entry corresponds to the target of the
  ``i`` th sample (row) of ``X``.

Both ``X`` and ``y`` are usually expected to be numpy arrays or equivalent
:term:`array-like` data types, though some estimators work with other
formats such as sparse matrices.
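
For instance, estimators that support sparse input can be fitted on a
``scipy.sparse`` matrix directly. Here is a minimal sketch (the toy data is
made up for illustration)::

  >>> import numpy as np
  >>> from scipy.sparse import csr_matrix
  >>> from sklearn.linear_model import LogisticRegression
  ...
  >>> # a small sparse samples matrix; most entries are zero
  >>> X_sparse = csr_matrix(np.array([[0., 1.], [1., 0.], [0., 2.], [3., 0.]]))
  >>> y = np.array([0, 1, 0, 1])
  >>> LogisticRegression().fit(X_sparse, y)  # sparse input is accepted as-is
  LogisticRegression()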

Once the estimator is fitted, it can be used for predicting target values of
new data. You don't need to re-train the estimator::

  >>> clf.predict(X)  # predict classes of the training data
  array([0, 1])
  >>> clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data
  array([0, 1])

Transformers and pre-processors
-------------------------------

Machine learning workflows are often composed of different parts. A typical
pipeline consists of a pre-processing step that transforms or imputes the
data, and a final predictor that predicts target values.

In ``scikit-learn``, pre-processors and transformers follow the same API as
the estimator objects (they actually all inherit from the same
``BaseEstimator`` class). The transformer objects don't have a
:term:`predict` method but rather a :term:`transform` method that outputs a
newly transformed sample matrix ``X``::

  >>> from sklearn.preprocessing import StandardScaler
  >>> X = [[0, 15],
  ...      [1, -10]]
  >>> StandardScaler().fit(X).transform(X)
  array([[-1.,  1.],
         [ 1., -1.]])

Sometimes, you want to apply different transformations to different features:
the :ref:`ColumnTransformer <column_transformer>` is designed for these
use cases.
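
For illustration, here is a minimal sketch (assuming ``pandas`` is installed;
the column names and values are made up) where a numeric column is scaled
while a categorical column is one-hot encoded::

  >>> import pandas as pd
  >>> from sklearn.compose import ColumnTransformer
  >>> from sklearn.preprocessing import OneHotEncoder, StandardScaler
  ...
  >>> X = pd.DataFrame({'age': [20, 30, 40],
  ...                   'city': ['Paris', 'London', 'Paris']})
  >>> ct = ColumnTransformer(
  ...     [('scaled', StandardScaler(), ['age']),    # scale the numeric column
  ...      ('onehot', OneHotEncoder(), ['city'])])   # encode the categorical column
  >>> ct.fit_transform(X).shape  # 1 scaled column + 2 one-hot columns
  (3, 3)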

Pipelines: chaining pre-processors and estimators
--------------------------------------------------

Transformers and estimators (predictors) can be combined together into a
single unifying object: a :class:`~sklearn.pipeline.Pipeline`. The pipeline
offers the same API as a regular estimator: it can be fitted and used for
prediction with ``fit`` and ``predict``. As we will see later, using a
pipeline will also protect you from data leakage, i.e. disclosing some
testing data in your training data.

In the following example, we :ref:`load the Iris dataset <datasets>`, split it
into train and test sets, and compute the accuracy score of a pipeline on
the test data::

  >>> from sklearn.preprocessing import StandardScaler
  >>> from sklearn.linear_model import LogisticRegression
  >>> from sklearn.pipeline import make_pipeline
  >>> from sklearn.datasets import load_iris
  >>> from sklearn.model_selection import train_test_split
  >>> from sklearn.metrics import accuracy_score
  ...
  >>> # create a pipeline object
  >>> pipe = make_pipeline(
  ...     StandardScaler(),
  ...     LogisticRegression(random_state=0)
  ... )
  ...
  >>> # load the iris dataset and split it into train and test sets
  >>> X, y = load_iris(return_X_y=True)
  >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
  ...
  >>> # fit the whole pipeline
  >>> pipe.fit(X_train, y_train)
  Pipeline(steps=[('standardscaler', StandardScaler()),
                  ('logisticregression', LogisticRegression(random_state=0))])
  >>> # we can now use it like any other estimator
  >>> accuracy_score(pipe.predict(X_test), y_test)
  0.97...

Model evaluation
----------------

Fitting a model to some data does not entail that it will predict well on
unseen data. This needs to be directly evaluated. We have just seen the
:func:`~sklearn.model_selection.train_test_split` helper that splits a
dataset into train and test sets, but ``scikit-learn`` provides many other
tools for model evaluation, in particular for :ref:`cross-validation
<cross_validation>`.

Here we briefly show how to perform a 5-fold cross-validation procedure,
using the :func:`~sklearn.model_selection.cross_validate` helper. Note that
it is also possible to manually iterate over the folds, use different
data splitting strategies, and use custom scoring functions. Please refer to
our :ref:`User Guide <cross_validation>` for more details::

  >>> from sklearn.datasets import make_regression
  >>> from sklearn.linear_model import LinearRegression
  >>> from sklearn.model_selection import cross_validate
  ...
  >>> X, y = make_regression(n_samples=1000, random_state=0)
  >>> lr = LinearRegression()
  ...
  >>> result = cross_validate(lr, X, y)  # defaults to 5-fold CV
  >>> result['test_score']  # r_squared score is high because dataset is easy
  array([1., 1., 1., 1., 1.])
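
For instance, here is a minimal sketch (the splitter and metric are arbitrary
choices) that re-uses the estimator above with a 3-fold
:class:`~sklearn.model_selection.KFold` splitter and a different scoring
metric::

  >>> from sklearn.model_selection import KFold
  ...
  >>> cv = KFold(n_splits=3, shuffle=True, random_state=0)
  >>> result = cross_validate(lr, X, y, cv=cv,
  ...                         scoring='neg_mean_absolute_error')
  >>> result['test_score'].shape  # one score per fold
  (3,)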

Automatic parameter searches
----------------------------

All estimators have parameters (often called hyper-parameters in the
literature) that can be tuned. The generalization power of an estimator
often critically depends on a few parameters. For example, a
:class:`~sklearn.ensemble.RandomForestRegressor` has an ``n_estimators``
parameter that determines the number of trees in the forest, and a
``max_depth`` parameter that determines the maximum depth of each tree.
Quite often, it is not clear what the exact values of these parameters
should be since they depend on the data at hand.

``Scikit-learn`` provides tools to automatically find the best parameter
combinations (via cross-validation). In the following example, we randomly
search over the parameter space of a random forest with a
:class:`~sklearn.model_selection.RandomizedSearchCV` object. When the search
is over, the :class:`~sklearn.model_selection.RandomizedSearchCV` behaves as
a :class:`~sklearn.ensemble.RandomForestRegressor` that has been fitted with
the best set of parameters. Read more in the :ref:`User Guide
<grid_search>`::

  >>> from sklearn.datasets import fetch_california_housing
  >>> from sklearn.ensemble import RandomForestRegressor
  >>> from sklearn.model_selection import RandomizedSearchCV
  >>> from sklearn.model_selection import train_test_split
  >>> from scipy.stats import randint
  ...
  >>> X, y = fetch_california_housing(return_X_y=True)
  >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
  ...
  >>> # define the parameter space that will be searched over
  >>> param_distributions = {'n_estimators': randint(1, 5),
  ...                        'max_depth': randint(5, 10)}
  ...
  >>> # now create a searchCV object and fit it to the data
  >>> search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
  ...                             n_iter=5,
  ...                             param_distributions=param_distributions,
  ...                             random_state=0)
  >>> search.fit(X_train, y_train)
  RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5,
                     param_distributions={'max_depth': ...,
                                          'n_estimators': ...},
                     random_state=0)
  >>> search.best_params_
  {'max_depth': 9, 'n_estimators': 4}

  >>> # the search object now acts like a normal random forest estimator
  >>> # with max_depth=9 and n_estimators=4
  >>> search.score(X_test, y_test)
  0.73...

.. note::

    In practice, you almost always want to :ref:`search over a pipeline
    <composite_grid_search>`, instead of a single estimator. One of the main
    reasons is that if you apply a pre-processing step to the whole dataset
    without using a pipeline, and then perform any kind of cross-validation,
    you would be breaking the fundamental assumption of independence between
    training and testing data. Indeed, since you pre-processed the data
    using the whole dataset, some information about the test sets is
    available to the train sets. This will lead to over-estimating the
    generalization power of the estimator (you can read more in this `kaggle
    post <https://www.kaggle.com/alexisbcook/data-leakage>`_).

    Using a pipeline for cross-validation and searching will largely keep
    you from this common pitfall.
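
For illustration, here is a minimal sketch of such a search (the estimator and
grid values are arbitrary choices): parameters of a pipeline step are
addressed as ``<step name>__<parameter name>``::

  >>> from sklearn.datasets import make_regression
  >>> from sklearn.linear_model import Ridge
  >>> from sklearn.model_selection import GridSearchCV
  >>> from sklearn.pipeline import make_pipeline
  >>> from sklearn.preprocessing import StandardScaler
  ...
  >>> X, y = make_regression(n_samples=100, random_state=0)
  >>> pipe = make_pipeline(StandardScaler(), Ridge())
  >>> param_grid = {'ridge__alpha': [0.1, 1.0, 10.0]}  # '<step>__<param>' syntax
  >>> search = GridSearchCV(pipe, param_grid=param_grid)
  >>> _ = search.fit(X, y)  # the scaler is re-fitted on each training split only
  >>> list(search.best_params_)
  ['ridge__alpha']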

Next steps
----------

We have briefly covered estimator fitting and predicting, pre-processing
steps, pipelines, cross-validation tools and automatic hyper-parameter
searches. This guide should give you an overview of some of the main
features of the library, but there is much more to ``scikit-learn``!

Please refer to our :ref:`user_guide` for details on all the tools that we
provide. You can also find an exhaustive list of the public API in the
:ref:`api_ref`.

You can also look at our numerous :ref:`examples <general_examples>` that
illustrate the use of ``scikit-learn`` in many different contexts.

The :ref:`tutorials <tutorial_menu>` also contain additional learning
resources.

doc/index.rst

Lines changed: 1 addition & 0 deletions
@@ -349,6 +349,7 @@

    preface
    tutorial/index
+   getting_started
    user_guide
    glossary
    auto_examples/index