Getting Started
===============

The purpose of this guide is to illustrate some of the main features that
``scikit-learn`` provides. It assumes a very basic working knowledge of
machine learning practices (model fitting, predicting, cross-validation,
etc.). Please refer to our :ref:`installation instructions
<installation-instructions>` for installing ``scikit-learn``.

``Scikit-learn`` is an open source machine learning library that supports
supervised and unsupervised learning. It also provides various tools for
model fitting, data preprocessing, model selection and evaluation, and many
other utilities.

Fitting and predicting: estimator basics
----------------------------------------

``Scikit-learn`` provides dozens of built-in machine learning algorithms and
models, called :term:`estimators`. Each estimator can be fitted to some data
using its :term:`fit` method.

Here is a simple example where we fit a
:class:`~sklearn.ensemble.RandomForestClassifier` to some very basic data::

  >>> from sklearn.ensemble import RandomForestClassifier
  >>> clf = RandomForestClassifier(random_state=0)
  >>> X = [[ 1,  2,  3],  # 2 samples, 3 features
  ...      [11, 12, 13]]
  >>> y = [0, 1]  # classes of each sample
  >>> clf.fit(X, y)
  RandomForestClassifier(random_state=0)

The :term:`fit` method generally accepts 2 inputs:

- The samples matrix (or design matrix) :term:`X`. The size of ``X``
  is typically ``(n_samples, n_features)``, which means that samples are
  represented as rows and features are represented as columns.
- The target values :term:`y`, which are real numbers for regression tasks, or
  integers for classification (or any other discrete set of values). For
  unsupervised learning tasks, ``y`` does not need to be specified. ``y`` is
  usually a 1d array where the ``i`` th entry corresponds to the target of the
  ``i`` th sample (row) of ``X``.

Both ``X`` and ``y`` are usually expected to be numpy arrays or equivalent
:term:`array-like` data types, though some estimators work with other
formats such as sparse matrices.
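
As a quick illustration (a minimal sketch, not part of the original guide),
the same small dataset can be passed as a numpy array, and tree-based
estimators such as :class:`~sklearn.ensemble.RandomForestClassifier` also
accept a scipy sparse matrix::

  >>> import numpy as np
  >>> from scipy.sparse import csr_matrix
  >>> from sklearn.ensemble import RandomForestClassifier
  >>> X_dense = np.array([[ 1,  2,  3],   # shape (n_samples, n_features)
  ...                     [11, 12, 13]])
  >>> y = np.array([0, 1])                # one target per sample
  >>> clf = RandomForestClassifier(random_state=0).fit(X_dense, y)
  >>> clf.predict(csr_matrix(X_dense))    # same data, as a sparse matrix
  array([0, 1])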

Once the estimator is fitted, it can be used for predicting target values of
new data. You don't need to re-train the estimator::

  >>> clf.predict(X)  # predict classes of the training data
  array([0, 1])
  >>> clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data
  array([0, 1])

Transformers and pre-processors
-------------------------------

Machine learning workflows are often composed of different parts. A typical
pipeline consists of a pre-processing step that transforms or imputes the
data, and a final predictor that predicts target values.

In ``scikit-learn``, pre-processors and transformers follow the same API as
the estimator objects (they actually all inherit from the same
``BaseEstimator`` class). The transformer objects don't have a
:term:`predict` method but rather a :term:`transform` method that outputs a
newly transformed sample matrix ``X``::

  >>> from sklearn.preprocessing import StandardScaler
  >>> X = [[0, 15],
  ...      [1, -10]]
  >>> StandardScaler().fit(X).transform(X)
  array([[-1.,  1.],
         [ 1., -1.]])

Sometimes, you want to apply different transformations to different features:
the :ref:`ColumnTransformer <column_transformer>` is designed for these
use-cases.
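
For instance (a minimal sketch with made-up column names, not part of the
original guide), a :class:`~sklearn.compose.ColumnTransformer` can scale a
numerical column while one-hot encoding a categorical one::

  >>> import pandas as pd
  >>> from sklearn.compose import ColumnTransformer
  >>> from sklearn.preprocessing import StandardScaler, OneHotEncoder
  >>> df = pd.DataFrame({'age': [10., 20., 30.],
  ...                    'city': ['London', 'Paris', 'London']})
  >>> ct = ColumnTransformer(
  ...     [('scaled', StandardScaler(), ['age']),    # scale the numerical column
  ...      ('onehot', OneHotEncoder(), ['city'])])   # encode the categorical column
  >>> ct.fit_transform(df).shape  # 1 scaled column + 2 one-hot columns
  (3, 3)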

Pipelines: chaining pre-processors and estimators
-------------------------------------------------

Transformers and estimators (predictors) can be combined together into a
single unifying object: a :class:`~sklearn.pipeline.Pipeline`. The pipeline
offers the same API as a regular estimator: it can be fitted and used for
prediction with ``fit`` and ``predict``. As we will see later, using a
pipeline will also protect you from data leakage, i.e. disclosing some
testing data in your training data.

In the following example, we :ref:`load the Iris dataset <datasets>`, split it
into train and test sets, and compute the accuracy score of a pipeline on
the test data::

  >>> from sklearn.preprocessing import StandardScaler
  >>> from sklearn.linear_model import LogisticRegression
  >>> from sklearn.pipeline import make_pipeline
  >>> from sklearn.datasets import load_iris
  >>> from sklearn.model_selection import train_test_split
  >>> from sklearn.metrics import accuracy_score
  ...
  >>> # create a pipeline object
  >>> pipe = make_pipeline(
  ...     StandardScaler(),
  ...     LogisticRegression(random_state=0)
  ... )
  ...
  >>> # load the iris dataset and split it into train and test sets
  >>> X, y = load_iris(return_X_y=True)
  >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
  ...
  >>> # fit the whole pipeline
  >>> pipe.fit(X_train, y_train)
  Pipeline(steps=[('standardscaler', StandardScaler()),
                  ('logisticregression', LogisticRegression(random_state=0))])
  >>> # we can now use it like any other estimator
  >>> accuracy_score(pipe.predict(X_test), y_test)
  0.97...

Model evaluation
----------------

Fitting a model to some data does not entail that it will predict well on
unseen data. This needs to be directly evaluated. We have just seen the
:func:`~sklearn.model_selection.train_test_split` helper that splits a
dataset into train and test sets, but ``scikit-learn`` provides many other
tools for model evaluation, in particular for :ref:`cross-validation
<cross_validation>`.

Here we briefly show how to perform a 5-fold cross-validation procedure,
using the :func:`~sklearn.model_selection.cross_validate` helper. Note that
it is also possible to manually iterate over the folds, use different
data splitting strategies, and use custom scoring functions. Please refer to
our :ref:`User Guide <cross_validation>` for more details::

  >>> from sklearn.datasets import make_regression
  >>> from sklearn.linear_model import LinearRegression
  >>> from sklearn.model_selection import cross_validate
  ...
  >>> X, y = make_regression(n_samples=1000, random_state=0)
  >>> lr = LinearRegression()
  ...
  >>> result = cross_validate(lr, X, y)  # defaults to 5-fold CV
  >>> result['test_score']  # r_squared score is high because dataset is easy
  array([1., 1., 1., 1., 1.])

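Continuing from the example above, a different splitting strategy and scoring
metric can be passed to the same helper (a minimal sketch; the splitter and
metric below are just illustrative choices)::

  >>> from sklearn.model_selection import KFold
  >>> cv = KFold(n_splits=3, shuffle=True, random_state=0)  # custom splitting strategy
  >>> result = cross_validate(lr, X, y, cv=cv, scoring='neg_mean_absolute_error')
  >>> sorted(result.keys())  # cross_validate returns a dict of arrays
  ['fit_time', 'score_time', 'test_score']
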
Automatic parameter searches
----------------------------

All estimators have parameters (often called hyper-parameters in the
literature) that can be tuned. The generalization power of an estimator
often critically depends on a few parameters. For example, a
:class:`~sklearn.ensemble.RandomForestRegressor` has an ``n_estimators``
parameter that determines the number of trees in the forest, and a
``max_depth`` parameter that determines the maximum depth of each tree.
Quite often, it is not clear what the exact values of these parameters
should be since they depend on the data at hand.
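
The current parameter values of any estimator can be inspected with its
``get_params`` method (a minimal sketch added here for illustration)::

  >>> from sklearn.ensemble import RandomForestRegressor
  >>> rf = RandomForestRegressor(n_estimators=50, max_depth=5)
  >>> rf.get_params()['max_depth']  # parameters are plain constructor arguments
  5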

``Scikit-learn`` provides tools to automatically find the best parameter
combinations (via cross-validation). In the following example, we randomly
search over the parameter space of a random forest with a
:class:`~sklearn.model_selection.RandomizedSearchCV` object. When the search
is over, the :class:`~sklearn.model_selection.RandomizedSearchCV` behaves as
a :class:`~sklearn.ensemble.RandomForestRegressor` that has been fitted with
the best set of parameters. Read more in the :ref:`User Guide
<grid_search>`::

  >>> from sklearn.datasets import fetch_california_housing
  >>> from sklearn.ensemble import RandomForestRegressor
  >>> from sklearn.model_selection import RandomizedSearchCV
  >>> from sklearn.model_selection import train_test_split
  >>> from scipy.stats import randint
  ...
  >>> X, y = fetch_california_housing(return_X_y=True)
  >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
  ...
  >>> # define the parameter space that will be searched over
  >>> param_distributions = {'n_estimators': randint(1, 5),
  ...                        'max_depth': randint(5, 10)}
  ...
  >>> # now create a searchCV object and fit it to the data
  >>> search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
  ...                             n_iter=5,
  ...                             param_distributions=param_distributions,
  ...                             random_state=0)
  >>> search.fit(X_train, y_train)
  RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5,
                     param_distributions={'max_depth': ...,
                                          'n_estimators': ...},
                     random_state=0)
  >>> search.best_params_
  {'max_depth': 9, 'n_estimators': 4}

  >>> # the search object now acts like a normal random forest estimator
  >>> # with max_depth=9 and n_estimators=4
  >>> search.score(X_test, y_test)
  0.73...

.. note::

    In practice, you almost always want to :ref:`search over a pipeline
    <composite_grid_search>`, instead of a single estimator. One of the main
    reasons is that if you apply a pre-processing step to the whole dataset
    without using a pipeline, and then perform any kind of cross-validation,
    you would be breaking the fundamental assumption of independence between
    training and testing data. Indeed, since you pre-processed the data
    using the whole dataset, some information about the test sets is
    available to the train sets. This will lead to over-estimating the
    generalization power of the estimator (you can read more in this `Kaggle
    post <https://www.kaggle.com/alexisbcook/data-leakage>`_).

    Using a pipeline for cross-validation and searching will largely keep
    you from this common pitfall.

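To make this concrete, here is a minimal sketch (added for illustration, not
part of the original example) of the same randomized search run over a
pipeline; the ``stepname__parameter`` syntax routes each entry of
``param_distributions`` to the corresponding step::

  >>> from sklearn.pipeline import make_pipeline
  >>> from sklearn.preprocessing import StandardScaler
  >>> from sklearn.ensemble import RandomForestRegressor
  >>> from sklearn.model_selection import RandomizedSearchCV
  >>> from scipy.stats import randint
  >>> pipe = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=0))
  >>> param_distributions = {
  ...     'randomforestregressor__n_estimators': randint(1, 5),
  ...     'randomforestregressor__max_depth': randint(5, 10)}
  >>> search = RandomizedSearchCV(pipe, param_distributions, n_iter=5,
  ...                             random_state=0)

This ``search`` object can then be fitted and scored exactly like the one
above, with the scaling step re-fitted on the training portion of each
cross-validation split only, which is what avoids the leakage described in
the note.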

Next steps
----------

We have briefly covered estimator fitting and predicting, pre-processing
steps, pipelines, cross-validation tools and automatic hyper-parameter
searches. This guide should give you an overview of some of the main
features of the library, but there is much more to ``scikit-learn``!

Please refer to our :ref:`user_guide` for details on all the tools that we
provide. You can also find an exhaustive list of the public API in the
:ref:`api_ref`.

You can also look at our numerous :ref:`examples <general_examples>` that
illustrate the use of ``scikit-learn`` in many different contexts.

The :ref:`tutorials <tutorial_menu>` also contain additional learning
resources.