.. _faq:

- ===========================
+ ==========================
Frequently Asked Questions
- ===========================
+ ==========================

.. currentmodule:: sklearn

@@ -44,25 +44,25 @@ suite of the specific module of interest for more details.
Implementation decisions
------------------------

- Why is there no support for deep or reinforcement learning / Will there be support for deep or reinforcement learning in scikit-learn ?
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ Why is there no support for deep or reinforcement learning? Will there be such support in the future?
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Deep learning and reinforcement learning both require a rich vocabulary to
define an architecture, with deep learning additionally requiring
GPUs for efficient computing. However, neither of these fit within
- the design constraints of scikit-learn; as a result, deep learning
+ the design constraints of scikit-learn. As a result, deep learning
and reinforcement learning are currently out of scope for what
scikit-learn seeks to achieve.

- You can find more information about addition of gpu support at
+ You can find more information about the addition of GPU support at
`Will you add GPU support?`_.

Note that scikit-learn currently implements a simple multilayer perceptron
in :mod:`sklearn.neural_network`. We will only accept bug fixes for this module.
If you want to implement more complex deep learning models, please turn to
popular deep learning frameworks such as
`tensorflow <https://www.tensorflow.org/>`_,
- `keras <https://keras.io/>`_
+ `keras <https://keras.io/>`_,
and `pytorch <https://pytorch.org/>`_.

.. _adding_graphical_models:
@@ -85,12 +85,12 @@ do structured prediction:
* `pystruct <https://pystruct.github.io/>`_ handles general structured
  learning (focuses on SSVMs on arbitrary graph structures with
  approximate inference; defines the notion of sample as an instance of
-   the graph structure)
+   the graph structure).

* `seqlearn <https://larsmans.github.io/seqlearn/>`_ handles sequences only
  (focuses on exact inference; has HMMs, but mostly for the sake of
  completeness; treats a feature vector as a sample and uses an offset encoding
-   for the dependencies between feature vectors)
+   for the dependencies between feature vectors).

Why did you remove HMMs from scikit-learn?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -112,14 +112,14 @@ Why do categorical variables need preprocessing in scikit-learn, compared to oth

Most of scikit-learn assumes data is in NumPy arrays or SciPy sparse matrices
of a single numeric dtype. These do not explicitly represent categorical
- variables at present. Thus, unlike R's data.frames or pandas.DataFrame, we
- require explicit conversion of categorical features to numeric values, as
+ variables at present. Thus, unlike R's ``data.frames`` or :class:`pandas.DataFrame`,
+ we require explicit conversion of categorical features to numeric values, as
discussed in :ref:`preprocessing_categorical_features`.
See also :ref:`sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py` for an
example of working with heterogeneous (e.g. categorical and numeric) data.
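
For illustration, a minimal sketch of such an explicit conversion with
:class:`~preprocessing.OneHotEncoder` (the toy column below is made up, and
``.toarray()`` just keeps the sketch version-independent)::

    >>> import numpy as np
    >>> from sklearn.preprocessing import OneHotEncoder
    >>> X = np.array([["red"], ["green"], ["red"]], dtype=object)
    >>> OneHotEncoder().fit_transform(X).toarray()
    array([[0., 1.],
           [1., 0.],
           [0., 1.]])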

- Why does Scikit-learn not directly work with, for example, pandas.DataFrame?
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ Why does scikit-learn not directly work with, for example, :class:`pandas.DataFrame`?
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The homogeneous NumPy and SciPy data objects currently expected are most
efficient to process for most operations. Extensive work would also be needed
@@ -130,33 +130,29 @@ data structures.
Note however that :class:`~sklearn.compose.ColumnTransformer` makes it
convenient to handle heterogeneous pandas dataframes by mapping homogeneous subsets of
dataframe columns selected by name or dtype to dedicated scikit-learn transformers.
-
Therefore :class:`~sklearn.compose.ColumnTransformer` is often used in the first
step of scikit-learn pipelines when dealing
with heterogeneous dataframes (see :ref:`pipeline` for more details).

See also :ref:`sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py`
for an example of working with heterogeneous (e.g. categorical and numeric) data.

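As a further illustration, a hedged sketch (the column names are invented) of
a :class:`~compose.ColumnTransformer` dispatching dataframe columns to
dedicated transformers::

    >>> import pandas as pd
    >>> from sklearn.compose import ColumnTransformer
    >>> from sklearn.preprocessing import OneHotEncoder, StandardScaler
    >>> X = pd.DataFrame({"city": ["Paris", "London"], "temp": [24.0, 18.5]})
    >>> ct = ColumnTransformer([
    ...     ("cat", OneHotEncoder(), ["city"]),    # categorical column
    ...     ("num", StandardScaler(), ["temp"])])  # numeric column
    >>> ct.fit_transform(X).shape
    (2, 3)
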
- Do you plan to implement transform for target y in a pipeline?
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Currently transform only works for features X in a pipeline.
- There's a long-standing discussion about
- not being able to transform y in a pipeline.
- Follow on github issue
- `#4143 <https://github.com/scikit-learn/scikit-learn/issues/4143>`_.
- Meanwhile check out
+ Do you plan to implement transform for target ``y`` in a pipeline?
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ Currently transform only works for features ``X`` in a pipeline. There's a
+ long-standing discussion about not being able to transform ``y`` in a pipeline.
+ Follow on GitHub issue :issue:`4143`. Meanwhile, you can check out
:class:`~compose.TransformedTargetRegressor`,
`pipegraph <https://github.com/mcasl/PipeGraph>`_,
- `imbalanced-learn <https://github.com/scikit-learn-contrib/imbalanced-learn>`_.
- Note that Scikit-learn solved for the case where y
+ and `imbalanced-learn <https://github.com/scikit-learn-contrib/imbalanced-learn>`_.
+ Note that scikit-learn solved for the case where ``y``
has an invertible transformation applied before training
- and inverted after prediction. Scikit-learn intends to solve for
- use cases where y should be transformed at training time
- and not at test time, for resampling and similar uses,
- like at `imbalanced-learn`.
+ and inverted after prediction. scikit-learn intends to solve for
+ use cases where ``y`` should be transformed at training time
+ and not at test time, for resampling and similar uses, as in
+ `imbalanced-learn <https://github.com/scikit-learn-contrib/imbalanced-learn>`_.
In general, these use cases can be solved
- with a custom meta estimator rather than a Pipeline
+ with a custom meta estimator rather than a :class:`~pipeline.Pipeline`.

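For the invertible-transformation case, a small sketch (the ``log``/``exp``
pair is only illustrative)::

    >>> import numpy as np
    >>> from sklearn.compose import TransformedTargetRegressor
    >>> from sklearn.linear_model import LinearRegression
    >>> X = np.arange(1, 5).reshape(-1, 1)
    >>> y = np.exp(X.ravel())                     # log(y) is linear in X
    >>> reg = TransformedTargetRegressor(regressor=LinearRegression(),
    ...                                  func=np.log, inverse_func=np.exp)
    >>> np.allclose(reg.fit(X, y).predict(X), y)  # transform, fit, invert
    True
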
Why are there so many different estimators for linear models?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -174,16 +170,17 @@ each other. Let us have a look at
- :class:`~linear_model.Ridge`, L2 penalty
- :class:`~linear_model.Lasso`, L1 penalty (sparse models)
- :class:`~linear_model.ElasticNet`, L1 + L2 penalty (less sparse models)
- - :class:`~linear_model.SGDRegressor` with `loss='squared_loss'`
+ - :class:`~linear_model.SGDRegressor` with `loss="squared_loss"`

**Maintainer perspective:**
They all do in principle the same and are different only by the penalty they
impose. This, however, has a large impact on the way the underlying
optimization problem is solved. In the end, this amounts to usage of different
- methods and tricks from linear algebra. A special case is `SGDRegressor` which
+ methods and tricks from linear algebra. A special case is
+ :class:`~linear_model.SGDRegressor` which
comprises all 4 previous models and is different by the optimization procedure.
A further side effect is that the different estimators favor different data
- layouts (`X` c-contiguous or f-contiguous, sparse csr or csc). This complexity
+ layouts (`X` C-contiguous or F-contiguous, sparse CSR or CSC). This complexity
of the seemingly simple linear models is the reason for having different
estimator classes for different penalties.

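As a rough sketch of this equivalence (the alpha rescaling is approximate, and
the comparison is skipped as a doctest since SGD only reaches the Ridge
solution up to optimization tolerance)::

    >>> import numpy as np
    >>> from sklearn.linear_model import Ridge, SGDRegressor
    >>> rng = np.random.RandomState(0)
    >>> X, y = rng.randn(100, 3), rng.randn(100)
    >>> ridge = Ridge(alpha=1.0).fit(X, y)  # direct solver
    >>> # SGDRegressor averages the data-fit term over the 100 samples, so a
    >>> # comparable L2 strength is roughly alpha / n_samples.
    >>> sgd = SGDRegressor(penalty="l2", alpha=1.0 / 100, max_iter=10000,
    ...                    tol=1e-8, random_state=0).fit(X, y)
    >>> np.allclose(ridge.coef_, sgd.coef_, atol=1e-2)  # doctest: +SKIP
    True
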
@@ -230,8 +227,8 @@ this reason.

.. _new_algorithms_inclusion_criteria:

- What are the inclusion criteria for new algorithms ?
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ What are the inclusion criteria for new algorithms?
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We only consider well-established algorithms for inclusion. A rule of thumb is
at least 3 years since publication, 200+ citations, and wide use and
@@ -256,8 +253,8 @@ Inclusion of a new algorithm speeding up an existing model is easier if:
- it does not introduce new hyper-parameters (as it makes the library
  more future-proof),
- it is easy to document clearly when the contribution improves the speed
-   and when it does not, for instance "when n_features >>
-   n_samples",
+   and when it does not, for instance, "when ``n_features >>
+   n_samples``",
- benchmarks clearly show a speed up.

Also, note that your implementation need not be in scikit-learn to be used
@@ -282,7 +279,7 @@ at which point the original author might long have lost interest.
See also :ref:`new_algorithms_inclusion_criteria`. For a great read about
long-term maintenance issues in open-source software, look at
`the Executive Summary of Roads and Bridges
- <https://www.fordfoundation.org/media/2976/roads-and-bridges-the-unseen-labor-behind-our-digital-infrastructure.pdf#page=8>`_
+ <https://www.fordfoundation.org/media/2976/roads-and-bridges-the-unseen-labor-behind-our-digital-infrastructure.pdf#page=8>`_.


Using scikit-learn
@@ -299,30 +296,28 @@ with the ``[scikit-learn]`` and ``[python]`` tags. You can alternatively use the

Please make sure to include a minimal reproduction code snippet (ideally shorter
than 10 lines) that highlights your problem on a toy dataset (for instance from
- ``sklearn.datasets`` or randomly generated with functions of ``numpy.random`` with
+ :mod:`sklearn.datasets` or randomly generated with functions of ``numpy.random`` with
a fixed random seed). Please remove any line of code that is not necessary to
reproduce your problem.

The problem should be reproducible by simply copy-pasting your code snippet in a Python
shell with scikit-learn installed. Do not forget to include the import statements.
-
More guidance to write good reproduction code snippets can be found at:
-
- https://stackoverflow.com/help/mcve
+ https://stackoverflow.com/help/mcve.
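
For illustration, a made-up snippet in that spirit (toy data, fixed seed,
imports included)::

    >>> import numpy as np
    >>> from sklearn.linear_model import LogisticRegression
    >>> rng = np.random.RandomState(0)
    >>> X, y = rng.randn(20, 3), rng.randint(0, 2, 20)
    >>> LogisticRegression().fit(X, y).predict(X).shape
    (20,)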

If your problem raises an exception that you do not understand (even after googling it),
please make sure to include the full traceback that you obtain when running the
reproduction script.

For bug reports or feature requests, please make use of the
`issue tracker on GitHub <https://github.com/scikit-learn/scikit-learn/issues>`_.
-
There is also a `scikit-learn Gitter channel
<https://gitter.im/scikit-learn/scikit-learn>`_ where some users and developers
might be found.

- **Please do not email any authors directly to ask for assistance, report bugs,
- or for any other issue related to scikit-learn.**
+ .. warning::
+     Please do not email any authors directly to ask for assistance, report bugs,
+     or for any other issue related to scikit-learn.

How should I save, export or deploy estimators for production?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -336,15 +331,15 @@ Bunch objects are sometimes used as an output for functions and methods. They
extend dictionaries by enabling values to be accessed by key,
`bunch["value_key"]`, or by an attribute, `bunch.value_key`.

- They should not be used as an input; therefore you almost never need to create
- a ``Bunch`` object, unless you are extending the scikit-learn's API.
+ They should not be used as an input. Therefore you almost never need to create
+ a :class:`~utils.Bunch` object, unless you are extending scikit-learn's API.

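For instance, the built-in dataset loaders return such an object; a small
sketch::

    >>> from sklearn.datasets import load_iris
    >>> iris = load_iris()                          # a Bunch
    >>> iris["target_names"] is iris.target_names   # key or attribute access
    True
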
How can I load my own datasets into a format usable by scikit-learn?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Generally, scikit-learn works on any numeric data stored as numpy arrays
or scipy sparse matrices. Other types that are convertible to numeric
- arrays such as pandas DataFrame are also acceptable.
+ arrays such as :class:`pandas.DataFrame` are also acceptable.

For more information on loading your data files into these usable data
structures, please refer to :ref:`loading external datasets <external_datasets>`.
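
For instance, a minimal sketch (the file name is invented) of feeding a CSV
file to an estimator via NumPy::

    >>> import numpy as np
    >>> from sklearn.cluster import KMeans
    >>> X = np.loadtxt("my_data.csv", delimiter=",")  # doctest: +SKIP
    >>> labels = KMeans(n_clusters=2).fit_predict(X)  # doctest: +SKIP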
@@ -363,23 +358,23 @@ For more general feature extraction from any kind of data, see

Another common case is when you have non-numerical data and a custom distance
(or similarity) metric on these data. Examples include strings with edit
- distance (aka. Levenshtein distance; e.g., DNA or RNA sequences). These can be
+ distance (aka. Levenshtein distance), for instance, DNA or RNA sequences. These can be
encoded as numbers, but doing so is painful and error-prone. Working with
distance metrics on arbitrary data can be done in two ways.

Firstly, many estimators take precomputed distance/similarity matrices, so if
the dataset is not too large, you can compute distances for all pairs of inputs.
If the dataset is large, you can use feature vectors with only one "feature",
which is an index into a separate data structure, and supply a custom metric
- function that looks up the actual data in this data structure. E.g., to use
- DBSCAN with Levenshtein distances::
+ function that looks up the actual data in this data structure. For instance, to use
+ :func:`~cluster.dbscan` with Levenshtein distances::

-     >>> from leven import levenshtein  # doctest: +SKIP
    >>> import numpy as np
+     >>> from leven import levenshtein  # doctest: +SKIP
    >>> from sklearn.cluster import dbscan
    >>> data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]
    >>> def lev_metric(x, y):
-     ...     i, j = int(x[0]), int(y[0]) # extract indices
+     ...     i, j = int(x[0]), int(y[0])  # extract indices
    ...     return levenshtein(data[i], data[j])
    ...
    >>> X = np.arange(len(data)).reshape(-1, 1)
@@ -389,25 +384,24 @@ DBSCAN with Levenshtein distances::
           [2]])
    >>> # We need to specify algorithm='brute' as the default assumes
    >>> # a continuous feature space.
-     >>> dbscan(X, metric=lev_metric, eps=5, min_samples=2, algorithm='brute')
-     ... # doctest: +SKIP
-     ([0, 1], array([ 0,  0, -1]))
-
- (This uses the third-party edit distance package ``leven``.)
+     >>> dbscan(X, metric=lev_metric, eps=5, min_samples=2, algorithm='brute')  # doctest: +SKIP
+     (array([0, 1]), array([ 0,  0, -1]))

- Similar tricks can be used, with some care, for tree kernels, graph kernels,
- etc.
+ Note that the example above uses the third-party edit distance package
+ `leven <https://pypi.org/project/leven/>`_. Similar tricks can be used,
+ with some care, for tree kernels, graph kernels, etc.

- Why do I sometime get a crash/freeze with n_jobs > 1 under OSX or Linux?
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ Why do I sometimes get a crash/freeze with ``n_jobs > 1`` under OSX or Linux?
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Several scikit-learn tools such as ``GridSearchCV`` and ``cross_val_score``
- rely internally on Python's `multiprocessing` module to parallelize execution
+ Several scikit-learn tools such as :class:`~model_selection.GridSearchCV` and
+ :func:`~model_selection.cross_val_score` rely internally on Python's
+ :mod:`multiprocessing` module to parallelize execution
onto several Python processes by passing ``n_jobs > 1`` as an argument.

- The problem is that Python ``multiprocessing`` does a ``fork`` system call
+ The problem is that Python :mod:`multiprocessing` does a ``fork`` system call
without following it with an ``exec`` system call for performance reasons. Many
- libraries like (some versions of) Accelerate / vecLib under OSX, (some versions
+ libraries like (some versions of) Accelerate or vecLib under OSX, (some versions
of) MKL, the OpenMP runtime of GCC, nvidia's Cuda (and probably many others),
manage their own internal thread pool. Upon a call to `fork`, the thread pool
state in the child process is corrupted: the thread pool believes it has many
@@ -418,30 +412,30 @@ main since 0.2.10) and we contributed a `patch
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60035>`_ to GCC's OpenMP runtime
(not yet reviewed).

- But in the end the real culprit is Python's ``multiprocessing`` that does
+ But in the end the real culprit is Python's :mod:`multiprocessing` that does
``fork`` without ``exec`` to reduce the overhead of starting and using new
Python processes for parallel computing. Unfortunately this is a violation of
the POSIX standard and therefore some software editors like Apple refuse to
- consider the lack of fork-safety in Accelerate / vecLib as a bug.
+ consider the lack of fork-safety in Accelerate and vecLib as a bug.

- In Python 3.4+ it is now possible to configure ``multiprocessing`` to
- use the 'forkserver' or 'spawn' start methods (instead of the default
- 'fork') to manage the process pools. To work around this issue when
+ In Python 3.4+ it is now possible to configure :mod:`multiprocessing` to
+ use the ``"forkserver"`` or ``"spawn"`` start methods (instead of the default
+ ``"fork"``) to manage the process pools. To work around this issue when
using scikit-learn, you can set the ``JOBLIB_START_METHOD`` environment
- variable to 'forkserver'. However the user should be aware that using
- the 'forkserver' method prevents joblib.Parallel to call function
+ variable to ``"forkserver"``. However the user should be aware that using
+ the ``"forkserver"`` method prevents :class:`joblib.Parallel` from calling functions
interactively defined in a shell session.

- If you have custom code that uses ``multiprocessing`` directly instead of using
- it via joblib you can enable the 'forkserver' mode globally for your
- program: Insert the following instructions in your main script::
+ If you have custom code that uses :mod:`multiprocessing` directly instead of using
+ it via :mod:`joblib` you can enable the ``"forkserver"`` mode globally for your
+ program. Insert the following instructions in your main script::

    import multiprocessing

    # other imports, custom code, load data, define model...

-     if __name__ == '__main__':
-         multiprocessing.set_start_method('forkserver')
+     if __name__ == "__main__":
+         multiprocessing.set_start_method("forkserver")

    # call scikit-learn utils with n_jobs > 1 here

@@ -450,20 +444,20 @@ documentation <https://docs.python.org/3/library/multiprocessing.html#contexts-a

.. _faq_mkl_threading:

- Why does my job use more cores than specified with n_jobs?
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ Why does my job use more cores than specified with ``n_jobs``?
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is because ``n_jobs`` only controls the number of jobs for
- routines that are parallelized with ``joblib``, but parallel code can come
+ routines that are parallelized with :mod:`joblib`, but parallel code can come
from other sources:

- some routines may be parallelized with OpenMP (for code written in C or
-   Cython).
+   Cython),
- scikit-learn relies a lot on numpy, which in turn may rely on numerical
  libraries like MKL, OpenBLAS or BLIS which can provide parallel
  implementations.

- For more details, please refer to our :ref:`Parallelism notes <parallelism>`.
+ For more details, please refer to our :ref:`notes on parallelism <parallelism>`.

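As an illustration, one way to cap such library-level threads is the
third-party `threadpoolctl <https://github.com/joblib/threadpoolctl>`_
package (skipped here since it may not be installed)::

    >>> import numpy as np
    >>> from sklearn.linear_model import Ridge
    >>> from threadpoolctl import threadpool_limits  # doctest: +SKIP
    >>> X, y = np.random.rand(1000, 10), np.random.rand(1000)
    >>> with threadpool_limits(limits=1):            # doctest: +SKIP
    ...     _ = Ridge().fit(X, y)                    # BLAS runs single-threaded
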
How do I set a ``random_state`` for an entire execution?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^