
Commit 0dc0b19

Charlie-XIAO authored and glemaitre committed

DOC nitpicks on the FAQ page (#28272)

1 parent 3c4644f commit 0dc0b19

1 file changed: doc/faq.rst (76 additions, 82 deletions)
@@ -1,8 +1,8 @@
 .. _faq:
 
-===========================
+==========================
 Frequently Asked Questions
-===========================
+==========================
 
 .. currentmodule:: sklearn
 
@@ -44,25 +44,25 @@ suite of the specific module of interest for more details.
 Implementation decisions
 ------------------------
 
-Why is there no support for deep or reinforcement learning / Will there be support for deep or reinforcement learning in scikit-learn?
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Why is there no support for deep or reinforcement learning? Will there be such support in the future?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Deep learning and reinforcement learning both require a rich vocabulary to
 define an architecture, with deep learning additionally requiring
 GPUs for efficient computing. However, neither of these fit within
-the design constraints of scikit-learn; as a result, deep learning
+the design constraints of scikit-learn. As a result, deep learning
 and reinforcement learning are currently out of scope for what
 scikit-learn seeks to achieve.
 
-You can find more information about addition of gpu support at
+You can find more information about the addition of GPU support at
 `Will you add GPU support?`_.
 
 Note that scikit-learn currently implements a simple multilayer perceptron
 in :mod:`sklearn.neural_network`. We will only accept bug fixes for this module.
 If you want to implement more complex deep learning models, please turn to
 popular deep learning frameworks such as
 `tensorflow <https://www.tensorflow.org/>`_,
-`keras <https://keras.io/>`_
+`keras <https://keras.io/>`_,
 and `pytorch <https://pytorch.org/>`_.
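As an aside, the multilayer perceptron mentioned in this hunk can be tried in a
few lines; a minimal sketch on invented toy data (real use needs far more
samples)::

    from sklearn.neural_network import MLPClassifier

    # Invented toy data: two points, two classes.
    X = [[0.0, 0.0], [1.0, 1.0]]
    y = [0, 1]

    clf = MLPClassifier(hidden_layer_sizes=(5,), max_iter=1000, random_state=0)
    clf.fit(X, y)
    print(clf.predict([[2.0, 2.0]]))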
 
 .. _adding_graphical_models:
@@ -85,12 +85,12 @@ do structured prediction:
 * `pystruct <https://pystruct.github.io/>`_ handles general structured
   learning (focuses on SSVMs on arbitrary graph structures with
   approximate inference; defines the notion of sample as an instance of
-  the graph structure)
+  the graph structure).
 
 * `seqlearn <https://larsmans.github.io/seqlearn/>`_ handles sequences only
   (focuses on exact inference; has HMMs, but mostly for the sake of
   completeness; treats a feature vector as a sample and uses an offset encoding
-  for the dependencies between feature vectors)
+  for the dependencies between feature vectors).
 
 Why did you remove HMMs from scikit-learn?
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -112,14 +112,14 @@ Why do categorical variables need preprocessing in scikit-learn, compared to oth
 
 Most of scikit-learn assumes data is in NumPy arrays or SciPy sparse matrices
 of a single numeric dtype. These do not explicitly represent categorical
-variables at present. Thus, unlike R's data.frames or pandas.DataFrame, we
-require explicit conversion of categorical features to numeric values, as
+variables at present. Thus, unlike R's ``data.frames`` or :class:`pandas.DataFrame`,
+we require explicit conversion of categorical features to numeric values, as
 discussed in :ref:`preprocessing_categorical_features`.
 See also :ref:`sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py` for an
 example of working with heterogeneous (e.g. categorical and numeric) data.
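For a concrete picture of that explicit conversion, here is a minimal sketch
using :class:`~preprocessing.OneHotEncoder` on an invented toy column (not
taken from the linked example)::

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    # Invented toy column with three categories.
    df = pd.DataFrame({"color": ["red", "green", "red", "blue"]})

    enc = OneHotEncoder()
    X = enc.fit_transform(df[["color"]]).toarray()  # default output is sparse
    print(enc.categories_)  # learned category order
    print(X)                # one 0/1 column per category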
 
-Why does Scikit-learn not directly work with, for example, pandas.DataFrame?
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Why does scikit-learn not directly work with, for example, :class:`pandas.DataFrame`?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 The homogeneous NumPy and SciPy data objects currently expected are most
 efficient to process for most operations. Extensive work would also be needed
@@ -130,33 +130,29 @@ data structures.
 Note however that :class:`~sklearn.compose.ColumnTransformer` makes it
 convenient to handle heterogeneous pandas dataframes by mapping homogeneous subsets of
 dataframe columns selected by name or dtype to dedicated scikit-learn transformers.
-
 Therefore :class:`~sklearn.compose.ColumnTransformer` are often used in the first
 step of scikit-learn pipelines when dealing
 with heterogeneous dataframes (see :ref:`pipeline` for more details).
 
 See also :ref:`sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py`
 for an example of working with heterogeneous (e.g. categorical and numeric) data.
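As a rough illustration of that first-pipeline-step pattern (invented toy data,
not the linked example), a :class:`~compose.ColumnTransformer` might be wired
up like this::

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Invented heterogeneous dataframe: one categorical, one numeric column.
    df = pd.DataFrame({"city": ["Paris", "London", "Paris", "Berlin"],
                       "temp": [14.0, 11.0, 15.0, 9.0]})
    y = [1, 0, 1, 0]

    # Map each column subset to a dedicated transformer, then feed a classifier.
    pre = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
        ("num", StandardScaler(), ["temp"]),
    ])
    model = Pipeline([("pre", pre), ("clf", LogisticRegression())])
    model.fit(df, y)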
 
-Do you plan to implement transform for target y in a pipeline?
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Currently transform only works for features X in a pipeline.
-There's a long-standing discussion about
-not being able to transform y in a pipeline.
-Follow on github issue
-`#4143 <https://github.com/scikit-learn/scikit-learn/issues/4143>`_.
-Meanwhile check out
+Do you plan to implement transform for target ``y`` in a pipeline?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Currently transform only works for features ``X`` in a pipeline. There's a
+long-standing discussion about not being able to transform ``y`` in a pipeline.
+Follow on GitHub issue :issue:`4143`. Meanwhile, you can check out
 :class:`~compose.TransformedTargetRegressor`,
 `pipegraph <https://github.com/mcasl/PipeGraph>`_,
-`imbalanced-learn <https://github.com/scikit-learn-contrib/imbalanced-learn>`_.
-Note that Scikit-learn solved for the case where y
+and `imbalanced-learn <https://github.com/scikit-learn-contrib/imbalanced-learn>`_.
+Note that scikit-learn solved for the case where ``y``
 has an invertible transformation applied before training
-and inverted after prediction. Scikit-learn intends to solve for
-use cases where y should be transformed at training time
-and not at test time, for resampling and similar uses,
-like at `imbalanced-learn`.
+and inverted after prediction. scikit-learn intends to solve for
+use cases where ``y`` should be transformed at training time
+and not at test time, for resampling and similar uses, like at
+`imbalanced-learn <https://github.com/scikit-learn-contrib/imbalanced-learn>`_.
 In general, these use cases can be solved
-with a custom meta estimator rather than a Pipeline
+with a custom meta estimator rather than a :class:`~pipeline.Pipeline`.
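For the invertible-transformation case that is described as already solved, a
minimal sketch with :class:`~compose.TransformedTargetRegressor` (toy data
invented here) could be::

    import numpy as np
    from sklearn.compose import TransformedTargetRegressor
    from sklearn.linear_model import LinearRegression

    rng = np.random.RandomState(0)
    X = rng.uniform(1, 10, size=(50, 1))
    y = np.exp(X.ravel() / 5.0)  # invented target on a multiplicative scale

    # y is log-transformed before fitting and exp-inverted after predicting.
    reg = TransformedTargetRegressor(regressor=LinearRegression(),
                                     func=np.log, inverse_func=np.exp)
    reg.fit(X, y)
    print(reg.predict(X[:3]))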
 
 Why are there so many different estimators for linear models?
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -174,16 +170,17 @@ each other. Let us have a look at
 - :class:`~linear_model.Ridge`, L2 penalty
 - :class:`~linear_model.Lasso`, L1 penalty (sparse models)
 - :class:`~linear_model.ElasticNet`, L1 + L2 penalty (less sparse models)
-- :class:`~linear_model.SGDRegressor` with `loss='squared_loss'`
+- :class:`~linear_model.SGDRegressor` with `loss="squared_loss"`
 
 **Maintainer perspective:**
 They all do in principle the same and are different only by the penalty they
 impose. This, however, has a large impact on the way the underlying
 optimization problem is solved. In the end, this amounts to usage of different
-methods and tricks from linear algebra. A special case is `SGDRegressor` which
+methods and tricks from linear algebra. A special case is
+:class:`~linear_model.SGDRegressor` which
 comprises all 4 previous models and is different by the optimization procedure.
 A further side effect is that the different estimators favor different data
-layouts (`X` c-contiguous or f-contiguous, sparse csr or csc). This complexity
+layouts (`X` C-contiguous or F-contiguous, sparse csr or csc). This complexity
 of the seemingly simple linear models is the reason for having different
 estimator classes for different penalties.
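To make the comparison concrete, the four estimators can be fit side by side;
a minimal sketch on random data (the penalty strengths are arbitrary)::

    import numpy as np
    from sklearn.linear_model import ElasticNet, Lasso, Ridge, SGDRegressor

    rng = np.random.RandomState(0)
    X, y = rng.normal(size=(100, 5)), rng.normal(size=100)

    # Same model family, different penalties and optimization procedures.
    for est in (Ridge(alpha=1.0),                     # L2, direct solvers
                Lasso(alpha=0.1),                     # L1, coordinate descent
                ElasticNet(alpha=0.1, l1_ratio=0.5),  # L1 + L2
                SGDRegressor(penalty="l2", random_state=0)):  # stochastic gradient
        est.fit(X, y)
        print(type(est).__name__, np.round(est.coef_[:2], 3))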
 
@@ -230,8 +227,8 @@ this reason.
 
 .. _new_algorithms_inclusion_criteria:
 
-What are the inclusion criteria for new algorithms ?
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+What are the inclusion criteria for new algorithms?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 We only consider well-established algorithms for inclusion. A rule of thumb is
 at least 3 years since publication, 200+ citations, and wide use and
@@ -256,8 +253,8 @@ Inclusion of a new algorithm speeding up an existing model is easier if:
 - it does not introduce new hyper-parameters (as it makes the library
   more future-proof),
 - it is easy to document clearly when the contribution improves the speed
-  and when it does not, for instance "when n_features >>
-  n_samples",
+  and when it does not, for instance, "when ``n_features >>
+  n_samples``",
 - benchmarks clearly show a speed up.
 
 Also, note that your implementation need not be in scikit-learn to be used
@@ -282,7 +279,7 @@ at which point the original author might long have lost interest.
 See also :ref:`new_algorithms_inclusion_criteria`. For a great read about
 long-term maintenance issues in open-source software, look at
 `the Executive Summary of Roads and Bridges
-<https://www.fordfoundation.org/media/2976/roads-and-bridges-the-unseen-labor-behind-our-digital-infrastructure.pdf#page=8>`_
+<https://www.fordfoundation.org/media/2976/roads-and-bridges-the-unseen-labor-behind-our-digital-infrastructure.pdf#page=8>`_.
 
 
 Using scikit-learn
@@ -299,30 +296,28 @@ with the ``[scikit-learn]`` and ``[python]`` tags. You can alternatively use the
 
 Please make sure to include a minimal reproduction code snippet (ideally shorter
 than 10 lines) that highlights your problem on a toy dataset (for instance from
-``sklearn.datasets`` or randomly generated with functions of ``numpy.random`` with
+:mod:`sklearn.datasets` or randomly generated with functions of ``numpy.random`` with
 a fixed random seed). Please remove any line of code that is not necessary to
 reproduce your problem.
 
 The problem should be reproducible by simply copy-pasting your code snippet in a Python
 shell with scikit-learn installed. Do not forget to include the import statements.
-
 More guidance to write good reproduction code snippets can be found at:
-
-https://stackoverflow.com/help/mcve
+https://stackoverflow.com/help/mcve.
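For instance, a reproduction snippet of the kind requested above might look
like the following (the estimator and the random data are placeholders)::

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)          # fixed seed so others see the same data
    X = rng.normal(size=(20, 3))
    y = rng.randint(0, 2, size=20)

    clf = LogisticRegression().fit(X, y)
    print(clf.score(X, y))                  # show the behavior you are asking about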
 
 If your problem raises an exception that you do not understand (even after googling it),
 please make sure to include the full traceback that you obtain when running the
 reproduction script.
 
 For bug reports or feature requests, please make use of the
 `issue tracker on GitHub <https://github.com/scikit-learn/scikit-learn/issues>`_.
-
 There is also a `scikit-learn Gitter channel
 <https://gitter.im/scikit-learn/scikit-learn>`_ where some users and developers
 might be found.
 
-**Please do not email any authors directly to ask for assistance, report bugs,
-or for any other issue related to scikit-learn.**
+.. warning::
+    Please do not email any authors directly to ask for assistance, report bugs,
+    or for any other issue related to scikit-learn.
 
 How should I save, export or deploy estimators for production?
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -336,15 +331,15 @@ Bunch objects are sometimes used as an output for functions and methods. They
 extend dictionaries by enabling values to be accessed by key,
 `bunch["value_key"]`, or by an attribute, `bunch.value_key`.
 
-They should not be used as an input; therefore you almost never need to create
-a ``Bunch`` object, unless you are extending the scikit-learn's API.
+They should not be used as an input. Therefore you almost never need to create
+a :class:`~utils.Bunch` object, unless you are extending scikit-learn's API.
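A quick illustration of the two access styles, using the Bunch returned by
:func:`~datasets.load_iris` (a minimal sketch)::

    from sklearn.datasets import load_iris

    bunch = load_iris()  # load_iris returns a Bunch
    # Key access and attribute access reach the same value.
    print(bunch["feature_names"] == bunch.feature_names)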
 
 How can I load my own datasets into a format usable by scikit-learn?
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Generally, scikit-learn works on any numeric data stored as numpy arrays
 or scipy sparse matrices. Other types that are convertible to numeric
-arrays such as pandas DataFrame are also acceptable.
+arrays such as :class:`pandas.DataFrame` are also acceptable.
 
 For more information on loading your data files into these usable data
 structures, please refer to :ref:`loading external datasets <external_datasets>`.
@@ -363,23 +358,23 @@ For more general feature extraction from any kind of data, see
 
 Another common case is when you have non-numerical data and a custom distance
 (or similarity) metric on these data. Examples include strings with edit
-distance (aka. Levenshtein distance; e.g., DNA or RNA sequences). These can be
+distance (aka. Levenshtein distance), for instance, DNA or RNA sequences. These can be
 encoded as numbers, but doing so is painful and error-prone. Working with
 distance metrics on arbitrary data can be done in two ways.
 
 Firstly, many estimators take precomputed distance/similarity matrices, so if
 the dataset is not too large, you can compute distances for all pairs of inputs.
 If the dataset is large, you can use feature vectors with only one "feature",
 which is an index into a separate data structure, and supply a custom metric
-function that looks up the actual data in this data structure. E.g., to use
-DBSCAN with Levenshtein distances::
+function that looks up the actual data in this data structure. For instance, to use
+:class:`~cluster.dbscan` with Levenshtein distances::
 
-    >>> from leven import levenshtein       # doctest: +SKIP
     >>> import numpy as np
+    >>> from leven import levenshtein  # doctest: +SKIP
     >>> from sklearn.cluster import dbscan
     >>> data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]
     >>> def lev_metric(x, y):
-    ...     i, j = int(x[0]), int(y[0])     # extract indices
+    ...     i, j = int(x[0]), int(y[0])  # extract indices
     ...     return levenshtein(data[i], data[j])
     ...
     >>> X = np.arange(len(data)).reshape(-1, 1)
@@ -389,25 +384,24 @@ DBSCAN with Levenshtein distances::
            [2]])
     >>> # We need to specify algorithm='brute' as the default assumes
     >>> # a continuous feature space.
-    >>> dbscan(X, metric=lev_metric, eps=5, min_samples=2, algorithm='brute')
-    ... # doctest: +SKIP
-    ([0, 1], array([ 0,  0, -1]))
-
-(This uses the third-party edit distance package ``leven``.)
+    >>> dbscan(X, metric=lev_metric, eps=5, min_samples=2, algorithm='brute')  # doctest: +SKIP
+    (array([0, 1]), array([ 0,  0, -1]))
 
-Similar tricks can be used, with some care, for tree kernels, graph kernels,
-etc.
+Note that the example above uses the third-party edit distance package
+`leven <https://pypi.org/project/leven/>`_. Similar tricks can be used,
+with some care, for tree kernels, graph kernels, etc.
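The first approach mentioned above, precomputed distance matrices, might look
like this minimal sketch (with an ordinary Euclidean matrix on invented data,
just to show the plumbing)::

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.metrics import pairwise_distances

    rng = np.random.RandomState(0)
    X = rng.normal(size=(10, 2))

    # Precompute all pairwise distances once, then hand the matrix to the estimator.
    D = pairwise_distances(X, metric="euclidean")
    labels = DBSCAN(eps=1.0, min_samples=2, metric="precomputed").fit_predict(D)
    print(labels)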
 
-Why do I sometime get a crash/freeze with n_jobs > 1 under OSX or Linux?
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Why do I sometime get a crash/freeze with ``n_jobs > 1`` under OSX or Linux?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Several scikit-learn tools such as ``GridSearchCV`` and ``cross_val_score``
-rely internally on Python's `multiprocessing` module to parallelize execution
+Several scikit-learn tools such as :class:`~model_selection.GridSearchCV` and
+:class:`~model_selection.cross_val_score` rely internally on Python's
+:mod:`multiprocessing` module to parallelize execution
 onto several Python processes by passing ``n_jobs > 1`` as an argument.
 
-The problem is that Python ``multiprocessing`` does a ``fork`` system call
+The problem is that Python :mod:`multiprocessing` does a ``fork`` system call
 without following it with an ``exec`` system call for performance reasons. Many
-libraries like (some versions of) Accelerate / vecLib under OSX, (some versions
+libraries like (some versions of) Accelerate or vecLib under OSX, (some versions
 of) MKL, the OpenMP runtime of GCC, nvidia's Cuda (and probably many others),
 manage their own internal thread pool. Upon a call to `fork`, the thread pool
 state in the child process is corrupted: the thread pool believes it has many
@@ -418,30 +412,30 @@ main since 0.2.10) and we contributed a `patch
 <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60035>`_ to GCC's OpenMP runtime
 (not yet reviewed).
 
-But in the end the real culprit is Python's ``multiprocessing`` that does
+But in the end the real culprit is Python's :mod:`multiprocessing` that does
 ``fork`` without ``exec`` to reduce the overhead of starting and using new
 Python processes for parallel computing. Unfortunately this is a violation of
 the POSIX standard and therefore some software editors like Apple refuse to
-consider the lack of fork-safety in Accelerate / vecLib as a bug.
+consider the lack of fork-safety in Accelerate and vecLib as a bug.
 
-In Python 3.4+ it is now possible to configure ``multiprocessing`` to
-use the 'forkserver' or 'spawn' start methods (instead of the default
-'fork') to manage the process pools. To work around this issue when
+In Python 3.4+ it is now possible to configure :mod:`multiprocessing` to
+use the ``"forkserver"`` or ``"spawn"`` start methods (instead of the default
+``"fork"``) to manage the process pools. To work around this issue when
 using scikit-learn, you can set the ``JOBLIB_START_METHOD`` environment
-variable to 'forkserver'. However the user should be aware that using
-the 'forkserver' method prevents joblib.Parallel to call function
+variable to ``"forkserver"``. However the user should be aware that using
+the ``"forkserver"`` method prevents :class:`joblib.Parallel` to call function
 interactively defined in a shell session.
 
-If you have custom code that uses ``multiprocessing`` directly instead of using
-it via joblib you can enable the 'forkserver' mode globally for your
-program: Insert the following instructions in your main script::
+If you have custom code that uses :mod:`multiprocessing` directly instead of using
+it via :mod:`joblib` you can enable the ``"forkserver"`` mode globally for your
+program. Insert the following instructions in your main script::
 
     import multiprocessing
 
     # other imports, custom code, load data, define model...
 
-    if __name__ == '__main__':
-        multiprocessing.set_start_method('forkserver')
+    if __name__ == "__main__":
+        multiprocessing.set_start_method("forkserver")
 
     # call scikit-learn utils with n_jobs > 1 here
 
@@ -450,20 +444,20 @@ documentation <https://docs.python.org/3/library/multiprocessing.html#contexts-a
 
 .. _faq_mkl_threading:
 
-Why does my job use more cores than specified with n_jobs?
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Why does my job use more cores than specified with ``n_jobs``?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 This is because ``n_jobs`` only controls the number of jobs for
-routines that are parallelized with ``joblib``, but parallel code can come
+routines that are parallelized with :mod:`joblib`, but parallel code can come
 from other sources:
 
 - some routines may be parallelized with OpenMP (for code written in C or
-  Cython).
+  Cython),
 - scikit-learn relies a lot on numpy, which in turn may rely on numerical
   libraries like MKL, OpenBLAS or BLIS which can provide parallel
   implementations.
 
-For more details, please refer to our :ref:`Parallelism notes <parallelism>`.
+For more details, please refer to our :ref:`notes on parallelism <parallelism>`.
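If you need to cap those extra threads from BLAS or OpenMP, one common
workaround, assuming the third-party
`threadpoolctl <https://pypi.org/project/threadpoolctl/>`_ package, is a
sketch like::

    import numpy as np
    from threadpoolctl import threadpool_limits

    from sklearn.linear_model import Ridge

    rng = np.random.RandomState(0)
    X, y = rng.normal(size=(1000, 50)), rng.normal(size=1000)

    # Limit BLAS/OpenMP thread pools inside this block, independently of n_jobs.
    with threadpool_limits(limits=1):
        Ridge().fit(X, y)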
 
 How do I set a ``random_state`` for an entire execution?
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
