RFC: referring to Glossary to make parameter descriptions more focussed #10415

Closed
jnothman opened this issue Jan 8, 2018 · 9 comments

@jnothman
Member

jnothman commented Jan 8, 2018

I would like us to refer to the Glossary in the API reference for parameter descriptions that come up frequently, or which have associated caveats that are too long for parameter descriptions, most notably n_jobs and random_state.

So instead of something like:

    random_state : int, RandomState instance or None, optional, default: None
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.

in both KMeans and MiniBatchKMeans, we might have:

KMeans:
    random_state : int, RandomState instance or None (default)
        Determines random number generation for centroid initialization.
        See :term:`random_state`.


MiniBatchKMeans:
    random_state : int, RandomState instance or None (default)
        Determines random number generation for centroid initialization and
        random reassignment.  See :term:`random_state`.

One question is how much verbosity we should have in describing how the user may parametrise random_state. We could have just "See :term:`random_state`.", or we could have "An int seeds the random number generator deterministically, while None uses the current np.random state. See :term:`random_state`."
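
For reference, the three accepted forms can be sketched with NumPy alone. The helper name `resolve_random_state` is hypothetical and only mirrors the semantics described in the docstring quoted above; it is not presented as scikit-learn's implementation, and it relies on NumPy's private `np.random.mtrand._rand` attribute to reach the global generator.

```python
import numpy as np


def resolve_random_state(seed=None):
    """Hypothetical helper mirroring the docstring semantics quoted above."""
    if seed is None:
        # None: fall back to the global RandomState instance used by np.random
        # (private NumPy attribute, used here for illustration only).
        return np.random.mtrand._rand
    if isinstance(seed, (int, np.integer)):
        # int: seed a fresh generator, so results are reproducible.
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        # RandomState instance: use it as the generator directly.
        return seed
    raise ValueError(f"{seed!r} cannot be used to seed a RandomState")


# Two generators built from the same int produce identical draws.
a = resolve_random_state(0).randint(0, 100, size=5)
b = resolve_random_state(0).randint(0, 100, size=5)
same_draws = bool((a == b).all())

# An existing RandomState instance is passed through unchanged.
rng = np.random.RandomState(42)
passed_through = resolve_random_state(rng) is rng
```

With `None`, successive calls share one global generator, so results depend on whatever else has drawn from `np.random` in the same process.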

Just as I see us trying to describe what is random about the algorithm when describing random_state, I would like to see n_jobs stating whether parallelism is only in fit, or in fit and predict, and what backend is used by default.

What do others think?

@glemaitre
Member

IMO, I like "See :term:`random_state`.". I think it makes it possible to highlight where random_state or n_jobs is used inside the algorithm itself (as you mentioned).

On the side, I recall a discussion with @lesteve IRL, who pointed out to me that developers read docstrings directly from the terminal, where hyperlinks could make it difficult to find the detailed documentation.

However, as a bottom line, I would go for the less verbose version, since I personally think that newcomers are more likely to use the online (HTML) documentation, and that n_jobs and random_state are hugely redundant arguments whose use is probably trivial.

@lesteve
Member

lesteve commented Jan 15, 2018

> On the side, I recall a discussion with @lesteve IRL, who pointed out to me that developers read docstrings directly from the terminal, where hyperlinks could make it difficult to find the detailed documentation.

Fine with me. I guess people will need to learn that :term: means going to the glossary, and where the glossary lives.

@jnothman
Member Author

We could use "see the glossary"
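
As an illustration of that wording, the sketch below rewrites `:term:` roles into plain text for readers who see the raw docstring in a terminal. The function name `plain_text_terms` and the replacement wording are assumptions made for this example, not an existing scikit-learn utility, and the pattern only handles the simple :term:`name` form.

```python
import re

# Matches simple Sphinx glossary roles such as :term:`random_state`.
TERM_ROLE = re.compile(r":term:`([^`]+)`")


def plain_text_terms(docstring):
    """Replace :term:`x` with wording that makes sense outside HTML docs."""
    return TERM_ROLE.sub(r"'\1' in the glossary", docstring)


doc = (
    "random_state : int, RandomState instance or None (default)\n"
    "    Determines random number generation for centroid initialization.\n"
    "    See :term:`random_state`.\n"
)
rendered = plain_text_terms(doc)
print(rendered)
```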

@amueller
Member

The issue that I see with this is that many users look at the docs via Jupyter notebooks, and I don't think the links will work there, right?

@jnothman
Member Author

jnothman commented Feb 20, 2018 via email

@cmarmo
Contributor

cmarmo commented Jan 6, 2020

Dear core-devs, after discussion with @glemaitre, this issue is indeed a good candidate for sprints. It has been split into more specific issues. I'm wondering if we can close this one, since it is an RFC on which consensus has apparently been reached, in favour of #10548 and #14228, for which I'm trying to summarize the list of modules that still need an update. Also, I'm checking the memory parameter, which could probably benefit from the same clarification. @jnothman, WDYT? Thanks!

@cmarmo
Contributor

cmarmo commented Jan 6, 2020

> in favour of #10548 and #14228

... #14228 is indeed a bit different ... maybe I can open one specifically for referring to Glossary for n_jobs? Sorry for the noise... trying to clarify for beginners... like me...

@jnothman
Member Author

jnothman commented Jan 7, 2020

I don't think #14228 is so different, and yes I'm happy to close this as the work is covered by those other issues mostly. But to be sure, solving these issues for an estimator requires understanding how that estimator works, and how to investigate where the randomisation/parallelisation is used. It's quite a challenging issue for a newcomer (but not inherently requiring a lot of prior scikit-learn knowledge).

Yes, happy to see memory refer to Glossary, but it's much less frequent and much less ambiguous as to what it's used for.

@jnothman closed this as completed Jan 7, 2020

@Kshitij68
Contributor

The conclusion of these discussions was that we are avoiding duplication of docstrings by referring to the glossary.
The caveat, as pointed out by @amueller and @glemaitre, was that it is not possible to see this via the terminal or a Jupyter notebook.

One approach that we took in kartothek was to introduce decorators on top of each function that auto-fill the remaining parts of the docs. For e.g

If maintainers like this idea, I could propose a draft.
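
The decorator idea could look roughly like the sketch below. It is a hypothetical illustration and does not reproduce kartothek's actual code: the `fill_docs` decorator, the `SHARED_DOCS` table, the `init_centroids` function, and the `{random_state}` placeholder syntax are all assumptions made for this example.

```python
# Shared parameter descriptions, written once (hypothetical table).
SHARED_DOCS = {
    "random_state": (
        "If int, random_state is the seed used by the random number generator;\n"
        "        If RandomState instance, random_state is the random number generator;\n"
        "        If None, the random number generator is the RandomState instance used\n"
        "        by `np.random`."
    ),
}


def fill_docs(func):
    """Substitute {name} placeholders in func's docstring from SHARED_DOCS."""
    if func.__doc__:  # docstrings are None when Python runs with -OO
        for name, text in SHARED_DOCS.items():
            func.__doc__ = func.__doc__.replace("{" + name + "}", text)
    return func


@fill_docs
def init_centroids(X, n_clusters, random_state=None):
    """Pick initial centroids (body omitted in this sketch).

    Parameters
    ----------
    random_state : int, RandomState instance or None, optional, default: None
        Determines random number generation for centroid initialization.
        {random_state}
    """


# help(init_centroids) now shows the full text in a terminal or notebook.
print(init_centroids.__doc__)
```

Because the substitution happens at import time, the expanded text is what `help()` and notebook tooltips display, while the source keeps a single copy of the shared description.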
