Make random_state descriptions more informative and refer to Glossary #10548

Closed
71 of 74 tasks
jnothman opened this issue Jan 29, 2018 · 60 comments
Labels
Documentation good first issue Easy with clear instructions to resolve Moderate Anything that requires some knowledge of conventions and best practices Sprint

Comments

@jnothman
Member

jnothman commented Jan 29, 2018

We recently added a Glossary to our documentation, which describes common parameters among other things. We should now replace descriptions of random_state parameters to make them more concise and informative (see #10415). For example, instead of

    random_state : int, RandomState instance or None, optional, default: None
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.

in both KMeans and MiniBatchKMeans, we might have:

KMeans:
    random_state : int, RandomState instance, default=None
        Determines random number generation for centroid initialization.
        Pass an int for reproducible results across multiple function calls.
        See :term:`Glossary <random_state>`.


MiniBatchKMeans:
    random_state : int, RandomState instance, default=None
        Determines random number generation for centroid initialization and
        random reassignment.
        Pass an int for reproducible results across multiple function calls.
        See :term:`Glossary <random_state>`.

In short, each description should focus on how random_state affects the algorithm.
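To make the three cases from the old docstring and the proposed wording concrete, here is a minimal sketch. Both `resolve_random_state` and `pick_initial_centroids` are hypothetical names invented for illustration and are not scikit-learn code. scikit-learn ships a similar utility, `sklearn.utils.check_random_state`, but the helper below is an independent approximation that only assumes NumPy is installed.

```python
import numpy as np


def resolve_random_state(random_state=None):
    """Hypothetical helper mirroring the three cases in the old docstring."""
    if random_state is None:
        # None: fall back to the global RandomState instance used by `np.random`.
        return np.random.mtrand._rand
    if isinstance(random_state, (int, np.integer)):
        # int: used as the seed of a fresh generator.
        return np.random.RandomState(random_state)
    if isinstance(random_state, np.random.RandomState):
        # RandomState instance: used as the generator itself.
        return random_state
    raise ValueError(f"{random_state!r} cannot be used to seed a RandomState")


def pick_initial_centroids(X, n_clusters, random_state=None):
    """Pick `n_clusters` distinct rows of `X` as starting centroids.

    Parameters
    ----------
    random_state : int, RandomState instance, default=None
        Determines random number generation for centroid initialization.
        Pass an int for reproducible results across multiple function calls.
        See :term:`Glossary <random_state>`.
    """
    rng = resolve_random_state(random_state)
    # Sample row indices without replacement so the centroids are distinct rows.
    indices = rng.choice(X.shape[0], size=n_clusters, replace=False)
    return X[indices]


X = np.arange(20, dtype=float).reshape(10, 2)
first = pick_initial_centroids(X, n_clusters=3, random_state=0)
second = pick_initial_centroids(X, n_clusters=3, random_state=0)
# Same int seed on both calls, so both calls select the same rows.
print(np.array_equal(first, second))
```

The docstring here says what the randomness is used for (centroid initialization) and leaves the int / RandomState / None mechanics to the Glossary, which is the style this issue asks for.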

Contributors interested in contributing this change should take on one module at a time, initially.

The list of estimators to be modified is the following:

List of files to modify using kwinata script
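The script mentioned above is not reproduced in this thread. As a hypothetical stand-in (not the original script), a scan like the following could produce a similar "file - line numbers" list. It assumes that parameter entries follow the numpydoc form `random_state : ...`; the function names and the regular expression are illustrative choices.

```python
import re
from pathlib import Path

# Assumed pattern: a numpydoc parameter entry such as "random_state : int, ...".
PARAM_LINE = re.compile(r"^\s*random_state\s*:")


def find_random_state_docs(root):
    """Return {relative file path: [1-based line numbers]} for files under `root`."""
    root = Path(root)
    hits = {}
    for path in sorted(root.rglob("*.py")):
        lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        numbers = [i for i, line in enumerate(lines, start=1) if PARAM_LINE.match(line)]
        if numbers:
            hits[path.relative_to(root).as_posix()] = numbers
    return hits


def format_report(hits):
    """Render one 'path - n1, n2' line per file."""
    return [f"{path} - {', '.join(map(str, nums))}" for path, nums in hits.items()]


if __name__ == "__main__":
    # "sklearn" is an assumed path to a local checkout; adjust as needed.
    for line in format_report(find_random_state_docs("sklearn")):
        print(line)
```

A plain regular expression will also match non-docstring lines that happen to start with `random_state :`, so the output is a list of candidates to review by hand rather than a definitive inventory.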

@jnothman jnothman added Documentation good first issue Easy with clear instructions to resolve help wanted labels Jan 29, 2018
@aby0
Contributor

aby0 commented Jan 29, 2018

Hi @jnothman, Can I take this issue? Thanks

@jnothman
Member Author

jnothman commented Jan 29, 2018 via email

@aby0
Contributor

aby0 commented Jan 29, 2018

@jnothman I am sorry for being naive, but can you elaborate on what you mean by module/submodule? Are you referring to a sub-package, like KMeans for instance?

@lesteve
Member

lesteve commented Jan 29, 2018

I think what @jnothman means is to just start with one file, for example sklearn/cluster/k_means_.py, update the random_state docstring as in the top post, and open a PR.

@jnothman
Member Author

jnothman commented Jan 29, 2018 via email

@aby0
Contributor

aby0 commented Jan 30, 2018

Thanks. Will do that and open a PR.

@ghost

ghost commented Jan 30, 2018

Hi! @jnothman

Would you also like to replace the following comments as seen in grid_search.py? They have an extra line as compared to the one shared by you.

    random_state : int, RandomState instance or None, optional (default=None)
        Pseudo random number generator state used for random uniform sampling
        from lists of possible values instead of scipy.stats distributions.
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.

@ghost

ghost commented Jan 30, 2018

I can take grid_search.py and k_means.py (KMeans).

@jnothman
Member Author

jnothman commented Jan 30, 2018 via email

@ghost

ghost commented Jan 31, 2018

Thanks @jnothman. Will I need to understand these algorithms before I can replace this random_state information?

@jnothman
Member Author

jnothman commented Jan 31, 2018 via email

@ghost

ghost commented Feb 3, 2018

Okay, thank you. I will start going through the algorithms slowly.

Regards,
Shivam Rastogi

@ghost

ghost commented Feb 10, 2018

Since @aby0 has not claimed the sklearn.cluster module yet, I would like to claim the whole module. Please let me know whether I can work on it or should work on something else.

@ghost

ghost commented Feb 15, 2018

Any update, guys? It is a long holiday for us, so let me know if I can pick this up.

@richford
Contributor

I'll take the datasets module since I'm already poking around in the docstrings there for #10731.

@SirR4T
Contributor

SirR4T commented Aug 23, 2018

I'm claiming the linear_model module. Will raise a PR soon. #11900 raised.

Claiming decomposition module next.

SirR4T added a commit to SirR4T/scikit-learn that referenced this issue Aug 23, 2018
@jnothman
Member Author

jnothman commented Aug 23, 2018 via email

@jnothman
Member Author

So do pay attention to the prior PRs merged above

@SirR4T
Contributor

SirR4T commented Aug 23, 2018

@jnothman thanks! Will update the PRs to mention the reproducibility when passing an int.

@DatenBiene
Contributor

Claim sklearn/ensemble/_weight_boosting.py - 188, 324, 479, 900, 1022

@DatenBiene
Contributor

Claim sklearn/multioutput.py - 578, 738

@DatenBiene
Contributor

Claim :
sklearn/mixture/_bayesian_mixture.py - 166
sklearn/mixture/_base.py - 139
sklearn/mixture/_gaussian_mixture.py - 504

@DatenBiene
Contributor

Claim sklearn/ensemble/_gb.py - 887, 1360

@DatenBiene
Contributor

Claim sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py - 736, 918

@GregoireMialon
Contributor

Claim sklearn/neural_network/_rbm.py - 59

@DatenBiene
Contributor

Claim :

sklearn/svm/_classes.py - 90, 312, 546, 752
sklearn/svm/_base.py - 853

@DatenBiene
Contributor

Claim:

sklearn/feature_selection/_mutual_info.py - 226, 335, 414
sklearn/metrics/cluster/_unsupervised.py - 80
sklearn/utils/_testing.py - 521
sklearn/utils/__init__.py - 478, 623

@DatenBiene
Contributor

Claim :

sklearn/dummy.py - 59
sklearn/random_projection.py - 178, 245, 464, 586

@jeremiedbb
Member

@DatenBiene @GregoireMialon Thanks for all your contributions during the last sprint. There are only 3 modules left unchecked!

Would you be interested in tackling those, if you have the time and motivation (no pressure!)?

@GregoireMialon
Contributor

GregoireMialon commented Feb 13, 2020 via email

@DatenBiene
Contributor

Hi @jeremiedbb! I will try to finish the 3 remaining modules today 😃

Claim:

sklearn/kernel_approximation.py - 41, 143, 470
sklearn/multiclass.py - 687
sklearn/ensemble/_base.py - 52

@DatenBiene
Contributor

Hi @jnothman and @jeremiedbb, it looks like all the files were modified. I would be happy to help if you find any remaining issues.

@cmarmo
Contributor

cmarmo commented Apr 16, 2020

Thanks a lot @DatenBiene and all the contributors that worked to close this issue!
I think we can close this huge one!
Feel free to open new specific issues if something is still missing about random_state description.
