
Support numpy.random.Generator and/or BitGenerator for random number generation #16988


Open
grisaitis opened this issue Apr 21, 2020 · 21 comments · May be fixed by #23962

@grisaitis

Describe the workflow you want to enable

I'd like to use a Generator or BitGenerator with scikit-learn where I'd otherwise use RandomState or a seed int.

For example:

import numpy as np

bit_generator = np.random.PCG64(seed=0)
generator = np.random.Generator(bit_generator)

and then use this for random_state= in scikit-learn:

from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit
from sklearn.svm import LinearSVC

X, y = make_classification(random_state=generator)  # or my bit_generator here 
classifier = LinearSVC(random_state=generator)
cv = ShuffleSplit(random_state=generator)

This fails because these methods expect a RandomState object or an int seed. The specific trigger is check_random_state(random_state).

Describe your proposed solution

This would require:

  • changing code to allow Generator or BitGenerator as acceptable values for random_state=... in every function and class constructor that accepts random_state
  • changing check_random_state() to allow Generator and/or BitGenerator objects (see the sketch after this list)
  • adding tests for using Generator or BitGenerator with classes and functions that consume random_state (similar to the existing tests for int seeds and RandomState objects)
  • changing any internal code that relies on RandomState methods that aren't available on Generator (e.g. rand, randn)
  • maybe switching to Generator instead of RandomState by default when an int seed is given
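A minimal sketch of what the extended check_random_state() could look like (the existing branches are paraphrased from sklearn.utils and may differ in details; the last two branches are the proposal):

import numbers
import numpy as np

def check_random_state(seed):
    # existing behavior (simplified)
    if seed is None or seed is np.random:
        return np.random.mtrand._rand  # the global RandomState
    if isinstance(seed, numbers.Integral):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        return seed
    # proposed: accept the numpy >= 1.17 objects
    if isinstance(seed, np.random.Generator):
        return seed
    if isinstance(seed, np.random.BitGenerator):
        return np.random.Generator(seed)
    raise ValueError(f"{seed!r} cannot be used to seed a random number generator")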

Describe alternatives you've considered, if relevant

The scope could include either or both of BitGenerator and Generator.

It might be easiest to allow only BitGenerator, and not Generator.

  • This still allows flexibility:
    • users keep control over the seed and the PRNG algorithm.
  • This is easier to implement (a BitGenerator can be treated much like an int seed):
    • a BitGenerator can be given to RandomState, and I think it then produces the same values as Generator would (see the sketch below).
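For illustration, a sketch of that easier path (whether the two front ends actually produce identical values depends on the distribution, since the legacy and new bit-to-number conversion algorithms differ):

import numpy as np

# same PCG64 bit stream behind two different front ends
rs = np.random.RandomState(np.random.PCG64(0))   # legacy API, new algorithm
rng = np.random.Generator(np.random.PCG64(0))    # new API

rs.random_sample(3)  # both consume the same kind of bit stream, but
rng.random(3)        # output equality is only expected for some distributions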

Additional context

NumPy v1.17 added the numpy.random.Generator (docs) interface for random number generation.

Overview:

  • Generator is similar to RandomState, but enables different PRNG algorithms
  • BitGenerator (docs) encapsulates the PRNG and seed value, e.g. PCG64(seed=0)
  • RandomState "is considered frozen" and uses "the slow Mersenne Twister" by default (docs)
  • RandomState can work with non-Mersenne BitGenerator objects
  • More info in NEP-19, the design document from NumPy.

The API for Generator and BitGenerator looks like:

from numpy import random

bit_generator = random.PCG64(seed=0)  # PCG64 is a BitGenerator subclass
generator = random.Generator(bit_generator)

generator.uniform(...)  # API is similar to RandomState

# there's also this, for making a PCG64-backed Generator
generator = random.default_rng(seed=0)
@rth
Member

rth commented May 1, 2020

Thanks @grisaitis. I agree this would be useful, and it's part of the larger discussion on the random_state API in #14042.

@NicolasHug
Member

changing any internal code that relies on RandomState methods that aren't available on Generator (e.g. rand, randn)

These are the attributes/methods that aren't supported in Generator:

import numpy as np

rd = np.random.RandomState(0)
gen = np.random.default_rng(0)
set(dir(rd)) - set(dir(gen))

{'get_state',
 'rand',
 'randint',
 'randn',
 'random_integers',
 'random_sample',
 'seed',
 'set_state',
 'tomaxint'}

and this is where we use (some of) them in the API:

for f in "rand(" randn randint; do git grep $f | grep -v -e "tests" -e "benchmarks" -e "examples"; done

doc/developers/utilities.rst:    >>> random_state.rand(4)
doc/modules/random_projection.rst:  >>> X = np.random.rand(100, 10000)
doc/modules/random_projection.rst:  >>> X = np.random.rand(100, 10000)
doc/tutorial/basic/tutorial.rst:  >>> X = rng.rand(10, 2000)
doc/whats_new/v0.23.rst:  algorithms. Platform-dependent C ``rand()`` was used, which is only able to
sklearn/datasets/_samples_generator.py:        centroids *= generator.rand(n_clusters, 1)
sklearn/datasets/_samples_generator.py:        centroids *= generator.rand(1, n_informative)
sklearn/datasets/_samples_generator.py:        A = 2 * generator.rand(n_informative, n_informative) - 1
sklearn/datasets/_samples_generator.py:        B = 2 * generator.rand(n_informative, n_redundant) - 1
sklearn/datasets/_samples_generator.py:        indices = ((n - 1) * generator.rand(n_repeated) + 0.5).astype(np.intp)
sklearn/datasets/_samples_generator.py:        flip_mask = generator.rand(n_samples) < flip_y
sklearn/datasets/_samples_generator.py:        shift = (2 * generator.rand(n_features) - 1) * class_sep
sklearn/datasets/_samples_generator.py:        scale = 1 + 100 * generator.rand(n_features)
sklearn/datasets/_samples_generator.py:    p_c = generator.rand(n_classes)
sklearn/datasets/_samples_generator.py:    p_w_c = generator.rand(n_features, n_classes)
sklearn/datasets/_samples_generator.py:                                generator.rand(y_size - len(y)))
sklearn/datasets/_samples_generator.py:        words = np.searchsorted(cumulative_p_w_sample, generator.rand(n_words))
sklearn/datasets/_samples_generator.py:    ground_truth[:n_informative, :] = 100 * generator.rand(n_informative,
sklearn/datasets/_samples_generator.py:    X = generator.rand(n_samples, n_features)
sklearn/datasets/_samples_generator.py:    X = generator.rand(n_samples, 4)
sklearn/datasets/_samples_generator.py:    X = generator.rand(n_samples, 4)
sklearn/datasets/_samples_generator.py:    A = generator.rand(n_dim, n_dim)
sklearn/datasets/_samples_generator.py:    X = np.dot(np.dot(U, 1.0 + np.diag(generator.rand(n_dim))), V)
sklearn/datasets/_samples_generator.py:    aux = random_state.rand(dim, dim)
sklearn/datasets/_samples_generator.py:                        * random_state.rand(np.sum(aux > alpha)))
sklearn/datasets/_samples_generator.py:    t = 1.5 * np.pi * (1 + 2 * generator.rand(1, n_samples))
sklearn/datasets/_samples_generator.py:    y = 21 * generator.rand(1, n_samples)
sklearn/datasets/_samples_generator.py:    t = 3 * np.pi * (generator.rand(1, n_samples) - 0.5)
sklearn/datasets/_samples_generator.py:    y = 2.0 * generator.rand(1, n_samples)
sklearn/ensemble/_gradient_boosting.pyx:          random_state.rand(n_total_samples)
sklearn/externals/_lobpcg.py:    >>> X = np.random.rand(n, 3)
sklearn/manifold/_mds.py:        X = random_state.rand(n_samples * n_components)
sklearn/manifold/_spectral_embedding.py:        X = random_state.rand(laplacian.shape[0], n_components + 1)
sklearn/manifold/_spectral_embedding.py:            X = random_state.rand(laplacian.shape[0], n_components + 1)
sklearn/metrics/pairwise.py:    >>> X = np.random.RandomState(0).rand(5, 3)
sklearn/mixture/_base.py:            resp = random_state.rand(n_samples, self.n_components)
sklearn/random_projection.py:    >>> X = rng.rand(100, 10000)
sklearn/random_projection.py:    >>> X = rng.rand(100, 10000)
sklearn/semi_supervised/_label_propagation.py:>>> random_unlabeled_points = rng.rand(len(iris.target)) < 0.3
sklearn/semi_supervised/_label_propagation.py:    >>> random_unlabeled_points = rng.rand(len(iris.target)) < 0.3
sklearn/semi_supervised/_label_propagation.py:    >>> random_unlabeled_points = rng.rand(len(iris.target)) < 0.3
sklearn/svm/src/newrand/newrand.h:// Scikit-Learn-specific random number generator replacing `rand()` originally
sklearn/svm/src/newrand/newrand.h:std::mt19937 mt_rand(std::mt19937::default_seed);
sklearn/svm/src/newrand/newrand.h:std::mt19937_64 mt_rand(std::mt19937::default_seed);
sklearn/svm/src/newrand/newrand.h:// - (2) public `set_seed()` function that should be used instead of `srand()` to set a new seed.
sklearn/svm/src/newrand/newrand.h:// - (3) New internal `bounded_rand_int` function, used instead of rand() everywhere.
sklearn/svm/src/newrand/newrand.h:    // return abs( (int)mt_rand()) % orig_range;
sklearn/svm/src/newrand/newrand.h:    uint32_t x = mt_rand();
sklearn/svm/src/newrand/newrand.h:            x = mt_rand();
sklearn/utils/estimator_checks.py:    X = rng.rand(40, 10)
sklearn/utils/estimator_checks.py:        y = (2 * rng.rand(40)).astype(np.int)
sklearn/utils/estimator_checks.py:        y = (4 * rng.rand(40)).astype(np.int)
sklearn/utils/estimator_checks.py:    X = _pairwise_estimator_convert_X(rng.rand(40, 10), estimator_orig)
sklearn/utils/random.py:                                          rng.rand(nnz))
doc/developers/develop.rst:            return self.random_state_.randn(n_samples, n_components)
sklearn/cluster/_affinity_propagation.py:          random_state.randn(n_samples, n_samples))
sklearn/datasets/_samples_generator.py:    X[:, :n_informative] = generator.randn(n_samples, n_informative)
sklearn/datasets/_samples_generator.py:        X[:, -n_useless:] = generator.randn(n_samples, n_useless)
sklearn/datasets/_samples_generator.py:        X = generator.randn(n_samples, n_features)
sklearn/datasets/_samples_generator.py:        + 10 * X[:, 3] + 5 * X[:, 4] + noise * generator.randn(n_samples)
sklearn/datasets/_samples_generator.py:        + noise * generator.randn(n_samples)
sklearn/datasets/_samples_generator.py:        + noise * generator.randn(n_samples)
sklearn/datasets/_samples_generator.py:    u, _ = linalg.qr(generator.randn(n_samples, n), mode='economic')
sklearn/datasets/_samples_generator.py:    v, _ = linalg.qr(generator.randn(n_features, n), mode='economic')
sklearn/datasets/_samples_generator.py:    D = generator.randn(n_features, n_components)
sklearn/datasets/_samples_generator.py:        X[idx, i] = generator.randn(n_nonzero_coefs)
sklearn/datasets/_samples_generator.py:    X += noise * generator.randn(3, n_samples)
sklearn/datasets/_samples_generator.py:    X += noise * generator.randn(3, n_samples)
sklearn/decomposition/_dict_learning.py:            dictionary[:, k] = random_state.randn(n_features)
sklearn/decomposition/_nmf.py:        H = avg * rng.randn(n_components, n_features).astype(X.dtype,
sklearn/decomposition/_nmf.py:        W = avg * rng.randn(n_samples, n_components).astype(X.dtype,
sklearn/decomposition/_nmf.py:        W[W == 0] = abs(avg * rng.randn(len(W[W == 0])) / 100)
sklearn/decomposition/_nmf.py:        H[H == 0] = abs(avg * rng.randn(len(H[H == 0])) / 100)
sklearn/feature_selection/_mutual_info.py:        X[:, continuous_mask] += 1e-10 * means * rng.randn(
sklearn/feature_selection/_mutual_info.py:        y += 1e-10 * np.maximum(1, np.mean(np.abs(y))) * rng.randn(n_samples)
sklearn/kernel_ridge.py:    >>> y = rng.randn(n_samples)
sklearn/kernel_ridge.py:    >>> X = rng.randn(n_samples, n_features)
sklearn/linear_model/_ridge.py:    >>> y = rng.randn(n_samples)
sklearn/linear_model/_ridge.py:    >>> X = rng.randn(n_samples, n_features)
sklearn/linear_model/_sag.py:    >>> X = rng.randn(n_samples, n_features)
sklearn/linear_model/_sag.py:    >>> y = rng.randn(n_samples)
sklearn/linear_model/_stochastic_gradient.py:    >>> y = rng.randn(n_samples)
sklearn/linear_model/_stochastic_gradient.py:    >>> X = rng.randn(n_samples, n_features)
sklearn/manifold/_t_sne.py:            X_embedded = 1e-4 * random_state.randn(
sklearn/mixture/_base.py:                mean + rng.randn(sample, n_features) * np.sqrt(covariance)
sklearn/neighbors/_nca.py:                transformation = self.random_state_.randn(n_components,
sklearn/svm/_classes.py:    >>> y = rng.randn(n_samples)
sklearn/svm/_classes.py:    >>> X = rng.randn(n_samples, n_features)
sklearn/svm/_classes.py:    >>> y = np.random.randn(n_samples)
sklearn/svm/_classes.py:    >>> X = np.random.randn(n_samples, n_features)
sklearn/utils/estimator_checks.py:    X = rng.randn(10, 5)
sklearn/utils/estimator_checks.py:    X_test = np.random.randn(20, 2) + 4
doc/developers/develop.rst:        i = random_state.randint(X.shape[0])
doc/getting_started.rst:  >>> from scipy.stats import randint
doc/getting_started.rst:  >>> param_distributions = {'n_estimators': randint(1, 5),
doc/getting_started.rst:  ...                        'max_depth': randint(5, 10)}
doc/modules/grid_search.rst:``uniform`` or ``randint``.
doc/modules/impute.rst:  >>> mask = np.random.randint(0, 2, size=X.shape).astype(np.bool)
sklearn/cluster/_kmeans.py:    center_id = random_state.randint(n_samples)
sklearn/cluster/_kmeans.py:        init_indices = random_state.randint(0, n_samples, init_size)
sklearn/cluster/_kmeans.py:        seeds = random_state.randint(np.iinfo(np.int32).max, size=n_init)
sklearn/cluster/_kmeans.py:        validation_indices = random_state.randint(0, n_samples, init_size)
sklearn/cluster/_kmeans.py:            minibatch_indices = random_state.randint(
sklearn/cluster/_kmeans.py:            random_reassign = self.random_state_.randint(
sklearn/cluster/_spectral.py:        rotation[:, 0] = vectors[random_state.randint(n_samples), :].T
sklearn/datasets/_kddcup99.py:        r = random_state.randint(0, n_samples_abnormal, 3377)
sklearn/datasets/_samples_generator.py:        return np.hstack([rng.randint(2, size=(samples, dimensions - 30)),
sklearn/datasets/_samples_generator.py:        y[flip_mask] = generator.randint(n_classes, size=flip_mask.sum())
sklearn/datasets/_samples_generator.py:            words = generator.randint(n_features, size=n_words)
sklearn/dummy.py:                ret = [classes_[k][rs.randint(n_classes_[k], size=n_samples)]
sklearn/ensemble/_bagging.py:        indices = random_state.randint(0, n_population, n_samples)
sklearn/ensemble/_bagging.py:            random_state.randint(MAX_INT, size=len(self.estimators_))
sklearn/ensemble/_bagging.py:        seeds = random_state.randint(MAX_INT, size=n_more_estimators)
sklearn/ensemble/_base.py:            to_set[key] = random_state.randint(np.iinfo(np.int32).max)
sklearn/ensemble/_forest.py:    sample_indices = random_instance.randint(0, n_samples, n_samples_bootstrap)
sklearn/ensemble/_forest.py:                random_state.randint(MAX_INT, size=len(self.estimators_))
sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:            self._random_seed = rng.randint(np.iinfo(np.uint32).max,
sklearn/feature_extraction/image.py:        i_s = rng.randint(i_h - p_h + 1, size=n_patches)
sklearn/feature_extraction/image.py:        j_s = rng.randint(i_w - p_w + 1, size=n_patches)
sklearn/inspection/_permutation_importance.py:    random_seed = random_state.randint(np.iinfo(np.int32).max + 1)
sklearn/linear_model/_base.py:    seed = rng.randint(1, np.iinfo(np.int32).max)
sklearn/linear_model/_cd_fast.pyx:    cdef UINT32_t rand_r_state_seed = rng.randint(0, RAND_R_MAX)
sklearn/linear_model/_cd_fast.pyx:    cdef UINT32_t rand_r_state_seed = rng.randint(0, RAND_R_MAX)
sklearn/linear_model/_cd_fast.pyx:    cdef UINT32_t rand_r_state_seed = rng.randint(0, RAND_R_MAX)
sklearn/linear_model/_cd_fast.pyx:    cdef UINT32_t rand_r_state_seed = rng.randint(0, RAND_R_MAX)
sklearn/linear_model/_stochastic_gradient.py:    seed = random_state.randint(MAX_INT)
sklearn/linear_model/_stochastic_gradient.py:        seeds = random_state.randint(MAX_INT, size=len(self.classes_))
sklearn/linear_model/_stochastic_gradient.py:        seed = random_state.randint(0, np.iinfo(np.int32).max)
sklearn/manifold/_mds.py:        seeds = random_state.randint(np.iinfo(np.int32).max, size=n_init)
sklearn/model_selection/_search.py:                        params[k] = v[rng.randint(len(v))]
sklearn/naive_bayes.py:    >>> X = rng.randint(5, size=(6, 100))
sklearn/naive_bayes.py:    >>> X = rng.randint(5, size=(6, 100))
sklearn/naive_bayes.py:    >>> X = rng.randint(5, size=(6, 100))
sklearn/naive_bayes.py:    >>> X = rng.randint(5, size=(6, 100))
sklearn/neural_network/_rbm.py:               rng.randint(0, v.shape[1], v.shape[0]))
sklearn/svm/_base.py:        seed = rnd.randint(np.iinfo('i').max)
sklearn/svm/_base.py:        class_weight_, max_iter, rnd.randint(np.iinfo('i').max),
sklearn/svm/_base.py:    # Regarding rnd.randint(..) in the above signature:
sklearn/tree/_splitter.pyx:        self.rand_r_state = self.random_state.randint(0, RAND_R_MAX)
sklearn/utils/__init__.py:            indices = random_state.randint(0, n_samples, size=(max_n_samples,))
sklearn/utils/_random.pyx:            O(O(np.random.randint) * \sum_{i=1}^n_samples 1 /
sklearn/utils/_random.pyx:            <= O(O(np.random.randint) *
sklearn/utils/_random.pyx:            <= O(O(np.random.randint) *
sklearn/utils/_random.pyx:    rng_randint = rng.randint
sklearn/utils/_random.pyx:        j = rng_randint(n_population)
sklearn/utils/_random.pyx:            j = rng_randint(n_population)
sklearn/utils/_random.pyx:    Time complexity: O(n_population +  O(np.random.randint) * n_samples)
sklearn/utils/_random.pyx:    rng_randint = rng.randint
sklearn/utils/_random.pyx:        j = rng_randint(n_population - i)  # invariant: non-selected at [0,n-i)
sklearn/utils/_random.pyx:        O((n_population - n_samples) * O(np.random.randint) + n_samples)
sklearn/utils/_random.pyx:    rng_randint = rng.randint
sklearn/utils/_random.pyx:        j = rng_randint(0, i + 1)
sklearn/utils/estimator_checks.py:    y = rnd.randint(3, size=X.shape[0])
sklearn/utils/estimator_checks.py:        y_ = np.vstack([y, 2 * y + rnd.randint(2, size=len(y))])
sklearn/utils/estimator_checks.py:        y_ = np.vstack([y, 2 * y + rnd.randint(2, size=len(y))])
sklearn/utils/estimator_checks.py:        y = rng.randint(low=0, high=2, size=n_samples)
sklearn/utils/estimator_checks.py:        y = rng.randint(low=0, high=2, size=n_samples)

If we want to support generators, I'm afraid that means we have to start wrapping RandomState and Generator into something that can fall back when randint and the like aren't available?
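A rough sketch of what such a shim could look like (a hypothetical helper, not existing scikit-learn code):

import numpy as np

class _RNGAdapter:
    # hypothetical shim exposing legacy RandomState names on top of either object
    def __init__(self, rng):
        self._rng = rng
        self._legacy = isinstance(rng, np.random.RandomState)

    def randint(self, low, high=None, size=None):
        if self._legacy:
            return self._rng.randint(low, high=high, size=size)
        return self._rng.integers(low, high=high, size=size)

    def rand(self, *shape):
        if self._legacy:
            return self._rng.rand(*shape)
        return self._rng.random(shape or None)

    def __getattr__(self, name):
        # everything else (uniform, choice, permutation, ...) passes through
        return getattr(self._rng, name)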

@lorentzenchr lorentzenchr added this to the 1.0 milestone Nov 15, 2020
@lorentzenchr
Member

I milestoned it for 1.0 as I think #14042 and SLEP011 have a different focus, and I would like to have support for the new numpy random Generators, which have some clear advantages over the old RandomState.

@NicolasHug
Member

NicolasHug commented Nov 15, 2020

Before milestoning it I think we need to collectively agree this is something that we want to support. A 1.0 milestone seems unrealistic to me considering that we aim at releasing 1.0 in place of 0.25/0.26.

As noted above it would not be a trivial change, and our support for RandomState is already very complex and subtle, even for us (see docs, and the new version of SLEP011 which I never had the guts to submit).

I'd be interested in knowing the impact of using Generators on our CV procedures and meta-estimators, for example.

@lorentzenchr
Member

lorentzenchr commented Nov 15, 2020

@NicolasHug I fully agree that we need to agree on this. But I wanted to give it some visibility. I can put it on the agenda of the next dev meeting and/or label it "breaking change" (though I think it does not need to break anything).

One large impact is that Generator will produce different random numbers than RandomState. But it has better randomness properties, is faster in several instances (e.g. choice, see also numpy PR #13812), and is usable with Cython! See here for a high-level overview of the differences/improvements.
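A quick demonstration that the streams differ:

import numpy as np

rs = np.random.RandomState(0)   # legacy MT19937 stream
rng = np.random.default_rng(0)  # PCG64 stream

rs.random_sample(3)  # different numbers than below, even with the same seed
rng.random(3)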

@lorentzenchr
Member

Decision of https://github.com/scikit-learn/administrative/blob/master/meeting_notes/2020-11-30.md: we keep it in the 1.0 milestone for now. This may, however, change.

@albertcthomas
Contributor

The new numpy RNGs can also be easily used for parallel random number generation. This would provide a better alternative when creating seeds in bagging for instance (see here). I do not know if there are other such examples in the scikit-learn codebase.
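For example, the spawning pattern from the numpy parallel-generation docs, sketched for a bagging-style use (the counts here are arbitrary):

import numpy as np

ss = np.random.SeedSequence(12345)
child_seeds = ss.spawn(10)  # e.g. one independent seed per bagging estimator
streams = [np.random.default_rng(s) for s in child_seeds]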

@NicolasHug
Member

That's a nice feature but we can only take advantage of that once we stop supporting RandomState instances. As long as RandomState is supported, we'll have to resort to workarounds like the one in the link you provided.

@albertcthomas
Contributor

albertcthomas commented Dec 14, 2020 via email

@lorentzenchr
Member

Implementing this proposal becomes much easier once numpy 1.17.0, where the new random Generators were introduced, becomes the minimum supported numpy version.

@jamesmyatt
Contributor

jamesmyatt commented Apr 6, 2021

Sorry for the drive-by comment, but the NEP 29 deprecation date for NumPy 1.16 (13-Jan-2021) has now passed. Now, this doesn't mean that scikit-learn should stop supporting RandomState (even if that would be nice). It just means that it should be OK to use the new numpy.random API now without adapters or defensive programming. It's also worth remembering that if this is implemented as a breaking change, then it might be easiest to introduce it in v<=1.0.0, rather than waiting for v2.0.

From reading this and other threads (e.g. #14042, scikit-learn/enhancement_proposals#24, https://scikit-learn.org/stable/common_pitfalls.html#controlling-randomness), it's clear that this is a complicated issue.

@NicolasHug
Member

It just means that it should be OK to use the new numpy.random API now without adapters or defensive programming.

Would you mind elaborating on this? For now I don't understand how NEP29 and the deprecation of numpy 1.16 changes anything to the concerns raised in #16988 (comment). It seems to me that these will be valid concerns for as long as we support RandomState.

@jamesmyatt
Contributor

jamesmyatt commented Apr 11, 2021

It just means that it should be OK to use the new numpy.random API now without adapters or defensive programming.

Would you mind elaborating on this? For now I don't understand how NEP29 and the deprecation of numpy 1.16 changes anything to the concerns raised in #16988 (comment). It seems to me that these will be valid concerns for as long as we support RandomState.

NEP 29 says:

When a project releases a new major or minor version, we recommend that they support at least all minor versions of Python introduced and released in the prior 42 months from the anticipated release date with a minimum of 2 minor versions of Python, and all minor versions of NumPy released in the prior 24 months from the anticipated release date with a minimum of 3 minor versions of NumPy.

i.e. there should be no obligation to support NumPy 1.16 in any major or minor release after Jan 13, 2021.

The main thing is that, if you bump the minimum version of NumPy to 1.17, then you can write things like: if isinstance(random_state, np.random.BitGenerator) without first checking if np.random.BitGenerator exists.

Also, since NumPy 1.17, the RandomState init has permitted a BitGenerator as an input. So the simplest workaround for handling Generator inputs to these functions is to extract the associated bit generator and re-wrap it as a RandomState: np.random.RandomState(rng.bit_generator). This will use the new bit generators along with the legacy API and non-uniform algorithms. However, doing this will not give you many of the main advantages of the new random API. A better long-term solution might be to extract the bit generator from the RandomState input and re-wrap it as a Generator and use the newer, faster, cleaner, more powerful API everywhere.
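Concretely (note that the two objects then share one PCG64 state, so their draws interleave):

import numpy as np

rng = np.random.default_rng(0)                     # user passes a Generator
legacy = np.random.RandomState(rng.bit_generator)  # legacy API, same bit stream

legacy.randint(10, size=5)  # old method names keep working
rng.integers(10, size=5)    # advances the same shared PCG64 state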

As for #16988 (comment) specifically, I think most of those are cases of methods having been given better names in the new API than in the legacy one (see https://numpy.org/doc/stable/reference/random/index.html#quick-start). I don't think there's any guarantee that two methods with the same name will give the same random number streams, since many of the algorithms used to convert the random bits to random numbers have been improved (e.g. more efficient implementations), or even that they have exactly the same arguments. The exception is the state-related ones, which are even more complicated (e.g. https://numpy.org/doc/stable/reference/random/generated/numpy.random.RandomState.get_state.html).

I hope this is clear; I know it's brief and the topic is complex.

@rkern

rkern commented Aug 4, 2021

FYI, rand, randn, and random_sample should all be considered disrecommended variants and aliases, and you should use the preferred methods on RandomState that also exist on Generator: random() and standard_normal().

For randint(), we use a utility function in scipy to wrap around either a RandomState or Generator. Part of the improvement of Generator.integers() was the semantics of processing the arguments, so there wasn't a great way to retain randint(). The wrapper function uses those new arguments, so there is a tiny bit of thought to be applied when making that replacement.

I don't think you use any of the others.
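For illustration, a sketch of such a wrapper (not scipy's actual code; the endpoint handling is the "tiny bit of thought" mentioned above):

import numpy as np

def rng_integers(gen, low, high=None, size=None, endpoint=False):
    # works with either a Generator or a legacy RandomState
    if isinstance(gen, np.random.Generator):
        return gen.integers(low, high=high, size=size, endpoint=endpoint)
    if endpoint:
        # randint's upper bound is exclusive, so shift by one to include it
        if high is None:
            low = low + 1   # integers(low, endpoint=True) samples [0, low]
        else:
            high = high + 1
    return gen.randint(low, high=high, size=size)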

@grisaitis
Author

grisaitis commented Aug 11, 2021

@rkern Thanks for that summary of the API changes.

Regarding the first group of methods...

FYI, rand, randn, and random_sample should all be considered disrecommended variants and aliases, and you should use the preferred methods on RandomState that also exist on Generator: random() and standard_normal().

Would a good first step (PR) be to refactor the calls to rand, etc, to the new random and standard_normal? This seems pretty doable. The arguments look only slightly different, and the new Generator methods are also available on RandomState. So code would be backward compatible with RandomState.
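For example (the new names take a shape tuple instead of separate dimension arguments, and are available on both RandomState and Generator in numpy >= 1.17):

import numpy as np

rng = np.random.default_rng(0)   # np.random.RandomState(0) also works here

rng.random((3, 4))               # replaces rand(3, 4)
rng.standard_normal((3, 4))      # replaces randn(3, 4)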

Regarding integers / randint... Not sure the best way forward on that. Any suggestions @NicolasHug @lorentzenchr ?

Edit: it looks like randint is only used for k-means and spectral clustering in the core sklearn codebase, and then in lots of examples and benchmarks. If I'm not mistaken, refactoring that might not be so bad as far as core sklearn is concerned.

@adrinjalali
Member

Doesn't seem like this is gonna be resolved before the release. Removing the milestone. Please re-tag it for an appropriate next release if necessary, since I think you're more aware of the progress on this topic than I am.

@lorentzenchr
Member

As a cross reference, see the discussion in scipy/scipy#14322 (comment) and the following comments.

@amueller
Member

@rkern could you maybe either point towards some documentation or explain how the current design helps with the usability questions we describe in https://scikit-learn.org/stable/common_pitfalls.html#controlling-randomness and that I mentioned in the other thread?

@betatim
Member

betatim commented Apr 21, 2023

One document I think is relevant and a good read(!) is https://numpy.org/devdocs/reference/random/parallel.html

@rkern

rkern commented Apr 21, 2023

Let me summarize my simplified understanding of the main issue. I know there are more issues, particularly the internal use of parallelism, but let me focus on what I took as the main issue from that document. For most of scikit-learn's classes, you do want all random draws used when calling their methods to be independent, so holding a PRNG instance and drawing from it in sequence works fine. But there are a few exceptions where you do want certain operations on certain objects to repeat their results. In particular, when comparing multiple models using the scores over cross-validation splits, splitters should do the same splits when given the same data.

This special case is an inherent complexity. However, when you only have the relatively inflexible primitives of integer seeds and RandomState, this inherent complexity got amplified and placed on the user. Because the main pattern of passing around a single RandomState doesn't work for CV, you have to have this big document to explain how to use a different pattern in that one special case. The user has to learn this complex mental model because RandomState did not give you, as the API designers, the tools to hide that complexity from them.

With the Generator infrastructure, we have more flexible and capable primitives for you to work with. This will allow you, as implementors, to design the API and implementation of your objects so you can make the choice which objects need to internally repeat their results and which ones should be independent with each call. Or if you do want to expose that choice to the user, you can make it an explicit option (i.e. just a bool argument) rather than an implicit consequence of different styles of PRNG state management. With the Generator infrastructure, you can tell users to create one Generator using np.random.default_rng(seed) and pass that around to all objects. You can use the capabilities of that infrastructure to handle the complexities internally. Users may still have to know that CV splitters use randomness differently (the inherent complexity), but they won't have to change whole patterns of data flow to do that.

Lengthy explanation

The core enabling technology for the Generator infrastructure is SeedSequence. We use this object as the way to convert integer seeds from the user to the initial states for the PRNG algorithm. It takes arbitrary amounts of integer data (one arbitrarily large integer or a sequence of them) and hashes them together carefully to create high quality states for the PRNGs. This will be true for instantiating multiple Generators with closely related seeds like int_seed, [int_seed, 0], [int_seed, 1], etc. (remember that sequences of integers are allowed). They will each create PRNG states that will be usefully independent from one another (within the limits of pseudo-randomness in the first place). This algorithm enabled us to implement a pattern that we called SeedSequence spawning to deterministically create multiple usefully-independent PRNGs for parallel use. I'm going to talk about SeedSequence here so you know about the internals and how it all works, but your users won't have to work with it. We have hidden everything under Generator, so all the users have to do is grab one with default_rng(seed) and pass it around.

default_rng(seed) will take the seed and create a SeedSequence and pass that to the BitGenerator (which is the core uniform PRNG algorithm) and wraps that up in a Generator (which implements all of the fancy non-uniform distributions and is what the user works with). The BitGenerator uses the SeedSequence to get a good PRNG state to start with, but keeps a reference to the SeedSequence for later. So we can reach down to the SeedSequence, spawn new SeedSequences from it, then wrap those up in the BitGenerator and Generator. In the upcoming 1.25 release, we have exposed a Generator.spawn() method that will just do all of that. But you can use a utility function that will do the same until you can depend on 1.25. From here on, I'll just talk about "spawning Generators" for simplicity.
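Until 1.25 can be required, a small utility along these lines does the same (caveat: it reaches into a private attribute of the BitGenerator):

import numpy as np

def spawn_generators(rng, n):
    if hasattr(rng, "spawn"):     # Generator.spawn(), numpy >= 1.25
        return rng.spawn(n)
    # fallback: pull out the SeedSequence (private attribute) and spawn it
    seed_seq = rng.bit_generator._seed_seq
    return [np.random.Generator(type(rng.bit_generator)(child))
            for child in seed_seq.spawn(n)]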

Whenever you get a Generator, you can spawn from it. The internal SeedSequence essentially is building a unique path of the splits it took down its spawn tree and hashing that together to get the (almost certainly) unique internal state. So you can spawn a Generator from a spawned Generator without worry. This means you don't have to do any manual coordination of seeds and can just reason about your code locally. You can probably see how this all can be used for actual multiprocessing parallelism, but we can get into that later. Now we want to talk about how to use this to make CV splits repeatable over multiple calls to cv.split(X, y), and that's less obvious.

In general, it is dicey to clone RandomState or Generator objects. Except in very special circumstances, you run the risk of having multiple correlated streams of pseudo-randomness when you did not want them. But in the case of cv.split(), you do want that correlation. The main principle to remember is "Only clone it if you own it." That is, only if you own the entire Generator's lifecycle from creation and all of its method calls, then it can be safe to clone it (i.e. copy.deepcopy(rng)) and reuse it to achieve those repeatable effects. With RandomState, it was hard to really own it in that way. If you were passed in a RandomState, there was no good way to get a fresh one deterministically that you could own and cloning the passed-in RandomState violated the principle. The recommendation to pass in an integer seed to the CV APIs was the only way to do this. When you built a RandomState from a seed, you owned it and could reset to that initial state over and over again when you needed to (cloning in effect).

With Generator spawning, you now can take a passed-in Generator, spawn a Generator that you can actually own, and clone it to keep in safe storage for reuse later. We are using this pattern in SciPy. We have quasi-Monte Carlo generators that use pseudo-random scrambling. We want to be able to reset them to their initial state. We coerce the input to a Generator (i.e. rng = default_rng(rng), so no matter if it's None, an integer seed, or a live Generator, we get a Generator), we spawn from it a Generator that we own, then keep a static clone of our owned, spawned Generator in storage. We never call any methods on the stored static copy; it always remains in that original state. Whenever we want to start over (e.g. at the start of cv.split()), we clone a working copy from the stored static copy and use the working copy.
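In code, the pattern looks roughly like this (class and method names are hypothetical; Generator.spawn() needs numpy >= 1.25, or the fallback sketched above):

import copy
import numpy as np

class RepeatableSplitter:
    def __init__(self, random_state=None):
        rng = np.random.default_rng(random_state)  # coerce None/int/Generator
        # spawn a Generator we own; never call methods on this stored copy
        self._frozen = rng.spawn(1)[0]

    def split(self, X):
        rng = copy.deepcopy(self._frozen)  # fresh working copy, same state
        return rng.permutation(len(X))     # hence identical results per call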

All of this is information that the user doesn't need to know. They might need to know that cv.split(X, y) will always return the same results where other objects don't behave that way. That's the inherent complexity that you get a benefit from. But they don't have to know anything about it to correctly construct a CV splitter. The rule is the same for all objects. As the implementors, you get to make the decision.

@lorentzenchr
Member

For cross ref: (draft) SPEC 7 scientific-python/specs#180 mentions scikit-learn.

stefanv added a commit to stefanv/scikit-image that referenced this issue May 19, 2023
Otherwise, the user can still draw values from the RNG and change its
state.

See scikit-learn/scikit-learn#16988 (comment)
jarrodmillman pushed a commit to scikit-image/scikit-image that referenced this issue May 22, 2023
Otherwise, the user can still draw values from the RNG and change its
state.

See scikit-learn/scikit-learn#16988 (comment)