
Support numpy.random.Generator and/or BitGenerator for random number generation #16988


Open
grisaitis opened this issue Apr 21, 2020 · 21 comments · May be fixed by #23962

@grisaitis

Describe the workflow you want to enable

I'd like to use a Generator or BitGenerator with scikit-learn where I'd otherwise use RandomState or a seed int.

For example:

import numpy as np

bit_generator = np.random.PCG64(seed=0)
generator = np.random.Generator(bit_generator)

and then use this for random_state= in scikit-learn:

from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit
from sklearn.svm import LinearSVC

X, y = make_classification(random_state=generator)  # or my bit_generator here 
classifier = LinearSVC(random_state=generator)
cv = ShuffleSplit(random_state=generator)

This fails because these methods expect a RandomState object or an int seed. The specific trigger is check_random_state(random_state).

Describe your proposed solution

This would require:

  • changing code to allow Generator or BitGenerator as acceptable values for random_state=... in every function and class constructor that accepts random_state
  • changing check_random_state() to allow Generator and/or BitGenerator objects (see the sketch after this list)
  • adding tests for using Generator or BitGenerator with classes and functions that consume random_state (similar to the existing tests for int seeds and RandomState objects)
  • changing any internal code that relies on RandomState methods that aren't available on Generator (e.g. rand, randn)
  • maybe switching to Generator instead of RandomState by default when an int seed is given
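A minimal sketch of what the extended check_random_state() could look like (the existing branches are paraphrased from sklearn.utils and may differ in details; the last two branches are the proposal):

import numbers
import numpy as np

def check_random_state(seed):
    # existing behavior (simplified)
    if seed is None or seed is np.random:
        return np.random.mtrand._rand  # the global RandomState
    if isinstance(seed, numbers.Integral):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        return seed
    # proposed: accept the numpy >= 1.17 objects
    if isinstance(seed, np.random.Generator):
        return seed
    if isinstance(seed, np.random.BitGenerator):
        return np.random.Generator(seed)
    raise ValueError(f"{seed!r} cannot be used to seed a random number generator")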

Describe alternatives you've considered, if relevant

The scope could include either or both of BitGenerator and Generator.

It might be easiest to allow only BitGenerator, and not Generator.

  • This still allows flexibility:
    • users keep control over the seed and the PRNG algorithm.
  • This is easier to implement (a BitGenerator can be treated much like an int seed):
    • a BitGenerator can be given to RandomState, and I think it then produces the same values as Generator would (see the sketch below).
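For illustration, a sketch of that easier path (whether the two front ends actually produce identical values depends on the distribution, since the legacy and new bit-to-number conversion algorithms differ):

import numpy as np

# same PCG64 bit stream behind two different front ends
rs = np.random.RandomState(np.random.PCG64(0))   # legacy API, new algorithm
rng = np.random.Generator(np.random.PCG64(0))    # new API

rs.random_sample(3)  # both consume the same kind of bit stream, but
rng.random(3)        # output equality is only expected for some distributions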

Additional context

NumPy v1.17 added the numpy.random.Generator (docs) interface for random number generation.

Overview:

  • Generator is similar to RandomState, but enables different PRNG algorithms
  • BitGenerator (docs) encapsulates the PRNG and seed value, e.g. PCG64(seed=0)
  • RandomState "is considered frozen" and uses "the slow Mersenne Twister" by default (docs)
  • RandomState can work with non-Mersenne BitGenerator objects
  • More info in NEP-19, the design document from NumPy.

The API for Generator and BitGenerator looks like:

from numpy import random

bit_generator = random.PCG64(seed=0)  # PCG64 is a BitGenerator subclass
generator = random.Generator(bit_generator)

generator.uniform(...)  # API is similar to RandomState

# there's also this, for making a PCG64-backed Generator
generator = random.default_rng(seed=0)
@rth
Member

rth commented May 1, 2020

Thanks @grisaitis. I agree this would be useful, and it's part of the larger discussion on the random_state API in #14042.

@NicolasHug
Member

changing any internal code that relies on RandomState methods that aren't available on Generator (e.g. rand, randn)

These are the attributes/methods that aren't supported in Generator:

import numpy as np

rd = np.random.RandomState(0)
gen = np.random.default_rng(0)
set(dir(rd)) - set(dir(gen))

{'get_state',
 'rand',
 'randint',
 'randn',
 'random_integers',
 'random_sample',
 'seed',
 'set_state',
 'tomaxint'}

and this is where we use (some of) them in the API:

for f in "rand(" randn randint; do git grep $f | grep -v -e "tests" -e "benchmarks" -e "examples"; done

doc/developers/utilities.rst:    >>> random_state.rand(4)
doc/modules/random_projection.rst:  >>> X = np.random.rand(100, 10000)
doc/modules/random_projection.rst:  >>> X = np.random.rand(100, 10000)
doc/tutorial/basic/tutorial.rst:  >>> X = rng.rand(10, 2000)
doc/whats_new/v0.23.rst:  algorithms. Platform-dependent C ``rand()`` was used, which is only able to
sklearn/datasets/_samples_generator.py:        centroids *= generator.rand(n_clusters, 1)
sklearn/datasets/_samples_generator.py:        centroids *= generator.rand(1, n_informative)
sklearn/datasets/_samples_generator.py:        A = 2 * generator.rand(n_informative, n_informative) - 1
sklearn/datasets/_samples_generator.py:        B = 2 * generator.rand(n_informative, n_redundant) - 1
sklearn/datasets/_samples_generator.py:        indices = ((n - 1) * generator.rand(n_repeated) + 0.5).astype(np.intp)
sklearn/datasets/_samples_generator.py:        flip_mask = generator.rand(n_samples) < flip_y
sklearn/datasets/_samples_generator.py:        shift = (2 * generator.rand(n_features) - 1) * class_sep
sklearn/datasets/_samples_generator.py:        scale = 1 + 100 * generator.rand(n_features)
sklearn/datasets/_samples_generator.py:    p_c = generator.rand(n_classes)
sklearn/datasets/_samples_generator.py:    p_w_c = generator.rand(n_features, n_classes)
sklearn/datasets/_samples_generator.py:                                generator.rand(y_size - len(y)))
sklearn/datasets/_samples_generator.py:        words = np.searchsorted(cumulative_p_w_sample, generator.rand(n_words))
sklearn/datasets/_samples_generator.py:    ground_truth[:n_informative, :] = 100 * generator.rand(n_informative,
sklearn/datasets/_samples_generator.py:    X = generator.rand(n_samples, n_features)
sklearn/datasets/_samples_generator.py:    X = generator.rand(n_samples, 4)
sklearn/datasets/_samples_generator.py:    X = generator.rand(n_samples, 4)
sklearn/datasets/_samples_generator.py:    A = generator.rand(n_dim, n_dim)
sklearn/datasets/_samples_generator.py:    X = np.dot(np.dot(U, 1.0 + np.diag(generator.rand(n_dim))), V)
sklearn/datasets/_samples_generator.py:    aux = random_state.rand(dim, dim)
sklearn/datasets/_samples_generator.py:                        * random_state.rand(np.sum(aux > alpha)))
sklearn/datasets/_samples_generator.py:    t = 1.5 * np.pi * (1 + 2 * generator.rand(1, n_samples))
sklearn/datasets/_samples_generator.py:    y = 21 * generator.rand(1, n_samples)
sklearn/datasets/_samples_generator.py:    t = 3 * np.pi * (generator.rand(1, n_samples) - 0.5)
sklearn/datasets/_samples_generator.py:    y = 2.0 * generator.rand(1, n_samples)
sklearn/ensemble/_gradient_boosting.pyx:          random_state.rand(n_total_samples)
sklearn/externals/_lobpcg.py:    >>> X = np.random.rand(n, 3)
sklearn/manifold/_mds.py:        X = random_state.rand(n_samples * n_components)
sklearn/manifold/_spectral_embedding.py:        X = random_state.rand(laplacian.shape[0], n_components + 1)
sklearn/manifold/_spectral_embedding.py:            X = random_state.rand(laplacian.shape[0], n_components + 1)
sklearn/metrics/pairwise.py:    >>> X = np.random.RandomState(0).rand(5, 3)
sklearn/mixture/_base.py:            resp = random_state.rand(n_samples, self.n_components)
sklearn/random_projection.py:    >>> X = rng.rand(100, 10000)
sklearn/random_projection.py:    >>> X = rng.rand(100, 10000)
sklearn/semi_supervised/_label_propagation.py:>>> random_unlabeled_points = rng.rand(len(iris.target)) < 0.3
sklearn/semi_supervised/_label_propagation.py:    >>> random_unlabeled_points = rng.rand(len(iris.target)) < 0.3
sklearn/semi_supervised/_label_propagation.py:    >>> random_unlabeled_points = rng.rand(len(iris.target)) < 0.3
sklearn/svm/src/newrand/newrand.h:// Scikit-Learn-specific random number generator replacing `rand()` originally
sklearn/svm/src/newrand/newrand.h:std::mt19937 mt_rand(std::mt19937::default_seed);
sklearn/svm/src/newrand/newrand.h:std::mt19937_64 mt_rand(std::mt19937::default_seed);
sklearn/svm/src/newrand/newrand.h:// - (2) public `set_seed()` function that should be used instead of `srand()` to set a new seed.
sklearn/svm/src/newrand/newrand.h:// - (3) New internal `bounded_rand_int` function, used instead of rand() everywhere.
sklearn/svm/src/newrand/newrand.h:    // return abs( (int)mt_rand()) % orig_range;
sklearn/svm/src/newrand/newrand.h:    uint32_t x = mt_rand();
sklearn/svm/src/newrand/newrand.h:            x = mt_rand();
sklearn/utils/estimator_checks.py:    X = rng.rand(40, 10)
sklearn/utils/estimator_checks.py:        y = (2 * rng.rand(40)).astype(np.int)
sklearn/utils/estimator_checks.py:        y = (4 * rng.rand(40)).astype(np.int)
sklearn/utils/estimator_checks.py:    X = _pairwise_estimator_convert_X(rng.rand(40, 10), estimator_orig)
sklearn/utils/random.py:                                          rng.rand(nnz))
doc/developers/develop.rst:            return self.random_state_.randn(n_samples, n_components)
sklearn/cluster/_affinity_propagation.py:          random_state.randn(n_samples, n_samples))
sklearn/datasets/_samples_generator.py:    X[:, :n_informative] = generator.randn(n_samples, n_informative)
sklearn/datasets/_samples_generator.py:        X[:, -n_useless:] = generator.randn(n_samples, n_useless)
sklearn/datasets/_samples_generator.py:        X = generator.randn(n_samples, n_features)
sklearn/datasets/_samples_generator.py:        + 10 * X[:, 3] + 5 * X[:, 4] + noise * generator.randn(n_samples)
sklearn/datasets/_samples_generator.py:        + noise * generator.randn(n_samples)
sklearn/datasets/_samples_generator.py:        + noise * generator.randn(n_samples)
sklearn/datasets/_samples_generator.py:    u, _ = linalg.qr(generator.randn(n_samples, n), mode='economic')
sklearn/datasets/_samples_generator.py:    v, _ = linalg.qr(generator.randn(n_features, n), mode='economic')
sklearn/datasets/_samples_generator.py:    D = generator.randn(n_features, n_components)
sklearn/datasets/_samples_generator.py:        X[idx, i] = generator.randn(n_nonzero_coefs)
sklearn/datasets/_samples_generator.py:    X += noise * generator.randn(3, n_samples)
sklearn/datasets/_samples_generator.py:    X += noise * generator.randn(3, n_samples)
sklearn/decomposition/_dict_learning.py:            dictionary[:, k] = random_state.randn(n_features)
sklearn/decomposition/_nmf.py:        H = avg * rng.randn(n_components, n_features).astype(X.dtype,
sklearn/decomposition/_nmf.py:        W = avg * rng.randn(n_samples, n_components).astype(X.dtype,
sklearn/decomposition/_nmf.py:        W[W == 0] = abs(avg * rng.randn(len(W[W == 0])) / 100)
sklearn/decomposition/_nmf.py:        H[H == 0] = abs(avg * rng.randn(len(H[H == 0])) / 100)
sklearn/feature_selection/_mutual_info.py:        X[:, continuous_mask] += 1e-10 * means * rng.randn(
sklearn/feature_selection/_mutual_info.py:        y += 1e-10 * np.maximum(1, np.mean(np.abs(y))) * rng.randn(n_samples)
sklearn/kernel_ridge.py:    >>> y = rng.randn(n_samples)
sklearn/kernel_ridge.py:    >>> X = rng.randn(n_samples, n_features)
sklearn/linear_model/_ridge.py:    >>> y = rng.randn(n_samples)
sklearn/linear_model/_ridge.py:    >>> X = rng.randn(n_samples, n_features)
sklearn/linear_model/_sag.py:    >>> X = rng.randn(n_samples, n_features)
sklearn/linear_model/_sag.py:    >>> y = rng.randn(n_samples)
sklearn/linear_model/_stochastic_gradient.py:    >>> y = rng.randn(n_samples)
sklearn/linear_model/_stochastic_gradient.py:    >>> X = rng.randn(n_samples, n_features)
sklearn/manifold/_t_sne.py:            X_embedded = 1e-4 * random_state.randn(
sklearn/mixture/_base.py:                mean + rng.randn(sample, n_features) * np.sqrt(covariance)
sklearn/neighbors/_nca.py:                transformation = self.random_state_.randn(n_components,
sklearn/svm/_classes.py:    >>> y = rng.randn(n_samples)
sklearn/svm/_classes.py:    >>> X = rng.randn(n_samples, n_features)
sklearn/svm/_classes.py:    >>> y = np.random.randn(n_samples)
sklearn/svm/_classes.py:    >>> X = np.random.randn(n_samples, n_features)
sklearn/utils/estimator_checks.py:    X = rng.randn(10, 5)
sklearn/utils/estimator_checks.py:    X_test = np.random.randn(20, 2) + 4
doc/developers/develop.rst:        i = random_state.randint(X.shape[0])
doc/getting_started.rst:  >>> from scipy.stats import randint
doc/getting_started.rst:  >>> param_distributions = {'n_estimators': randint(1, 5),
doc/getting_started.rst:  ...                        'max_depth': randint(5, 10)}
doc/modules/grid_search.rst:``uniform`` or ``randint``.
doc/modules/impute.rst:  >>> mask = np.random.randint(0, 2, size=X.shape).astype(np.bool)
sklearn/cluster/_kmeans.py:    center_id = random_state.randint(n_samples)
sklearn/cluster/_kmeans.py:        init_indices = random_state.randint(0, n_samples, init_size)
sklearn/cluster/_kmeans.py:        seeds = random_state.randint(np.iinfo(np.int32).max, size=n_init)
sklearn/cluster/_kmeans.py:        validation_indices = random_state.randint(0, n_samples, init_size)
sklearn/cluster/_kmeans.py:            minibatch_indices = random_state.randint(
sklearn/cluster/_kmeans.py:            random_reassign = self.random_state_.randint(
sklearn/cluster/_spectral.py:        rotation[:, 0] = vectors[random_state.randint(n_samples), :].T
sklearn/datasets/_kddcup99.py:        r = random_state.randint(0, n_samples_abnormal, 3377)
sklearn/datasets/_samples_generator.py:        return np.hstack([rng.randint(2, size=(samples, dimensions - 30)),
sklearn/datasets/_samples_generator.py:        y[flip_mask] = generator.randint(n_classes, size=flip_mask.sum())
sklearn/datasets/_samples_generator.py:            words = generator.randint(n_features, size=n_words)
sklearn/dummy.py:                ret = [classes_[k][rs.randint(n_classes_[k], size=n_samples)]
sklearn/ensemble/_bagging.py:        indices = random_state.randint(0, n_population, n_samples)
sklearn/ensemble/_bagging.py:            random_state.randint(MAX_INT, size=len(self.estimators_))
sklearn/ensemble/_bagging.py:        seeds = random_state.randint(MAX_INT, size=n_more_estimators)
sklearn/ensemble/_base.py:            to_set[key] = random_state.randint(np.iinfo(np.int32).max)
sklearn/ensemble/_forest.py:    sample_indices = random_instance.randint(0, n_samples, n_samples_bootstrap)
sklearn/ensemble/_forest.py:                random_state.randint(MAX_INT, size=len(self.estimators_))
sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:            self._random_seed = rng.randint(np.iinfo(np.uint32).max,
sklearn/feature_extraction/image.py:        i_s = rng.randint(i_h - p_h + 1, size=n_patches)
sklearn/feature_extraction/image.py:        j_s = rng.randint(i_w - p_w + 1, size=n_patches)
sklearn/inspection/_permutation_importance.py:    random_seed = random_state.randint(np.iinfo(np.int32).max + 1)
sklearn/linear_model/_base.py:    seed = rng.randint(1, np.iinfo(np.int32).max)
sklearn/linear_model/_cd_fast.pyx:    cdef UINT32_t rand_r_state_seed = rng.randint(0, RAND_R_MAX)
sklearn/linear_model/_cd_fast.pyx:    cdef UINT32_t rand_r_state_seed = rng.randint(0, RAND_R_MAX)
sklearn/linear_model/_cd_fast.pyx:    cdef UINT32_t rand_r_state_seed = rng.randint(0, RAND_R_MAX)
sklearn/linear_model/_cd_fast.pyx:    cdef UINT32_t rand_r_state_seed = rng.randint(0, RAND_R_MAX)
sklearn/linear_model/_stochastic_gradient.py:    seed = random_state.randint(MAX_INT)
sklearn/linear_model/_stochastic_gradient.py:        seeds = random_state.randint(MAX_INT, size=len(self.classes_))
sklearn/linear_model/_stochastic_gradient.py:        seed = random_state.randint(0, np.iinfo(np.int32).max)
sklearn/manifold/_mds.py:        seeds = random_state.randint(np.iinfo(np.int32).max, size=n_init)
sklearn/model_selection/_search.py:                        params[k] = v[rng.randint(len(v))]
sklearn/naive_bayes.py:    >>> X = rng.randint(5, size=(6, 100))
sklearn/naive_bayes.py:    >>> X = rng.randint(5, size=(6, 100))
sklearn/naive_bayes.py:    >>> X = rng.randint(5, size=(6, 100))
sklearn/naive_bayes.py:    >>> X = rng.randint(5, size=(6, 100))
sklearn/neural_network/_rbm.py:               rng.randint(0, v.shape[1], v.shape[0]))
sklearn/svm/_base.py:        seed = rnd.randint(np.iinfo('i').max)
sklearn/svm/_base.py:        class_weight_, max_iter, rnd.randint(np.iinfo('i').max),
sklearn/svm/_base.py:    # Regarding rnd.randint(..) in the above signature:
sklearn/tree/_splitter.pyx:        self.rand_r_state = self.random_state.randint(0, RAND_R_MAX)
sklearn/utils/__init__.py:            indices = random_state.randint(0, n_samples, size=(max_n_samples,))
sklearn/utils/_random.pyx:            O(O(np.random.randint) * \sum_{i=1}^n_samples 1 /
sklearn/utils/_random.pyx:            <= O(O(np.random.randint) *
sklearn/utils/_random.pyx:            <= O(O(np.random.randint) *
sklearn/utils/_random.pyx:    rng_randint = rng.randint
sklearn/utils/_random.pyx:        j = rng_randint(n_population)
sklearn/utils/_random.pyx:            j = rng_randint(n_population)
sklearn/utils/_random.pyx:    Time complexity: O(n_population +  O(np.random.randint) * n_samples)
sklearn/utils/_random.pyx:    rng_randint = rng.randint
sklearn/utils/_random.pyx:        j = rng_randint(n_population - i)  # invariant: non-selected at [0,n-i)
sklearn/utils/_random.pyx:        O((n_population - n_samples) * O(np.random.randint) + n_samples)
sklearn/utils/_random.pyx:    rng_randint = rng.randint
sklearn/utils/_random.pyx:        j = rng_randint(0, i + 1)
sklearn/utils/estimator_checks.py:    y = rnd.randint(3, size=X.shape[0])
sklearn/utils/estimator_checks.py:        y_ = np.vstack([y, 2 * y + rnd.randint(2, size=len(y))])
sklearn/utils/estimator_checks.py:        y_ = np.vstack([y, 2 * y + rnd.randint(2, size=len(y))])
sklearn/utils/estimator_checks.py:        y = rng.randint(low=0, high=2, size=n_samples)
sklearn/utils/estimator_checks.py:        y = rng.randint(low=0, high=2, size=n_samples)

If we want to support generators, I'm afraid that means we have to start wrapping RandomState and Generator into something that can fall back when randint and the like aren't available?
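A rough sketch of what such a shim could look like (a hypothetical helper, not existing scikit-learn code):

import numpy as np

class _RNGAdapter:
    # hypothetical shim exposing legacy RandomState names on top of either object
    def __init__(self, rng):
        self._rng = rng
        self._legacy = isinstance(rng, np.random.RandomState)

    def randint(self, low, high=None, size=None):
        if self._legacy:
            return self._rng.randint(low, high=high, size=size)
        return self._rng.integers(low, high=high, size=size)

    def rand(self, *shape):
        if self._legacy:
            return self._rng.rand(*shape)
        return self._rng.random(shape or None)

    def __getattr__(self, name):
        # everything else (uniform, choice, permutation, ...) passes through
        return getattr(self._rng, name)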

@lorentzenchr lorentzenchr added this to the 1.0 milestone Nov 15, 2020
@lorentzenchr
Member

I milestoned it for 1.0 as I think #14042 and SLEP011 have a different focus, and I would like to have support for the new numpy random Generators, which have some clear advantages over the old RandomState.

@NicolasHug
Member

NicolasHug commented Nov 15, 2020

Before milestoning it I think we need to collectively agree this is something that we want to support. A 1.0 milestone seems unrealistic to me considering that we aim at releasing 1.0 in place of 0.25/0.26.

As noted above it would not be a trivial change, and our support for RandomState is already very complex and subtle, even for us (see docs, and the new version of SLEP011 which I never had the guts to submit).

I'd be interested in knowing the impact of using Generators on our CV procedures and meta-estimators, for example.

@lorentzenchr
Member

lorentzenchr commented Nov 15, 2020

@NicolasHug I fully agree that we need to agree on this. But I wanted to give it some visibility. I can put it on the agenda of the next dev meeting and/or label it "breaking change" (though I think it does not need to break anything).

One large impact is that Generator will produce different random numbers than RandomState. But it has better randomness properties, is faster in several instances (e.g. choice, see also numpy PR #13812), and is usable with Cython! See here for a high-level overview of the differences/improvements.
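A quick demonstration that the streams differ:

import numpy as np

rs = np.random.RandomState(0)   # legacy MT19937 stream
rng = np.random.default_rng(0)  # PCG64 stream

rs.random_sample(3)  # different numbers than below, even with the same seed
rng.random(3)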

@lorentzenchr
Member

Decision of https://github.com/scikit-learn/administrative/blob/master/meeting_notes/2020-11-30.md: we keep it in the 1.0 milestone for now. This may, however, change.

@albertcthomas
Contributor

The new numpy RNGs can also be easily used for parallel random number generation. This would provide a better alternative when creating seeds in bagging for instance (see here). I do not know if there are other such examples in the scikit-learn codebase.
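For example, the spawning pattern from the numpy parallel-generation docs, sketched for a bagging-style use (the counts here are arbitrary):

import numpy as np

ss = np.random.SeedSequence(12345)
child_seeds = ss.spawn(10)  # e.g. one independent seed per bagging estimator
streams = [np.random.default_rng(s) for s in child_seeds]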

@NicolasHug
Member

That's a nice feature but we can only take advantage of that once we stop supporting RandomState instances. As long as RandomState is supported, we'll have to resort to workarounds like the one in the link you provided.

@albertcthomas
Contributor

albertcthomas commented Dec 14, 2020 via email

@lorentzenchr
Member

Implementing this proposal becomes much easier once numpy 1.17.0, where the new random Generators were introduced, becomes the minimum supported numpy version.

@jamesmyatt
Contributor

jamesmyatt commented Apr 6, 2021

Sorry for the drive-by comment, but the NEP 29 deprecation date for NumPy 1.16 (13-Jan-2021) has now passed. Now, this doesn't mean that scikit-learn should stop supporting RandomState (even if that would be nice). It just means that it should be OK to use the new numpy.random API now without adapters or defensive programming. It's also worth remembering that if this is implemented as a breaking change, then it might be easiest to introduce it in v<=1.0.0, rather than waiting for v2.0.

From reading this and other threads (e.g. #14042, scikit-learn/enhancement_proposals#24, https://scikit-learn.org/stable/common_pitfalls.html#controlling-randomness), it's clear that this is a complicated issue.

@NicolasHug
Member

It just means that it should be OK to use the new numpy.random API now without adapters or defensive programming.

Would you mind elaborating on this? For now I don't understand how NEP29 and the deprecation of numpy 1.16 changes anything to the concerns raised in #16988 (comment). It seems to me that these will be valid concerns for as long as we support RandomState.

@jamesmyatt
Contributor

jamesmyatt commented Apr 11, 2021

It just means that it should be OK to use the new numpy.random API now without adapters or defensive programming.

Would you mind elaborating on this? For now I don't understand how NEP29 and the deprecation of numpy 1.16 changes anything to the concerns raised in #16988 (comment). It seems to me that these will be valid concerns for as long as we support RandomState.

NEP 29 says:

When a project releases a new major or minor version, we recommend that they support at least all minor versions of Python introduced and released in the prior 42 months from the anticipated release date with a minimum of 2 minor versions of Python, and all minor versions of NumPy released in the prior 24 months from the anticipated release date with a minimum of 3 minor versions of NumPy.

i.e. there should be no obligation to support NumPy 1.16 in any major or minor release after Jan 13, 2021.

The main thing is that, if you bump the minimum version of NumPy to 1.17, then you can write things like: if isinstance(random_state, np.random.BitGenerator) without first checking if np.random.BitGenerator exists.

Also, since NumPy 1.17, the RandomState init has permitted a BitGenerator as an input. So the simplest workaround for handling Generator inputs to these functions is to extract the associated bit generator and re-wrap it as a RandomState: np.random.RandomState(rng.bit_generator). This will use the new bit generators along with the legacy API and non-uniform algorithms. However, doing this will not give you many of the main advantages of the new random API. A better long-term solution might be to extract the bit generator from the RandomState input and re-wrap it as a Generator and use the newer, faster, cleaner, more powerful API everywhere.
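Concretely (note that the two objects then share one PCG64 state, so their draws interleave):

import numpy as np

rng = np.random.default_rng(0)                     # user passes a Generator
legacy = np.random.RandomState(rng.bit_generator)  # legacy API, same bit stream

legacy.randint(10, size=5)  # old method names keep working
rng.integers(10, size=5)    # advances the same shared PCG64 state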

As for #16988 (comment) specifically, I think most of those are cases of methods having been given better names in the new API than in the legacy one (see https://numpy.org/doc/stable/reference/random/index.html#quick-start). I don't think there's any guarantee that two methods with the same name will give the same random number streams, since many of the algorithms used to convert the random bits to random numbers have been improved (e.g. more efficient implementations), or even that they have exactly the same arguments. The exception is the state-related ones, which are even more complicated (e.g. https://numpy.org/doc/stable/reference/random/generated/numpy.random.RandomState.get_state.html).

I hope this is clear; I know it's brief and the topic is complex.

@rkern

rkern commented Aug 4, 2021

FYI, rand, randn, and random_sample should all be considered disrecommended variants and aliases, and you should use the preferred methods on RandomState that also exist on Generator: random() and standard_normal().

For randint(), we use a utility function in scipy to wrap around either a RandomState or Generator. Part of the improvement of Generator.integers() was the semantics of processing the arguments, so there wasn't a great way to retain randint(). The wrapper function uses those new arguments, so there is a tiny bit of thought to be applied when making that replacement.

I don't think you use any of the others.
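For illustration, a sketch of such a wrapper (not scipy's actual code; the endpoint handling is the "tiny bit of thought" mentioned above):

import numpy as np

def rng_integers(gen, low, high=None, size=None, endpoint=False):
    # works with either a Generator or a legacy RandomState
    if isinstance(gen, np.random.Generator):
        return gen.integers(low, high=high, size=size, endpoint=endpoint)
    if endpoint:
        # randint's upper bound is exclusive, so shift by one to include it
        if high is None:
            low = low + 1   # integers(low, endpoint=True) samples [0, low]
        else:
            high = high + 1
    return gen.randint(low, high=high, size=size)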

@grisaitis
Author

grisaitis commented Aug 11, 2021

@rkern Thanks for that summary of the API changes.

Regarding the first group of methods...

FYI, rand, randn, and random_sample should all be considered disrecommended variants and aliases, and you should use the preferred methods on RandomState that also exist on Generator: random() and standard_normal().

Would a good first step (PR) be to refactor the calls to rand, etc, to the new random and standard_normal? This seems pretty doable. The arguments look only slightly different, and the new Generator methods are also available on RandomState. So code would be backward compatible with RandomState.
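For example (the new names take a shape tuple instead of separate dimension arguments, and are available on both RandomState and Generator in numpy >= 1.17):

import numpy as np

rng = np.random.default_rng(0)   # np.random.RandomState(0) also works here

rng.random((3, 4))               # replaces rand(3, 4)
rng.standard_normal((3, 4))      # replaces randn(3, 4)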

Regarding integers / randint... Not sure the best way forward on that. Any suggestions @NicolasHug @lorentzenchr ?

Edit: it looks like randint is only used for k-means and spectral clustering in the core sklearn codebase, and then in lots of examples and benchmarks. If I'm not mistaken, refactoring that might not be so bad as far as core sklearn is concerned.

@adrinjalali
Member

Doesn't seem like this is gonna be resolved before the release. Removing the milestone. Please re-tag it for an appropriate next release if necessary, since I think you're more aware of the progress on this topic than I am.

@lorentzenchr
Member

As a cross reference, see the discussion in scipy/scipy#14322 (comment) and the following comments.

@amueller
Member

@rkern could you maybe either point towards some documentation or explain how the current design helps with the usability questions we describe in https://scikit-learn.org/stable/common_pitfalls.html#controlling-randomness and that I mentioned in the other thread?

@betatim
Member

betatim commented Apr 21, 2023

One document I think is relevant and a good read(!) is https://numpy.org/devdocs/reference/random/parallel.html

@rkern

rkern commented Apr 21, 2023

Let me summarize my simplified understanding of the main issue. I know there are more issues, particularly the internal use of parallelism, but let me focus on what I took as the main issue from that document. For most of scikit-learn's classes, you do want all random draws used when calling their methods to be independent, so holding a PRNG instance and drawing from it in sequence works fine. But there are a few exceptions where you do want certain operations on certain objects to repeat their results. In particular, when comparing multiple models using the scores over cross-validation splits, splitters should do the same splits when given the same data.

This special case is an inherent complexity. However, when you only have the relatively inflexible primitives of integer seeds and RandomState, this inherent complexity got amplified and placed on the user. Because the main pattern of passing around a single RandomState doesn't work for CV, you have to have this big document to explain how to use a different pattern in that one special case. The user has to learn this complex mental model because RandomState did not give you, as the API designers, the tools to hide that complexity from them.

With the Generator infrastructure, we have more flexible and capable primitives for you to work with. This will allow you, as implementors, to design the API and implementation of your objects so you can make the choice which objects need to internally repeat their results and which ones should be independent with each call. Or if you do want to expose that choice to the user, you can make it an explicit option (i.e. just a bool argument) rather than an implicit consequence of different styles of PRNG state management. With the Generator infrastructure, you can tell users to create one Generator using np.random.default_rng(seed) and pass that around to all objects. You can use the capabilities of that infrastructure to handle the complexities internally. Users may still have to know that CV splitters use randomness differently (the inherent complexity), but they won't have to change whole patterns of data flow to do that.

Lengthy explanation

The core enabling technology for the Generator infrastructure is SeedSequence. We use this object as the way to convert integer seeds from the user to the initial states for the PRNG algorithm. It takes arbitrary amounts of integer data (one arbitrarily large integer or a sequence of them) and hashes them together carefully to create high quality states for the PRNGs. This will be true for instantiating multiple Generators with closely related seeds like int_seed, [int_seed, 0], [int_seed, 1], etc. (remember that sequences of integers are allowed). They will each create PRNG states that will be usefully independent from one another (within the limits of pseudo-randomness in the first place). This algorithm enabled us to implement a pattern that we called SeedSequence spawning to deterministically create multiple usefully-independent PRNGs for parallel use. I'm going to talk about SeedSequence here so you know about the internals and how it all works, but your users won't have to work with it. We have hidden everything under Generator, so all the users have to do is grab one with default_rng(seed) and pass it around.

default_rng(seed) will take the seed and create a SeedSequence and pass that to the BitGenerator (which is the core uniform PRNG algorithm) and wraps that up in a Generator (which implements all of the fancy non-uniform distributions and is what the user works with). The BitGenerator uses the SeedSequence to get a good PRNG state to start with, but keeps a reference to the SeedSequence for later. So we can reach down to the SeedSequence, spawn new SeedSequences from it, then wrap those up in the BitGenerator and Generator. In the upcoming 1.25 release, we have exposed a Generator.spawn() method that will just do all of that. But you can use a utility function that will do the same until you can depend on 1.25. From here on, I'll just talk about "spawning Generators" for simplicity.
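Until 1.25 can be required, a small utility along these lines does the same (caveat: it reaches into a private attribute of the BitGenerator):

import numpy as np

def spawn_generators(rng, n):
    if hasattr(rng, "spawn"):     # Generator.spawn(), numpy >= 1.25
        return rng.spawn(n)
    # fallback: pull out the SeedSequence (private attribute) and spawn it
    seed_seq = rng.bit_generator._seed_seq
    return [np.random.Generator(type(rng.bit_generator)(child))
            for child in seed_seq.spawn(n)]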

Whenever you get a Generator, you can spawn from it. The internal SeedSequence essentially is building a unique path of the splits it took down its spawn tree and hashing that together to get the (almost certainly) unique internal state. So you can spawn a Generator from a spawned Generator without worry. This means you don't have to do any manual coordination of seeds and can just reason about your code locally. You can probably see how this all can be used for actual multiprocessing parallelism, but we can get into that later. Now we want to talk about how to use this to make CV splits repeatable over multiple calls to cv.split(X, y), and that's less obvious.

In general, it is dicey to clone RandomState or Generator objects. Except in very special circumstances, you run the risk of having multiple correlated streams of pseudo-randomness when you did not want them. But in the case of cv.split(), you do want that correlation. The main principle to remember is "Only clone it if you own it." That is, only if you own the entire Generator's lifecycle from creation and all of its method calls, then it can be safe to clone it (i.e. copy.deepcopy(rng)) and reuse it to achieve those repeatable effects. With RandomState, it was hard to really own it in that way. If you were passed in a RandomState, there was no good way to get a fresh one deterministically that you could own and cloning the passed-in RandomState violated the principle. The recommendation to pass in an integer seed to the CV APIs was the only way to do this. When you built a RandomState from a seed, you owned it and could reset to that initial state over and over again when you needed to (cloning in effect).

With Generator spawning, you now can take a passed-in Generator, spawn a Generator that you can actually own, and clone it to keep in safe storage for reuse later. We are using this pattern in SciPy. We have quasi-Monte Carlo generators that use pseudo-random scrambling. We want to be able to reset them to their initial state. We coerce the input to a Generator (i.e. rng = default_rng(rng), so no matter if it's None, an integer seed, or a live Generator, we get a Generator), we spawn from it a Generator that we own, then keep a static clone of our owned, spawned Generator in storage. We never call any methods on the stored static copy; it always remains in that original state. Whenever we want to start over (e.g. at the start of cv.split()), we clone a working copy from the stored static copy and use the working copy.
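In code, the pattern looks roughly like this (class and method names are hypothetical; Generator.spawn() needs numpy >= 1.25, or the fallback sketched above):

import copy
import numpy as np

class RepeatableSplitter:
    def __init__(self, random_state=None):
        rng = np.random.default_rng(random_state)  # coerce None/int/Generator
        # spawn a Generator we own; never call methods on this stored copy
        self._frozen = rng.spawn(1)[0]

    def split(self, X):
        rng = copy.deepcopy(self._frozen)  # fresh working copy, same state
        return rng.permutation(len(X))     # hence identical results per call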

All of this is information that the user doesn't need to know. They might need to know that cv.split(X, y) will always return the same results where other objects don't behave that way. That's the inherent complexity that you get a benefit from. But they don't have to know anything about it to correctly construct a CV splitter. The rule is the same for all objects. As the implementors, you get to make the decision.

@lorentzenchr
Member

For cross ref: (draft) SPEC 7 scientific-python/specs#180 mentions scikit-learn.

stefanv added a commit to stefanv/scikit-image that referenced this issue May 19, 2023
Otherwise, the user can still draw values from the RNG and change its
state.

See scikit-learn/scikit-learn#16988 (comment)
jarrodmillman pushed a commit to scikit-image/scikit-image that referenced this issue May 22, 2023
Otherwise, the user can still draw values from the RNG and change its
state.

See scikit-learn/scikit-learn#16988 (comment)