Support numpy.random.Generator and/or BitGenerator for random number generation #16988
Thanks @grisaitis, I agree this would be useful, and it's part of the larger discussion on the `random_state` API in #14042.
These are the attributes/methods that aren't supported in `Generator`:

```python
rd = np.random.RandomState(0)
gen = default_rng(0)
set(dir(rd)) - set(dir(gen))
# {'get_state', 'rand', 'randint', 'randn', 'random_integers',
#  'random_sample', 'seed', 'set_state', 'tomaxint'}
```

and this is where we use (some of) them in the API:
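For illustration (my sketch, not part of the comment above), most of the missing names have direct, renamed equivalents on `Generator`:

```python
import numpy as np

rd = np.random.RandomState(0)
gen = np.random.default_rng(0)

# Legacy RandomState call        vs. Generator spelling (note: only the
# names/shapes correspond -- the two value streams are NOT identical)
a1 = rd.rand(3);                 a2 = gen.random(3)
b1 = rd.randn(3);                b2 = gen.standard_normal(3)
c1 = rd.randint(0, 10, size=3);  c2 = gen.integers(0, 10, size=3)
d1 = rd.random_sample(3);        d2 = gen.random(3)

# get_state/set_state have no same-named analogue; Generator instead
# exposes the underlying bit generator's state as a plain dict:
snapshot = gen.bit_generator.state
```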
If we want to support generators, I'm afraid that means we have to start wrapping
Before milestoning it, I think we need to collectively agree that this is something we want to support. A 1.0 milestone seems unrealistic to me, considering that we aim at releasing 1.0 in place of 0.25/0.26. As noted above it would not be a trivial change, and our support for

I'd be interested in knowing the impact of using Generators on our CV procedures and meta-estimators, for example.
@NicolasHug I fully agree that we need to agree on this. But I wanted to give it some visibility. I can put it on the agenda of the next dev meeting and/or label it "breaking change" (though I think it does not need to break anything). One large impact is that
Decision of https://github.com/scikit-learn/administrative/blob/master/meeting_notes/2020-11-30.md: we keep it in the 1.0 milestone for now. This may, however, change.
The new numpy RNGs can also be easily used for parallel random number generation. This would provide a better alternative when creating seeds in bagging, for instance (see here). I do not know if there are other such examples in the scikit-learn codebase.
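As a sketch of that alternative (my illustration; the actual bagging code is at the link above), `SeedSequence.spawn` produces statistically independent child seeds without drawing integer seeds from a parent RNG:

```python
import numpy as np

# Spawn independent child seeds, one per worker/estimator, instead of
# drawing random ints from a parent RNG to seed the children.
ss = np.random.SeedSequence(12345)
child_seeds = ss.spawn(4)
streams = [np.random.default_rng(s) for s in child_seeds]

# Each stream can now be used safely in a separate process or thread.
draws = [rng.random(2) for rng in streams]
```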
That's a nice feature but we can only take advantage of that once we stop supporting `RandomState` instances. As long as `RandomState` is supported, we'll have to resort to workarounds like the one in the link you provided.
Indeed
An implementation of this proposal becomes much easier when numpy 1.17.0, where the new random Generators were introduced, becomes the minimum numpy version.
Sorry for the drive-by comment, but the NEP 29 deprecation date for NumPy 1.16 (13-Jan-2021) has now passed. Now, this doesn't mean that scikit-learn should stop supporting RandomState (even if that would be nice). It just means that it should be OK to use the new numpy.random API now without adapters or defensive programming. It's also worth remembering that if this is implemented as a breaking change, then it might be easiest to introduce it in v<=1.0.0, rather than waiting for v2.0. From reading this and other threads (e.g. #14042, scikit-learn/enhancement_proposals#24, https://scikit-learn.org/stable/common_pitfalls.html#controlling-randomness), it's clear that this is a complicated issue.
Would you mind elaborating on this? For now I don't understand how NEP 29 and the deprecation of numpy 1.16 change anything about the concerns raised in #16988 (comment). It seems to me that these will be valid concerns for as long as we support `RandomState`.
NEP 29 says:
i.e. there should be no obligation to support NumPy 1.16 in any major or minor release after Jan 13, 2021. The main thing is that, if you bump the minimum version of NumPy to 1.17, then you can write things like: Also, since NumPy 1.17, the As for #16988 (comment) specifically, I think that most of those are issues of methods having given better names in the new API than the legacy one (see https://numpy.org/doc/stable/reference/random/index.html#quick-start). I don't think there's any guarantee that two methods with the same name will give the same random number streams, since the many of the algorithms used to convert the random bits to random numbers have been improved, e.g. more efficient implementation, or even that they have exactly the same arguments. The exception is the state related ones, which is even more complicated (e.g. https://numpy.org/doc/stable/reference/random/generated/numpy.random.RandomState.get_state.html). I hope this is clear; I know it's brief and the topic is complex. |
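The code snippets in the comment above were lost in extraction; as an illustration of the post-1.17 idiom (my sketch, not the original snippets), `default_rng` already normalizes the seed-like inputs an updated `check_random_state` would need to handle:

```python
import numpy as np

# default_rng accepts several seed-like input types and returns a Generator:
g1 = np.random.default_rng(0)                          # int seed
g2 = np.random.default_rng(np.random.SeedSequence(0))  # SeedSequence
g3 = np.random.default_rng(np.random.PCG64(0))         # BitGenerator
g4 = np.random.default_rng(g1)                         # existing Generator
g5 = np.random.default_rng()                           # fresh OS entropy
```

An existing `Generator` is returned unchanged (`g4 is g1`), which mirrors the pass-through behavior `check_random_state` currently gives `RandomState` instances.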
FYI, For

I don't think you use any of the others.
@rkern Thanks for that summary of the API changes. Regarding the first group of methods...
Would a good first step (PR) be to refactor the calls to

Regarding

Edit - it looks like
Doesn't seem like this is gonna be resolved before the release, so I'm removing the milestone. Please re-tag it for an appropriate next release if necessary, since I think you're more aware of the progress on this topic than I am.
As a cross-reference, see the discussion in scipy/scipy#14322 (comment) and the following comments.
@rkern could you maybe either point towards some documentation, or explain how the current design helps with the usability questions we describe in https://scikit-learn.org/stable/common_pitfalls.html#controlling-randomness and that I mentioned in the other thread?
One document I think is relevant and a good read(!) is https://numpy.org/devdocs/reference/random/parallel.html
Let me summarize my simplified understanding of the main issue. I know there are more issues, particularly the internal use of parallelism, but let me focus on what I took as the main issue from that document.

For most of scikit-learn's classes, you do want all random draws used when calling their methods to be independent, so holding a PRNG instance and drawing from it in sequence works fine. But there are a few exceptions where you do want certain operations on certain objects to repeat their results. In particular, when comparing multiple models using the scores over cross-validation splits, splitters should do the same splits when given the same data. This special case is an inherent complexity. However, when you only have the relatively inflexible primitives of integer seeds and

With the

Lengthy explanation

The core enabling technology for the
Whenever you get a

In general, it is dicey to clone

With

All of this is information that the user doesn't need to know. They might need to know that
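To make the replay mechanics concrete (my illustration, not code from the comment above), a `Generator`'s position in its stream can be snapshotted and restored through its bit generator, which is one way to make an operation repeat its results:

```python
import numpy as np

gen = np.random.default_rng(42)
snapshot = gen.bit_generator.state  # a plain dict describing the PRNG state

x = gen.random(5)
gen.bit_generator.state = snapshot  # rewind the generator to the snapshot
y = gen.random(5)                   # replays exactly the same values
```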
For cross ref: (draft) SPEC 7 scientific-python/specs#180 mentions scikit-learn.
Otherwise, the user can still draw values from the RNG and change its state. See scikit-learn/scikit-learn#16988 (comment)
Describe the workflow you want to enable

I'd like to use a `Generator` or `BitGenerator` with scikit-learn where I'd otherwise use `RandomState` or a seed `int`. For example:

and then use this for `random_state=` in scikit-learn:

This fails because these methods expect a `RandomState` object or `int` seed value. The specific trigger is `check_random_state(random_state)`.

Describe your proposed solution

This would require:

- `Generator` or `BitGenerator` as acceptable values for `random_state=..` in every function and class constructor that accepts `random_state`
- `check_random_state()` to allow `Generator` and/or `BitGenerator` objects
- `Generator` or `BitGenerator` working with classes or functions that consume `random_state` (similar to seed `int` or `RandomState` objects already)
- handling the `RandomState` methods that aren't available with `Generator` (e.g. `rand`, `randn`, see )
- `Generator` instead of `RandomState` by default, when a seed `int` is given

Describe alternatives you've considered, if relevant

The scope could include either or both of `BitGenerator` or `Generator`. It might be easiest to allow only `BitGenerator`, and not `Generator` (`seed` int value). A `BitGenerator` can be given to `RandomState`, and I think it then produces the same values as `Generator`.

Additional context

NumPy v1.17 added the `numpy.random.Generator` (docs) interface for random number generation. Overview:

- `Generator` is similar to `RandomState`, but enables different PRNG algorithms
- `BitGenerator` (docs) encapsulates the PRNG and seed value, e.g. `PCG64(seed=0)`
- `RandomState` "is considered frozen" and uses "the slow Mersenne Twister" by default (docs)
- `RandomState` can work with non-Mersenne `BitGenerator` objects

The API for `Generator` and `BitGenerator` looks like:
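The original code sample here did not survive extraction; as a hedged reconstruction of what such a snippet typically looks like (my sketch, not the issue author's exact code):

```python
import numpy as np
from numpy.random import Generator, PCG64

# A BitGenerator bundles the PRNG algorithm with its seed...
bg = PCG64(seed=0)

# ...and a Generator wraps it to provide the user-facing distributions.
gen = Generator(bg)
r = gen.random(3)
ints = gen.integers(0, 10, size=3)

# Equivalent one-liner (PCG64 is the default bit generator):
gen2 = np.random.default_rng(0)
```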