
Add recipe for a version of random() with a larger population #22664


Merged
merged 5 commits on Oct 13, 2020

Conversation

rhettinger
Contributor

No description provided.

@rhettinger
Contributor Author

I've asked Allen Downey to take a look at this as well.

@AllenDowney

The code looks good. I put some tests in this Jupyter notebook:

https://colab.research.google.com/github/AllenDowney/ExercisesInC/blob/master/examples/full_random.ipynb

It passes a visual test that the distribution is uniform from 0 to 1.

I also tried out an implementation closer to what's in the paper. Both work, but Raymond's is a bit faster.

One question: in the last line, why not use ldexp?

@AllenDowney

Although it occurs to me that random could get most of the benefit just by generating more bits.

I see that it generates 56 bits and then shifts 3 of them away. Why not use them all?

BPF = 56
RECIP_BPF = 2 ** -BPF

def random(self):
    """Get the next random number in the range [0.0, 1.0)."""
    return int.from_bytes(_urandom(7), 'big') * RECIP_BPF

@tim-one left a comment (Member)

Same comment about possible bias in subnormal results as in the ldexp() version I looked at offline.

A "geometric" explanation may be intuitively helpful: this is like throwing a dart at random at [0, 1), then picking the closest representable float at or to the left. The one in the paper is like throwing the dart, but picking the closest representable float in either direction, which leaves an exact power of 2 less likely to be picked than any other float in its binade, but more likely to be picked than any float in the binade preceding it.

Either of those is justifiable, but I prefer what this code does, because it's easier to explain (nothing special about an exact power of 2).

About the slight bias in denorm cases, I really don't care - but you might 😉 .

while not x:
    x = getrandbits(32)
    exponent += x.bit_length() - 32
return mantissa * 2.0 ** exponent
Member

Multiplication rounds, so when this slobbers into the denorm range, nearest/even rounding will give a slight bias toward 0 in the last retained bit. ldexp() on Windows truncates instead, which doesn't introduce bias in the denorm cases; but I believe ldexp() on most other platforms does round.
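The subnormal rounding being described can be checked numerically (a small sketch of my own, not from the PR; it assumes IEEE-754 binary64 floats):

```python
import math

# In the normal range, a 53-bit mantissa times a power of two is exact.
# A subnormal result has fewer than 53 mantissa bits available, so the
# low-order bits must be dropped (by rounding or truncation).
m = (1 << 52) | 1                               # 53 significant bits, low bit set
assert math.ldexp(m, -100) == m * 2.0 ** -100   # normal range: exact
tiny = math.ldexp(m, -1080)                     # result lands in the subnormals
assert tiny == math.ldexp(1, -1028)             # the set low-order bit was dropped
```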


+1 for this use of "slobber"

@tim-one
Member

tim-one commented Oct 12, 2020

I see that it generates 56 bits and then shifts 3 of them away. Why not use them all?

That's the SystemRandom class, which isn't used much in this context - very few people want to endure the expense of using a "crypto strength" random source to generate doubles. (The random() almost everyone actually uses is _random_Random_random_impl() in _randommodule.c, which uses the method shipped with the Mersenne Twister source code, combining parts of 2 32-bit random integers.)

The point to the shift is so that the floating multiplication is exact. Rounding would introduce numerical complications (e.g., most obviously, 1.0 would suddenly become a possible output; less obviously, the default to-nearest/even rounding could introduce bias in the last retained bit).
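The exactness claim is easy to verify (my own sketch, not from the thread):

```python
# An int with more than 53 significant bits may not convert to float
# exactly; after shifting down to at most 53 bits, conversion is exact.
n56 = (1 << 55) | 1            # a 56-bit int with its low bit set
assert float(n56) != n56       # conversion had to round the low bit away

n53 = n56 >> 3                 # drop 3 bits, leaving at most 53
assert float(n53) == n53       # now the conversion is exact
```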

@tim-one
Member

tim-one commented Oct 12, 2020

The point to the shift is so that the floating multiplication is exact.

Sorry, that was sloppy. The point to the shift is so that the conversion from int to float is exact. The multiplication is exact regardless (we're nowhere near the subnormal range, nor near overflow). But avoiding rounding remains the motivation.

@rhettinger
Contributor Author

rhettinger commented Oct 12, 2020

Okay, I switched back to using ldexp(). Ideally, the recipe should handle subnormals as well as possible even though they are highly improbable (one in three googols).

The reason for the mantissa * 2.0 ** exponent is that I thought it would be more clear to a reader when relating it back to the preceding algorithmic explanation. (Unlike library code, recipes are primarily intended to be read and understood.) If I put this in a lightning talk, I would likely go back to the mantissa * 2.0 ** exponent variant. To me, that variant feels more mathematical and less computer-sciency, making it easier to explain.

I also considered using floor(log2(x)) instead of x.bit_length() but preferred to stay in the domain of ints until the last step. And in this case, the Python method name bit_length is clearer than the mathy version (which is likely only suitable for a post on math exchange or in a paper).

@tim-one
Member

tim-one commented Oct 12, 2020

Any worries I had about subnormals and a slight bias for an exact 0.0 were tempered by the realization that these were highly improbable (one in three googols).

Actually, under the "closest representable float <= a truly random real in [0, 1)" view, 0 will be delivered with the right probability, provided ldexp() truncates (as it does on Windows). And it's not one in three googols, it's more like one in a googol cubed (10**300). Given that, I really have only a slight preference for ldexp() over * 2.0 ** whatever, but for a reason that hasn't been mentioned: there's no cross-platform guarantee that 2.0 ** n will return the mathematical 2**n - pow() is one of the hardest of the "basic" transcendental functions to implement with a "strictly less than 1 ulp" worst-case error bound, and in the bad old days I routinely bumped into libm pows that got the last dozen bits wrong.

@rhettinger
Contributor Author

Fair enough. ldexp() is specialized to assemble a float from a mantissa and an integer exponent. So, if it is well implemented, it should do at least as well as, and possibly better than, any other way of doing it.

@rhettinger
Contributor Author

FWIW, here is the test code I've been using:

from statistics import mean, stdev, quantiles
from collections import Counter
from pprint import pp
from math import sqrt, log2

data = [full_random() for i in range(1_000_000)]
print(f'{mean(data)=}  {stdev(data)=} compare with {sqrt(1/12)=}')
print(min(data), quantiles(data), max(data))

pp(sorted(Counter(log2(x.as_integer_ratio()[1]) for x in data).items()))

That gives this output:

mean(data)=0.49961382792812314  stdev(data)=0.2888614304304414 compare with sqrt(1/12)=0.28867513459481287
1.911638086796513e-06 [0.24927708964875345, 0.4994823867863665, 0.7502134596642684] 0.9999995162312929
[(31.0, 1),
 (35.0, 1),
 (36.0, 1),
 (37.0, 6),
 (38.0, 10),
 (39.0, 17),
 (40.0, 35),
 (41.0, 77),
 (42.0, 147),
 (43.0, 342),
 (44.0, 666),
 (45.0, 1234),
 (46.0, 2648),
 (47.0, 5206),
 (48.0, 10414),
 (49.0, 20871),
 (50.0, 41495),
 (51.0, 83460),
 (52.0, 166489),
 (53.0, 333701),
 (54.0, 166237),
 (55.0, 83324),
 (56.0, 41743),
 (57.0, 20975),
 (58.0, 10494),
 (59.0, 5240),
 (60.0, 2574),
 (61.0, 1302),
 (62.0, 650),
 (63.0, 303),
 (64.0, 167),
 (65.0, 80),
 (66.0, 41),
 (67.0, 23),
 (68.0, 15),
 (69.0, 7),
 (70.0, 3),
 (71.0, 1)]

@rhettinger
Contributor Author

Any comments on the wording of the comment, docstring, or introductory paragraphs?

I haven't yet had a chance to test its intelligibility on my students.

@AllenDowney

AllenDowney commented Oct 12, 2020

About the docstring: it took me a few passes to get it. It is explained in terms of the denominator of a rational number, which makes sense.

I think of it differently, in terms of equally-spaced points on the number line being mapped to floating-point values, which are not equally spaced. So some of them get chosen many times and some (in fact, a large majority) never get chosen at all.

I think your way of explaining it is fine -- it's just not the way I thought of it.

It's probably best to test it on an audience that's not me.

@tim-one
Member

tim-one commented Oct 12, 2020

I also considered using floor(log2(x))

That would be a Bad Idea. log2() is not a primitive in some platform libms, and we emulate it then with log2(m * 2**e) = log(m) / log(2) + e. That's numerically unstable in some areas due to cancellation, so we use a different expression there. Regardless, there are 3 distinct sources of rounding error, and it's quite possible to get a float a tiny bit less than a mathematically exact integer result.

There's never a reason to "apologize" for using exact integer operations whenever possible 😉
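The difference is easy to illustrate (a sketch of my own; the exact failure mode of floor(log2(x)) is platform-dependent, so only the integer path is asserted):

```python
from math import floor, log2

# bit_length() is exact integer arithmetic: for x > 0 it equals
# floor(log2(x)) + 1 mathematically. The log2() route goes through
# floats, where the emulation log(m)/log(2) + e can land a hair below
# an exact integer, and floor() then comes out one too small.
x = 10 ** 300
assert x.bit_length() == 997    # exact by construction
approx = floor(log2(x)) + 1     # usually 997 as well, but there is no guarantee
```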

@tim-one
Member

tim-one commented Oct 12, 2020

Any comments on the wording of the comment, docstring, or introductory paragraphs?

It all depends on how well the reader understands the basics of floating point representations. Nobody comes to that with useful intuitions - they have to unlearn lots of what they think they "know".

I already gave a visual metaphor: throwing a dart uniformly across the real [0, 1) clopen interval. If they understand that representable floats are unevenly spaced, and how, then "move to the closest one <=" should be extremely easy to picture. But if they don't understand how representable floats are distributed, you're going to need at least a page of explanation with several diagrams.

@tim-one
Member

tim-one commented Oct 12, 2020

Depending on the reader, they may find this easiest to grasp. The default Random.random() returns

randrange(2**53) / 2**53

where there's no ambiguity because all operations are exact in float arithmetic.

full_random() returns the closest representable float less than or equal to the mathematical

randrange(2**1074) / 2**1074

Note that 1 / 2**1074 is the smallest non-zero positive representable float. If they understand the notation, this makes it crystal clear that it's "as uniform as possible".
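Those claims can be sanity-checked directly (my own sketch; assumes IEEE-754 binary64 floats):

```python
import math
import sys

# 1 / 2**1074 is the smallest positive representable float (the
# smallest subnormal); anything smaller rounds away to 0.0.
assert math.ldexp(1, -1074) == 5e-324
assert math.ldexp(1, -1075) == 0.0
# Equivalently: the smallest normal float, sys.float_info.min,
# divided by 2**52.
assert sys.float_info.min * 2 ** -52 == math.ldexp(1, -1074)
```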

@tim-one
Member

tim-one commented Oct 12, 2020

I doubt this is worth adding, but since I already wrote it ... it was a sanity check on the randrange(2**1074) explanation. The implementation of full_random() could be sold as an optimization of this. Unfortunately, avoiding all rounding makes it more complicated than the 1-liner I originally had in my head 😉:

EDIT: replaced the code with a more compact, more uniform, branch-free work-a-like. Now it's at least close to what was in my head 😉.

If there's any confusion about why this works, the key is in something that's obvious, but perhaps only in hindsight: every finite IEEE-754 double is, mathematically, an integer multiple of 2**-1074. So range(2**1074) is an equally spaced range of numerators such that, when divided by 2**1074, contains every representable double in the clopen real range [0.0, 1.0).

from random import getrandbits
from math import ldexp

def slow_full_random():
    m = getrandbits(1074)
    # Conceptually, we want truncating ldexp(m, -1074), but we don't
    # want any rounding anywhere. We need to cut `m` back to at
    # most 53 significant bits so conversion to float is exact.
    excess = max(m.bit_length() - 53, 0)
    return ldexp(m >> excess, excess - 1074)

@rhettinger
Contributor Author

Now I feel vindicated for my early draft designed to fit in a tweet ;-)

https://twitter.com/raymondh/status/1314995894492692481

@rhettinger
Contributor Author

rhettinger commented Oct 13, 2020

Draft text to introduce the recipe:

The default random() returns multiples of 2⁻⁵³ in the
range 0.0 ≤ x < 1.0. All such numbers are evenly spaced
and exactly representable as Python floats.

However, many floats in that interval are not possible selections.
For example, 0.05954861408025609 isn't an integer multiple of 2⁻⁵³.

The following recipe takes a different approach. All floats in
the interval are possible selections. Conceptually the way it works
is by choosing from evenly spaced multiples of 2⁻¹⁰⁷⁴ and then rounding
down to the nearest representable float.

For efficiency, the actual mechanics involve calling math.ldexp
to construct a representable float from a mantissa and exponent.
The mantissa is chosen from a uniform distribution of integers in the range
2⁵² ≤ mantissa < 2⁵³. The exponent is chosen from a geometric
distribution where exponents smaller than -53 occur half as often
as the next larger exponent.

from random import getrandbits
from math import ldexp

def full_random():
    ''' Uniform distribution from all possible floats
        in the interval 0.0 <= X < 1.0.
    '''
    mantissa = 0x10_0000_0000_0000 | getrandbits(52)
    exponent = -53
    x = 0
    while not x:
        x = getrandbits(32)
        exponent += x.bit_length() - 32
    return ldexp(mantissa, exponent)
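A quick smoke check of the recipe (the function is repeated so the snippet runs on its own; the checks are my choices, not from the PR):

```python
from random import getrandbits
from math import ldexp

def full_random():
    # same recipe as above, repeated so this check is self-contained
    mantissa = 0x10_0000_0000_0000 | getrandbits(52)
    exponent = -53
    x = 0
    while not x:
        x = getrandbits(32)
        exponent += x.bit_length() - 32
    return ldexp(mantissa, exponent)

# All outputs lie in [0.0, 1.0), and denominators larger than 2**53
# (impossible for the stock random()) show up routinely.
xs = [full_random() for _ in range(10_000)]
assert all(0.0 <= x < 1.0 for x in xs)
assert any(x.as_integer_ratio()[1] > 2 ** 53 for x in xs)
```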

@rhettinger rhettinger merged commit 8b2ff4c into python:master Oct 13, 2020
@miss-islington
Contributor

Thanks @rhettinger for the PR 🌮🎉.. I'm working now to backport this PR to: 3.9.
🐍🍒⛏🤖

@miss-islington
Contributor

Sorry @rhettinger, I had trouble checking out the 3.9 backport branch.
Please backport using cherry_picker on command line.
cherry_picker 8b2ff4c03d150c43df3e8438d323b7f7bfe3353c 3.9

@rhettinger rhettinger added and then removed the needs backport to 3.9 label Oct 13, 2020

@bedevere-bot bedevere-bot removed the needs backport to 3.9 label Oct 13, 2020
@bedevere-bot

GH-22684 is a backport of this pull request to the 3.9 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Oct 13, 2020
xzy3 pushed a commit to xzy3/cpython that referenced this pull request Oct 18, 2020
adorilson pushed a commit to adorilson/cpython that referenced this pull request Mar 13, 2021
Labels
docs Documentation in the Doc dir skip issue skip news