
Add recipe for a version of random() with a larger population #22664


Merged
merged 5 commits on Oct 13, 2020

Conversation

rhettinger
Contributor

No description provided.

@rhettinger
Contributor Author

I've asked Allen Downey to take a look at this as well.

@AllenDowney

The code looks good. I put some tests in this Jupyter notebook:

https://colab.research.google.com/github/AllenDowney/ExercisesInC/blob/master/examples/full_random.ipynb

It passes a visual test that the distribution is uniform from 0 to 1.

I also tried out an implementation closer to what's in the paper. Both work, but Raymond's is a bit faster.

One question: in the last line, why not use ldexp?

@AllenDowney

Although it occurs to me that random could get most of the benefit just by generating more bits.

I see that it generates 56 bits and then shifts 3 of them away. Why not use them all?

BPF = 56
RECIP_BPF = 2 ** -BPF

def random(self):
    """Get the next random number in the range [0.0, 1.0)."""
    return int.from_bytes(_urandom(7), 'big') * RECIP_BPF

@tim-one left a comment (Member)

Same comment about possible bias in subnormal results as in the ldexp() version I looked at offline.

A "geometric" explanation may be intuitively helpful: this is like throwing a dart at random at [0, 1), then picking the closest representable float at or to the left. The one in the paper is like throwing the dart, but picking the closest representable float in either direction, which leaves an exact power of 2 less likely to be picked than any other float in its binade, but more likely to be picked than any float in the binade preceding it.

Either of those is justifiable, but I prefer what this code does, because it's easier to explain (nothing special about an exact power of 2).

About the slight bias in denorm cases, I really don't care - but you might 😉 .

while not x:
    x = getrandbits(32)
    exponent += x.bit_length() - 32
return mantissa * 2.0 ** exponent
Member

Multiplication rounds, so when this slobbers into the denorm range, nearest/even rounding will give a slight bias toward 0 in the last retained bit. ldexp() on Windows truncates instead, which doesn't introduce bias in the denorm cases; but I believe ldexp() on most other platforms does round.
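The subnormal rounding being described can be checked numerically (a small sketch of my own, not from the PR; it assumes IEEE-754 binary64 floats):

```python
import math

# In the normal range, a 53-bit mantissa times a power of two is exact.
# A subnormal result has fewer than 53 mantissa bits available, so the
# low-order bits must be dropped (by rounding or truncation).
m = (1 << 52) | 1                               # 53 significant bits, low bit set
assert math.ldexp(m, -100) == m * 2.0 ** -100   # normal range: exact
tiny = math.ldexp(m, -1080)                     # result lands in the subnormals
assert tiny == math.ldexp(1, -1028)             # the set low-order bit was dropped
```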


+1 for this use of "slobber"

@tim-one
Member

tim-one commented Oct 12, 2020

I see that it generates 56 bits and then shifts 3 of them away. Why not use them all?

That's the SystemRandom class, which isn't used much in this context - very few people want to endure the expense of using a "crypto strength" random source to generate doubles. (The random() almost everyone actually uses is _random_Random_random_impl() in _randommodule.c, which uses the method shipped with the Mersenne Twister source code, combining parts of 2 32-bit random integers.)

The point to the shift is so that the floating multiplication is exact. Rounding would introduce numerical complications (e.g., most obviously, 1.0 would suddenly become a possible output; less obviously, the default to-nearest/even rounding could introduce bias in the last retained bit).
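The exactness claim is easy to verify (my own sketch, not from the thread):

```python
# An int with more than 53 significant bits may not convert to float
# exactly; after shifting down to at most 53 bits, conversion is exact.
n56 = (1 << 55) | 1            # a 56-bit int with its low bit set
assert float(n56) != n56       # conversion had to round the low bit away

n53 = n56 >> 3                 # drop 3 bits, leaving at most 53
assert float(n53) == n53       # now the conversion is exact
```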

@tim-one
Member

tim-one commented Oct 12, 2020

The point to the shift is so that the floating multiplication is exact.

Sorry, that was sloppy. The point to the shift is so that the conversion from int to float is exact. The multiplication is exact regardless (we're nowhere near the subnormal range, nor near overflow). But avoiding rounding remains the motivation.

@rhettinger
Contributor Author

rhettinger commented Oct 12, 2020

Okay, I switched back to using ldexp(). Ideally, the recipe should handle subnormals as well as possible even though they are highly improbable (one in three googols).

The reason for the mantissa * 2.0 ** exponent is that I thought it would be more clear to a reader when relating it back to the preceding algorithmic explanation. (Unlike library code, recipes are primarily intended to be read and understood.) If I put this in a lightning talk, I would likely go back to the mantissa * 2.0 ** exponent variant. To me, that variant feels more mathematical and less computer-sciency, making it easier to explain.

I also considered using floor(log2(x)) instead of x.bit_length() but preferred to stay in the domain of ints until the last step. And in this case, the Python method name bit_length is clearer than the mathy version (which is likely only suitable for a post on math exchange or in a paper).

@tim-one
Member

tim-one commented Oct 12, 2020

Any worries I had about subnormals and a slight bias for an exact 0.0 were tempered by the realization that these were highly improbable (one in three googols).

Actually, under the "closest representable float <= a truly random real in [0, 1)" view, 0 will be delivered with the right probability, provided ldexp() truncates (as it does on Windows). And it's not one in three googols, it's more like one in a googol cubed (10**300). Given that, I really have only a slight preference for ldexp() over * 2.0 ** whatever, but for a reason that hasn't been mentioned: there's no cross-platform guarantee that 2.0 ** n will return the mathematical 2**n - pow() is one of the hardest of the "basic" transcendental functions to implement with a "strictly less than 1 ulp" worst-case error bound, and in the bad old days I routinely bumped into libm pows that got the last dozen bits wrong.

@rhettinger
Contributor Author

Fair enough. ldexp() is specialized to assemble a float from a mantissa and an integer exponent. So, if it is well implemented, it should do at least as well as, and possibly better than, any other way of doing it.

@rhettinger
Contributor Author

FWIW, here is the test code I've been using:

from statistics import mean, stdev, quantiles
from collections import Counter
from pprint import pp
from math import sqrt, log2

data = [full_random() for i in range(1_000_000)]
print(f'{mean(data)=}  {stdev(data)=} compare with {sqrt(1/12)=}')
print(min(data), quantiles(data), max(data))

pp(sorted(Counter(log2(x.as_integer_ratio()[1]) for x in data).items()))

That gives this output:

mean(data)=0.49961382792812314  stdev(data)=0.2888614304304414 compare with sqrt(1/12)=0.28867513459481287
1.911638086796513e-06 [0.24927708964875345, 0.4994823867863665, 0.7502134596642684] 0.9999995162312929
[(31.0, 1),
 (35.0, 1),
 (36.0, 1),
 (37.0, 6),
 (38.0, 10),
 (39.0, 17),
 (40.0, 35),
 (41.0, 77),
 (42.0, 147),
 (43.0, 342),
 (44.0, 666),
 (45.0, 1234),
 (46.0, 2648),
 (47.0, 5206),
 (48.0, 10414),
 (49.0, 20871),
 (50.0, 41495),
 (51.0, 83460),
 (52.0, 166489),
 (53.0, 333701),
 (54.0, 166237),
 (55.0, 83324),
 (56.0, 41743),
 (57.0, 20975),
 (58.0, 10494),
 (59.0, 5240),
 (60.0, 2574),
 (61.0, 1302),
 (62.0, 650),
 (63.0, 303),
 (64.0, 167),
 (65.0, 80),
 (66.0, 41),
 (67.0, 23),
 (68.0, 15),
 (69.0, 7),
 (70.0, 3),
 (71.0, 1)]

@rhettinger
Contributor Author

Any comments on the wording of the comment, docstring, or introductory paragraphs?

I haven't yet had a chance to test its intelligibility on my students.

@AllenDowney

AllenDowney commented Oct 12, 2020

About the docstring: it took me a few passes to get it. It is explained in terms of the denominator of a rational number, which makes sense.

I think of it differently, in terms of equally-spaced points on the number line being mapped to floating-point values, which are not equally spaced. So some of them get chosen many times and some (in fact, a large majority) never get chosen at all.

I think your way of explaining it is fine -- it's just not the way I thought of it.

It's probably best to test it on an audience that's not me.

@tim-one
Member

tim-one commented Oct 12, 2020

I also considered using floor(log2(x))

That would be a Bad Idea. log2() is not a primitive in some platform libms, and we emulate it then with log2(m * 2**e) = log(m) / log(2) + e. That's numerically unstable in some areas due to cancellation, so we use a different expression there. Regardless, there are 3 distinct sources of rounding error, and it's quite possible to get a float a tiny bit less than a mathematically exact integer result.

There's never a reason to "apologize" for using exact integer operations whenever possible 😉
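The difference is easy to illustrate (a sketch of my own; the exact failure mode of floor(log2(x)) is platform-dependent, so only the integer path is asserted):

```python
from math import floor, log2

# bit_length() is exact integer arithmetic: for x > 0 it equals
# floor(log2(x)) + 1 mathematically. The log2() route goes through
# floats, where the emulation log(m)/log(2) + e can land a hair below
# an exact integer, and floor() then comes out one too small.
x = 10 ** 300
assert x.bit_length() == 997    # exact by construction
approx = floor(log2(x)) + 1     # usually 997 as well, but there is no guarantee
```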

@tim-one
Member

tim-one commented Oct 12, 2020

Any comments on the wording of the comment, docstring, or introductory paragraphs?

It all depends on how well the reader understands the basics of floating point representations. Nobody comes to that with useful intuitions - they have to unlearn lots of what they think they "know".

I already gave a visual metaphor: throwing a dart uniformly across the real [0, 1) clopen interval. If they understand that representable floats are unevenly spaced, and how, then "move to the closest one <=" should be extremely easy to picture. But if they don't understand how representable floats are distributed, you're going to need at least a page of explanation with several diagrams.

@tim-one
Member

tim-one commented Oct 12, 2020

Depending on the reader, they may find this easiest to grasp. The default Random.random() returns

randrange(2**53) / 2**53

where there's no ambiguity because all operations are exact in float arithmetic.

full_random() returns the closest representable float less than or equal to the mathematical

randrange(2**1074) / 2**1074

Note that 1 / 2**1074 is the smallest non-zero positive representable float. If they understand the notation, this makes it crystal clear that it's "as uniform as possible".
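Those claims can be sanity-checked directly (my own sketch; assumes IEEE-754 binary64 floats):

```python
import math
import sys

# 1 / 2**1074 is the smallest positive representable float (the
# smallest subnormal); anything smaller rounds away to 0.0.
assert math.ldexp(1, -1074) == 5e-324
assert math.ldexp(1, -1075) == 0.0
# Equivalently: the smallest normal float, sys.float_info.min,
# divided by 2**52.
assert sys.float_info.min * 2 ** -52 == math.ldexp(1, -1074)
```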

@tim-one
Member

tim-one commented Oct 12, 2020

I doubt this is worth adding, but since I already wrote it ... it was a sanity check on the randrange(2**1074) explanation. The implementation of full_random() could be sold as an optimization of this. Unfortunately, avoiding all rounding makes it more complicated than the 1-liner I originally had in my head 😉:

EDIT: replaced the code with a more compact, more uniform, branch-free work-a-like. Now it's at least close to what was in my head 😉.

If there's any confusion about why this works, the key is in something that's obvious, but perhaps only in hindsight: every finite IEEE-754 double is, mathematically, an integer multiple of 2**-1074. So range(2**1074) is an equally spaced range of numerators such that, when divided by 2**1074, contains every representable double in the clopen real range [0.0, 1.0).

from random import getrandbits
from math import ldexp

def slow_full_random():
    m = getrandbits(1074)
    # Conceptually, we want truncating ldexp(m, -1074), but we don't
    # want any rounding anywhere. We need to cut `m` back to at
    # most 53 significant bits so conversion to float is exact.
    excess = max(m.bit_length() - 53, 0)
    return ldexp(m >> excess, excess - 1074)

@rhettinger
Contributor Author

Now I feel vindicated for my early draft designed to fit in a tweet ;-)

https://twitter.com/raymondh/status/1314995894492692481

@rhettinger
Contributor Author

rhettinger commented Oct 13, 2020

Draft text to introduce the recipe:

The default random() returns multiples of 2⁻⁵³ in the
range 0.0 ≤ x < 1.0. All such numbers are evenly spaced
and exactly representable as Python floats.

However, many floats in that interval are not possible selections.
For example, 0.05954861408025609 isn't an integer multiple of 2⁻⁵³.

The following recipe takes a different approach. All floats in
the interval are possible selections. Conceptually the way it works
is by choosing from evenly spaced multiples of 2⁻¹⁰⁷⁴ and then rounding
down to the nearest representable float.

For efficiency, the actual mechanics involve calling math.ldexp
to construct a representable float from a mantissa and exponent.
The mantissa is chosen from a uniform distribution of integers in the range
2⁵² ≤ mantissa < 2⁵³. The exponent is chosen from a geometric
distribution where exponents smaller than -53 occur half as often
as the next larger exponent.

from random import getrandbits
from math import ldexp

def full_random():
    ''' Uniform distribution from all possible floats
        in the interval 0.0 <= X < 1.0.
    '''
    mantissa = 0x10_0000_0000_0000 | getrandbits(52)
    exponent = -53
    x = 0
    while not x:
        x = getrandbits(32)
        exponent += x.bit_length() - 32
    return ldexp(mantissa, exponent)
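A quick smoke check of the recipe (the function is repeated so the snippet runs on its own; the checks are my choices, not from the PR):

```python
from random import getrandbits
from math import ldexp

def full_random():
    # same recipe as above, repeated so this check is self-contained
    mantissa = 0x10_0000_0000_0000 | getrandbits(52)
    exponent = -53
    x = 0
    while not x:
        x = getrandbits(32)
        exponent += x.bit_length() - 32
    return ldexp(mantissa, exponent)

# All outputs lie in [0.0, 1.0), and denominators larger than 2**53
# (impossible for the stock random()) show up routinely.
xs = [full_random() for _ in range(10_000)]
assert all(0.0 <= x < 1.0 for x in xs)
assert any(x.as_integer_ratio()[1] > 2 ** 53 for x in xs)
```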

@rhettinger rhettinger merged commit 8b2ff4c into python:master Oct 13, 2020
@miss-islington
Contributor

Thanks @rhettinger for the PR 🌮🎉.. I'm working now to backport this PR to: 3.9.
🐍🍒⛏🤖

@miss-islington
Contributor

Sorry @rhettinger, I had trouble checking out the 3.9 backport branch.
Please backport using cherry_picker on command line.
cherry_picker 8b2ff4c03d150c43df3e8438d323b7f7bfe3353c 3.9

@rhettinger rhettinger added and then removed the needs backport to 3.9 label Oct 13, 2020

@bedevere-bot bedevere-bot removed the needs backport to 3.9 label Oct 13, 2020
@bedevere-bot

GH-22684 is a backport of this pull request to the 3.9 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Oct 13, 2020
xzy3 pushed a commit to xzy3/cpython that referenced this pull request Oct 18, 2020
adorilson pushed a commit to adorilson/cpython that referenced this pull request Mar 13, 2021
Labels
docs Documentation in the Doc dir skip issue skip news