sum() several times slower on Python 3 64-bit #68264

ambv · 2015-04-29T18:36:58Z

BPO	24076
Nosy	@gvanrossum, @rhettinger, @mdickinson, @scoder, @stevendaprano, @ambv, @serhiy-storchaka, @pablogsal
PRs	bpo-24076: Inline single digit unpacking in the integer fastpath of sum() #28469 bpo-24076: Fix reference in sum() introduced by GH-28469 #28493
Files	pylong_freelist.patch unpack_single_digits.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2021-09-21.11:22:53.873>
created_at = <Date 2015-04-29.18:36:58.330>
labels = ['interpreter-core', '3.11', 'performance']
title = 'sum() several times slower on Python 3 64-bit'
updated_at = <Date 2021-09-22.10:30:16.952>
user = 'https://github.com/ambv'

bugs.python.org fields:

activity = <Date 2021-09-22.10:30:16.952>
actor = 'pablogsal'
assignee = 'none'
closed = True
closed_date = <Date 2021-09-21.11:22:53.873>
closer = 'scoder'
components = ['Interpreter Core']
creation = <Date 2015-04-29.18:36:58.330>
creator = 'lukasz.langa'
dependencies = []
files = ['39245', '47748']
hgrepos = []
issue_num = 24076
keywords = ['patch']
message_count = 35.0
messages = ['242238', '242241', '242242', '242243', '242244', '242259', '242260', '242262', '242294', '242300', '242302', '242310', '242357', '242914', '323443', '402136', '402201', '402210', '402281', '402282', '402285', '402286', '402295', '402298', '402302', '402306', '402310', '402313', '402315', '402325', '402326', '402328', '402334', '402410', '402421']
nosy_count = 8.0
nosy_names = ['gvanrossum', 'rhettinger', 'mark.dickinson', 'scoder', 'steven.daprano', 'lukasz.langa', 'serhiy.storchaka', 'pablogsal']
pr_nums = ['28469', '28493']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue24076'
versions = ['Python 3.11']

ambv · 2015-04-29T18:36:58Z

I got a report that summing numbers is noticably slower on Python 3. This is easily reproducible:

$ time python2.7 -c "print sum(xrange(3, 10**9, 3)) + sum(xrange(5, 10**9, 5)) - sum(xrange(15, 10**9, 15))"
233333333166666668

real 0m6.165s
user 0m6.100s
sys 0m0.032s

$ time python3.4 -c "print(sum(range(3, 10**9, 3)) + sum(range(5, 10**9, 5)) - sum(range(15, 10**9, 15)))"
233333333166666668

real 0m16.413s
user 0m16.086s
sys 0m0.089s

I can't tell from initial poking what's the core issue here. Both examples produce equivalent bytecode, the builtin_sum() function is only noticably different in the fact that it uses PyLong_* across the board, including PyLong_AsLongAndOverlow. We'll need to profile this, which I didn't have time for yet.

serhiy-storchaka · 2015-04-29T19:02:04Z

Can't reproduce on 32-bit Linux.

$ time python2.7 -c "print sum(xrange(3, 10**9, 3)) + sum(xrange(5, 10**9, 5)) - sum(xrange(15, 10**9, 15))"
233333333166666668

real 1m11.614s
user 1m11.376s
sys 0m0.056s
$ time python3.4 -c "print(sum(range(3, 10**9, 3)) + sum(range(5, 10**9, 5)) - sum(range(15, 10**9, 15)))"
233333333166666668

real 1m11.658s
user 1m10.980s
sys 0m0.572s

$ python2.7 -m timeit -n1 -r1 "sum(xrange(3, 10**9, 3)) + sum(xrange(5, 10**9, 5)) - sum(xrange(15, 10**9, 15))"
1 loops, best of 1: 72 sec per loop
$ python3.4 -m timeit -n1 -r1 "sum(range(3, 10**9, 3)) + sum(range(5, 10**9, 5)) - sum(range(15, 10**9, 15))"
1 loops, best of 1: 72.5 sec per loop

$ python2.7 -m timeit -s "a = list(range(10**6))" -- "sum(a)"
10 loops, best of 3: 114 msec per loop
$ python3.4 -m timeit -s "a = list(range(10**6))" -- "sum(a)"
10 loops, best of 3: 83.5 msec per loop

What is sys.int_info on your build?

pitrou · 2015-04-29T19:04:22Z

I reproduce under 64-bit Linux. So this may be because the Python long digit (30 bits) is smaller than the C long (64 bits).

Lukasz: is there a specific use case? Note you can use Numpy for such calculations.

ambv · 2015-04-29T19:23:39Z

Serhiy, this is 64-bit specific. Antoine, as far as I can tell, the main use case is: "Don't make it look like migrating to Python 3 is a terrible performance downgrade."

As we discussed on the language summit this year [1], we have to be at least not worse to look appealing. This might be a flawed benchmark but people will make them anyway. In this particular case, there's internal usage at Twitter that unearthed it. The example is just a simplified repro.

Some perf degradations were expected, like switching text to Unicode. In this case, the end result computed by both 2.7 and 3.4 is the same so we should be able to address this.

[1] http://lwn.net/Articles/640224/

pitrou · 2015-04-29T19:34:25Z

If that's due to the different representation of Python 2's int type and Python 3's int type then I don't see an easy solution to this.

mdickinson · 2015-04-30T03:50:26Z

Łukasz: there are three ingredients here - sum, (x)range and the integer addition that sum will be performing at each iteration. Is there any chance you can separate the effects on your machine?

On my machine (OS X, 64-bit), I'm seeing *some* speed difference in the integer arithmetic, but not enough to explain the whole of the timing mismatch.

One thing we've lost in Python 3 is the fast path for small-int addition *inside* the ceval loop. It may be possible to restore something there.

mdickinson · 2015-04-30T04:08:30Z

Throwing out sum, I'm seeing significant slowdown simply from xrange versus range:

taniyama:Desktop mdickinson$ python2 -m timeit -s 'x = xrange(3, 10**9, 3)' 'for e in x: pass'
10 loops, best of 3: 5.01 sec per loop
taniyama:Desktop mdickinson$ python3 -m timeit -s 'x = range(3, 10**9, 3)' 'for e in x: pass'
10 loops, best of 3: 8.62 sec per loop

scoder · 2015-04-30T05:30:12Z

there are three ingredients here - sum, (x)range and the integer addition that sum will be performing at each iteration.

... not to forget the interpreter startup time on his machine. :)

I did a tiny bit of profiling and about 90% of the time seems to be spent creating and deallocating throw-away PyLong objects. My guess is that it simply lacks a free-list in _PyLong_New().

pitrou · 2015-04-30T23:35:12Z

It seems we (like the benchmarks posted) are spending a whole lot of time on something that's probably not relevant to any real-world situation.

If someone has actual code that suffers from this, it would be good to know about it.
(note by the way that summing on a range() can be done O(1): it's just a variation on a arithmetic series)

scoder · 2015-05-01T06:09:57Z

I don't think it's irrelevant. Throw-away integers are really not uncommon. For-loops use them quite often, non-trivial arithmetic expressions can create a lot of intermediate temporaries. Speeding up the create-delete cycle of PyLong sounds like a very obvious thing to do.

Imagine some code that iterates over a list of integers, applies some calculation to them, and then stores them in a new list, maybe even using a list comprehension or so. If you could speed up the intermediate calculation by avoiding overhead in creating temporary PyLong objects, such code could benefit a lot.

I suspect that adding a free-list for single-digit PyLong objects (the most common case) would provide some visible benefit.

pitrou · 2015-05-01T10:44:15Z

Le 01/05/2015 08:09, Stefan Behnel a écrit :

I don't think it's irrelevant. Throw-away integers are really not
uncommon. For-loops use them quite often, non-trivial arithmetic
expressions can create a lot of intermediate temporaries. Speeding up
the create-delete cycle of PyLong sounds like a very obvious thing to do.

That may be a good thing indeed. I'm just saying that the benchmarks
people are worried about here are completely pointless.

scoder · 2015-05-01T14:02:20Z

I tried implementing a freelist. Patch attached, mostly adapted from the one in dictobject.c, but certainly needs a bit of cleanup.

The results are not bad, about 10-20% faster:

Original:

$ ./python -m timeit 'sum(range(1, 100000))'
1000 loops, best of 3: 1.86 msec per loop

$ ./python -m timeit -s 'l = list(range(1000, 10000))' '[(i*2+5) // 7 for i in l]'
1000 loops, best of 3: 1.05 msec per loop

With freelist:

$ ./python -m timeit 'sum(range(1, 100000))'
1000 loops, best of 3: 1.52 msec per loop

$ ./python -m timeit -s 'l = list(range(1000, 10000))' '[(i*2+5) // 7 for i in l]'
1000 loops, best of 3: 931 usec per loop

stevendaprano · 2015-05-01T23:46:20Z

Antoine asked:

If someone has actual code that suffers from this, it would be good to know about it.

You might have missed Łukasz' earlier comment: "In this particular case, there's internal usage at Twitter that unearthed it. The example is just a simplified repro."

scoder · 2015-05-11T20:08:55Z

bpo-24165 was created to pursue the path of a free-list for PyLong objects.

scoder · 2018-08-12T12:36:17Z

FWIW, a PGO build of Py3.7 is now about 20% *faster* here than my Ubuntu 16/04 system Python 2.7, and for some (probably unrelated) reason, the system Python 3.5 is another 2% faster on my side.

IMHO, the only other thing that seems obvious to try would be to inline the unpacking of single digit PyLongs into sum(). I attached a simple patch that does that, in case someone wants to test it out. For non-PGO builds, it's about 17% faster for me. Didn't take the time to benchmark PGO builds with it.

gvanrossum · 2021-09-18T17:39:37Z

@Stefan

FWIW, a PGO build of Py3.7 is now about 20% *faster* here than my Ubuntu 16/04 system Python 2.7

Does that mean we can close this issue? Or do I misunderstand what you are comparing? 32 vs. 64 bits? PGO vs. non-PGO?

OTOH on my Mac I still find that 3.10 with PGO is still more than twice as slow than 2.7.

Thinking about it that's a bit odd, since (presumably) the majority of the work in sum() involves a long int result (even though the values returned by range() all fit in 30 bits, the sum quickly exceeds that).

(earlier)

I suspect that adding a free-list for single-digit PyLong objects (the most common case) would provide some visible benefit.

If my theory is correct that wouldn't help this particular case, right?

FWIW just "for i in [x]range(15, 10**9, 15): pass" is about the same speed in Python 2.7 as in 3.11.

rhettinger · 2021-09-20T07:22:27Z

OTOH on my Mac I still find that 3.10 with PGO is still
more than twice as slow than 2.7.

Thinking about it that's a bit odd, since (presumably)
the majority of the work in sum() involves a long int result
(even though the values returned by range() all fit in 30 bits,
the sum quickly exceeds that).

The actual accumulation of a long int result is still as fast as it ever was.

The main difference from Py2.7 isn't the addition, it is that detecting and extracting a small int added has become expensive.

-- Python 2 fastpath --------------------------------------

            if (PyInt_CheckExact(item)) {               // Very cheap
                long b = PyInt_AS_LONG(item);           // Very cheap
                long x = i_result + b;                  // Very cheap
                if ((x^i_result) >= 0 || (x^b) >= 0) {  // Semi cheap
                    i_result = x;                       // Zero cost 
                    Py_DECREF(item);                    // Most expensive step, but still cheap
                    continue;
                }
            }

-- Python 3 fastpath --------------------------------------

            if (PyLong_CheckExact(item) || PyBool_Check(item)) {     // Cheap
                long b = PyLong_AsLongAndOverflow(item, &overflow);  // Super Expensive
                if (overflow == 0 &&                                 // Branch predictable test
                    (i_result >= 0 ? (b <= LONG_MAX - i_result)      // Slower but better test  
                                   : (b >= LONG_MIN - i_result)))
                {
                    i_result += b;                                    // Very cheap
                    Py_DECREF(item);
                    continue;
                }
            }

-- Supporting function ------------------------------------

long
PyLong_AsLongAndOverflow(PyObject *vv, int *overflow)                 // OMG, this does a lot of work
{
    /* This version by Tim Peters */
    PyLongObject *v;
    unsigned long x, prev;
    long res;
    Py_ssize_t i;
    int sign;
    int do_decref = 0; /* if PyNumber_Index was called */

    *overflow = 0;
    if (vv == NULL) {
        PyErr_BadInternalCall();
        return -1;
    }

    if (PyLong_Check(vv)) {
        v = (PyLongObject *)vv;
    }
    else {
        v = (PyLongObject *)_PyNumber_Index(vv);
        if (v == NULL)
            return -1;
        do_decref = 1;
    }

res = -1;
i = Py_SIZE(v);

    switch (i) {
    case -1:
        res = -(sdigit)v->ob_digit[0];
        break;
    case 0:
        res = 0;
        break;
    case 1:
        res = v->ob_digit[0];
        break;
    default:
        sign = 1;
        x = 0;
        if (i < 0) {
            sign = -1;
            i = -(i);
        }
        while (--i >= 0) {
            prev = x;
            x = (x << PyLong_SHIFT) | v->ob_digit[i];
            if ((x >> PyLong_SHIFT) != prev) {
                *overflow = sign;
                goto exit;
            }
        }
        /* Haven't lost any bits, but casting to long requires extra
         * care (see comment above).
         */
        if (x <= (unsigned long)LONG_MAX) {
            res = (long)x * sign;
        }
        else if (sign < 0 && x == PY_ABS_LONG_MIN) {
            res = LONG_MIN;
        }
        else {
            *overflow = sign;
            /* res is already set to -1 */
        }
    }
  exit:
    if (do_decref) {
        Py_DECREF(v);
    }
    return res;
}

scoder · 2021-09-20T08:13:18Z

I created a PR from my last patch, inlining the unpacking of single digit integers. Since most integers should fit into a single digit these days, this is as fast a path as it gets.

#28469

rhettinger · 2021-09-20T22:32:13Z

I created a PR from my last patch, inlining the unpacking
of single digit integers.

Thanks, that gets to the heart of the issue.

I marked the PR as approved (though there is a small coding nit you may want to fix).

gvanrossum · 2021-09-20T23:17:22Z

The patch looks fine, but it looks a bit like benchmark chasing. Is the speed of builtin sum() of a sequence of integers important enough to do this bit of inlining? (It may break if we change the internals of Py_Long, as Mark Shannon has been wanting to do for a while -- see faster-cpython/ideas#42.)

scoder · 2021-09-21T05:31:20Z

The patch looks fine, but it looks a bit like benchmark chasing. Is the speed of builtin sum() of a sequence of integers important enough to do this bit of inlining?

Given that we already accepted essentially separate loops for the int, float and everything else cases, I think the answer is that it doesn't add much to the triplication.

It may break if we change the internals of Py_Long, as Mark Shannon has been wanting to do for a while

I would assume that such a structural change would come with suitable macros to unpack the special 0-2 digit integers. Those would then apply here, too. As it stands, there are already some modules distributed over the source tree that use direct digit access: ceval.c, _decimal.c, marshal.c. They are easy to find with grep and my PR just adds one more.

gvanrossum · 2021-09-21T05:36:23Z

Sounds good, you have my blessing.

scoder · 2021-09-21T09:01:22Z

New changeset debd804 by scoder in branch 'main':
bpo-24076: Inline single digit unpacking in the integer fastpath of sum() (GH-28469)
debd804

serhiy-storchaka · 2021-09-21T09:21:39Z

What are microbenchmark results for PR 28469 in comparison with the baseline?

scoder · 2021-09-21T11:21:49Z

Old, with PGO:
$ ./python -m timeit -s 'd = list(range(2**61, 2**61 + 10000))' 'sum(d)'
1000 loops, best of 5: 340 usec per loop
$ ./python -m timeit -s 'd = list(range(2**30, 2**30 + 10000))' 'sum(d)'
2000 loops, best of 5: 114 usec per loop
$ ./python -m timeit -s 'd = list(range(2**29, 2**29 + 10000))' 'sum(d)'
5000 loops, best of 5: 73.4 usec per loop
$ ./python -m timeit -s 'd = list(range(10000))' 'sum(d)'
5000 loops, best of 5: 73.3 usec per loop
$ ./python -m timeit -s 'd = [0] * 10000' 'sum(d)'
5000 loops, best of 5: 78.7 usec per loop

New, with PGO:
$ ./python -m timeit -s 'd = list(range(2**61, 2**61 + 10000))' 'sum(d)'
1000 loops, best of 5: 305 usec per loop
$ ./python -m timeit -s 'd = list(range(2**30, 2**30 + 10000))' 'sum(d)'
2000 loops, best of 5: 115 usec per loop
$ ./python -m timeit -s 'd = list(range(2**29, 2**29 + 10000))' 'sum(d)'
5000 loops, best of 5: 52.4 usec per loop
$ ./python -m timeit -s 'd = list(range(10000))' 'sum(d)'
5000 loops, best of 5: 54 usec per loop
$ ./python -m timeit -s 'd = [0] * 10000' 'sum(d)'
5000 loops, best of 5: 45.8 usec per loop

The results are a bit more mixed with PGO optimisation (I tried a couple of times), not sure why. Might just be normal fluctuation, bad benchmark value selection, or accidental PGO tuning, can't say. In any case, the 1-digit case (10000, 2**29) is again about 28% faster and none of the other cases seems (visibly) slower.

I think this is a very clear net-win.

serhiy-storchaka · 2021-09-21T12:48:30Z

Thank you again Stefan. Now no doubts are left.

BTW, pyperf gives more stable results. I use it if have any doubts (either the results of timeit are not stable or the difference is less than say 10%).

pablogsal · 2021-09-21T16:45:45Z

Unfortunately commit debd804 introduced a reference leak:

❯ ./python -m test test_grammar -R :
0:00:00 load avg: 2.96 Run tests sequentially
0:00:00 load avg: 2.96 [1/1] test_grammar
beginning 9 repetitions
123456789
.........
test_grammar leaked [12, 12, 12, 12] references, sum=48
test_grammar failed (reference leak)

== Tests result: FAILURE ==

1 test failed:
test_grammar

Total duration: 1.1 sec
Tests result: FAILURE

debd804 is the first bad commit
commit debd804
Author: scoder <[email protected]>
Date: Tue Sep 21 11:01:18 2021 +0200

bpo-24076: Inline single digit unpacking in the integer fastpath of sum() (GH-28469)

.../Core and Builtins/2021-09-20-10-02-12.bpo-24076.ZFgFSj.rst | 1 +
Python/bltinmodule.c | 10 +++++++++-
2 files changed, 10 insertions(+), 1 deletion(-)
create mode 100644 Misc/NEWS.d/next/Core and Builtins/2021-09-20-10-02-12.bpo-24076.ZFgFSj.rst

pablogsal · 2021-09-21T16:53:17Z

Opened bpo-28493 to fix the refleak

pablogsal · 2021-09-21T16:53:32Z

Sorry, I meant PR 28493

pablogsal · 2021-09-21T17:39:12Z

New changeset 1c7e98d by Pablo Galindo Salgado in branch 'main':
bpo-24076: Fix reference in sum() introduced by python/issues-test-cpython#28469 (GH-28493)
1c7e98d

scoder · 2021-09-22T08:03:05Z

Sorry for that, Pablo. I knew exactly where the problem was, the second I read your notification. Thank you for resolving it so quickly.

pablogsal · 2021-09-22T10:30:17Z

Always happy to help :)

markshannon · 2022-10-20T11:17:38Z

I'm still seeing a large slowdown (comparing the Ubuntu 2.7 and 3.11)

$ time python2.7 -c "print sum(xrange(3, 10**9, 3)) + sum(xrange(5, 10**9, 5)) - sum(xrange(15, 10**9, 15))"
233333333166666668

real	0m2.907s
user	0m2.877s
sys	0m0.025s
$ time python3.11 -c "print(sum(range(3, 10**9, 3)) + sum(range(5, 10**9, 5)) - sum(range(15, 10**9, 15)))"
233333333166666668

real	0m6.915s
user	0m6.898s
sys	0m0.009s

markshannon · 2022-10-20T11:19:11Z

I think the fix is faster-cpython/ideas#147, but we haven't started implementing that.

markshannon · 2022-10-21T14:01:55Z

I'm reopening this as sum() is still quite a lot slower than 2.7

l1t1 · 2022-10-26T02:15:15Z

3.11

D:\python311>python
Python 3.11.0 (main, Oct 24 2022, 18:26:48) [MSC v.1933 64 bit (AMD64)] on win32
>>> import time
>>> t=time.time();sum(range(1,pow(10,8)+1));print(time.time()-t)
5000000050000000
4.157237768173218

vs
3.10

D:\python310>python
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)] on win32
>>> import time
>>> t=time.time();sum(range(1,pow(10,8)+1));print(time.time()-t)
5000000050000000
4.183239221572876

l1t1 · 2022-10-26T02:20:26Z

pypy 7.3.9

D:\pypy3.8-v7.3.9-win64>pypy
Python 3.8.12 (0089b4a7ab2306925a251b35912885d52ead1aba, Mar 16 2022, 13:51:04)
[PyPy 7.3.9 with MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>> import time
>>>> t=time.time();sum(range(1,pow(10,8)+1));print(time.time()-t)
5000000050000000
0.1780109405517578

StanFromIreland · 2025-04-24T20:50:47Z

This can be closed. On 3.14:

$ time python3.14 -c "print(sum(range(3, 10**9, 3)) + sum(range(5, 10**9, 5)) - sum(range(15, 10**9, 15)))"
233333333166666668

real	0m3.706s
user	0m3.687s
sys	0m0.005s

Much faster than @ambv benchmarks.

real 0m6.165s
user 0m6.100s
sys 0m0.032s

real 0m16.413s
user 0m16.086s
sys 0m0.089s

Discussion of further optimization belongs elsewhere (faster-cpython).

skirpichev · 2025-04-25T04:23:22Z

This can be closed. On 3.14:

I'm not sure. Absolute numbers in benchmarks aren't relevant here. OP probably run py2 tests on a different system than you.

At least, you should run test for py2 and py3 on same system. Here are my tests.

Py2:

$ python2.7 -m timeit -n1 -r1 "sum(xrange(3, 10**9, 3)) + sum(xrange(5, 10**9, 5)) - sum(xrange(15, 10**9, 15))"
1 loops, best of 1: 9.85 sec per loop

Py3.13:

$ python3.13  -m timeit -n1 -r1 "sum(range(3, 10**9, 3)) + sum(range(5, 10**9, 5)) - sum(range(15, 10**9, 15))"
1 loop, best of 1: 39.9 sec per loop

Py3.14a7:

$ python3.14  -m timeit -n1 -r1 "sum(range(3, 10**9, 3)) + sum(range(5, 10**9, 5)) - sum(range(15, 10**9, 15))"
1 loop, best of 1: 22.3 sec per loop

For me it looks like issue is valid and it's a regression from py2, not a feature request. @picnixz ? @ambv ?

picnixz · 2025-04-25T06:43:18Z

Honestly, considering 2.7 has been EOL for a long time, I don't think we need to keep this specific issue open. I don't think we can do anything now and I would indeed prefer if faster-cpython comes up with a solution.

Now, for performance loss, we sometimes treat them as bug, sometimes not. I categorized it as a FR because we won't backport the change I think.

skirpichev · 2025-04-25T07:02:14Z

Honestly, considering 2.7 has been EOL for a long time, I don't think we need to keep this specific issue open.

Why not? If v2.7 is better somewhere - it's still a regression.

I don't think we can do anything now and I would indeed prefer if faster-cpython comes up with a solution.

I'm not sure that closing issue is a right thing even if we can't do anything now. Work is ongoing, e.g.: https://discuss.python.org/t/87950. I think it's a good thing to keep eye on this issue for people involved.

BTW, Py3.14 has impressive speedup, but I doubt it's related to integer arithmetic.

gvanrossum · 2025-04-25T07:43:17Z

I'm with @skirpichev. Can someone at least summarize an explanation of the difference between 2.7 and 3.x?

picnixz · 2025-04-25T07:55:07Z

I think it was analyzed by #68264 (comment) but AFAICT, it's the call to PyLong_AsLongAndOverflow that introduced the regression.

Work is ongoing, e.g.: discuss.python.org/t/87950

I wasn't aware of this one so thanks. By the way, feel free to re-open issues if I close them wrongly!

skirpichev · 2025-04-25T09:01:16Z

Can someone at least summarize an explanation of the difference between 2.7 and 3.x?

My 2c:

one issue was mentioned by @mdickinson: sum() several times slower on Python 3 64-bit #68264 (comment) and gone ~3.11: sum() several times slower on Python 3 64-bit #68264 (comment)
second one was missing specialization for single-digit integers (in a loop): sum() several times slower on Python 3 64-bit #68264 (comment). This was added by GH-101291: Rearrange the size bits in PyLongObject #102464.
something happened on 3.14 (as my benchmarks suggests). I don't know yet.

Here more tests from my zoo:

$ python2.7 -m timeit -n1 -r1 "sum(xrange(3, 10**9, 3)) + sum(xrange(5, 10**9, 5)) - sum(xrange(15, 10**9, 15))"
1 loops, best of 1: 9.85 sec per loop
$ python3.9  -m timeit -n1 -r1 "sum(range(3, 10**9, 3)) + sum(range(5, 10**9, 5)) - sum(range(15, 10**9, 15))"
1 loop, best of 1: 27.7 sec per loop
$ python3.10  -m timeit -n1 -r1 "sum(range(3, 10**9, 3)) + sum(range(5, 10**9, 5)) - sum(range(15, 10**9, 15))"
1 loop, best of 1: 28.4 sec per loop
$ python3.11  -m timeit -n1 -r1 "sum(range(3, 10**9, 3)) + sum(range(5, 10**9, 5)) - sum(range(15, 10**9, 15))"
1 loop, best of 1: 26.1 sec per loop
$ python3.12  -m timeit -n1 -r1 "sum(range(3, 10**9, 3)) + sum(range(5, 10**9, 5)) - sum(range(15, 10**9, 15))"
1 loop, best of 1: 31.3 sec per loop
$ python3.13  -m timeit -n1 -r1 "sum(range(3, 10**9, 3)) + sum(range(5, 10**9, 5)) - sum(range(15, 10**9, 15))"
1 loop, best of 1: 33.1 sec per loop
$ python3.14  -m timeit -n1 -r1 "sum(range(3, 10**9, 3)) + sum(range(5, 10**9, 5)) - sum(range(15, 10**9, 15))"
1 loop, best of 1: 21.8 sec per loop

ambv added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Apr 29, 2015

mlouielu mannequin added the 3.7 (EOL) end of life label Apr 11, 2017

mlouielu mannequin changed the title ~~sum() several times slower on Python 3~~ sum() several times slower on Python 3 64-bit Apr 11, 2017

scoder added 3.11 only security fixes and removed 3.7 (EOL) end of life labels Sep 20, 2021

scoder closed this as completed Sep 21, 2021

ezio-melotti transferred this issue from another repository Apr 10, 2022

CAM-Gerlach mentioned this issue Oct 19, 2022

gh-95913: Edit & expand Optimizations in 3.11 WhatsNew #98426

Merged

markshannon reopened this Oct 21, 2022

This was referenced Aug 11, 2023

Make an O(1) fastpath for sum(range(...)) #107868

Closed

gh-107868 Add an O(1) fastpath for sum(range(...)) #107870

Closed

picnixz added type-feature A feature request or enhancement and removed 3.11 only security fixes labels Apr 24, 2025

picnixz closed this as completed Apr 24, 2025

picnixz reopened this Apr 25, 2025

sum() several times slower on Python 3 64-bit #68264

sum() several times slower on Python 3 64-bit #68264

Comments

ambv commented Apr 29, 2015

ambv commented Apr 29, 2015

serhiy-storchaka commented Apr 29, 2015

pitrou commented Apr 29, 2015

ambv commented Apr 29, 2015

pitrou commented Apr 29, 2015

mdickinson commented Apr 30, 2015

mdickinson commented Apr 30, 2015

scoder commented Apr 30, 2015

pitrou commented Apr 30, 2015

scoder commented May 1, 2015

pitrou commented May 1, 2015

scoder commented May 1, 2015

stevendaprano commented May 1, 2015

scoder commented May 11, 2015

scoder commented Aug 12, 2018

gvanrossum commented Sep 18, 2021

rhettinger commented Sep 20, 2021

scoder commented Sep 20, 2021

rhettinger commented Sep 20, 2021

gvanrossum commented Sep 20, 2021

scoder commented Sep 21, 2021

gvanrossum commented Sep 21, 2021

scoder commented Sep 21, 2021

serhiy-storchaka commented Sep 21, 2021

scoder commented Sep 21, 2021

serhiy-storchaka commented Sep 21, 2021

pablogsal commented Sep 21, 2021

pablogsal commented Sep 21, 2021

pablogsal commented Sep 21, 2021

pablogsal commented Sep 21, 2021

scoder commented Sep 22, 2021

pablogsal commented Sep 22, 2021

markshannon commented Oct 20, 2022

markshannon commented Oct 20, 2022

markshannon commented Oct 21, 2022

l1t1 commented Oct 26, 2022

l1t1 commented Oct 26, 2022

StanFromIreland commented Apr 24, 2025

skirpichev commented Apr 25, 2025

picnixz commented Apr 25, 2025

skirpichev commented Apr 25, 2025

gvanrossum commented Apr 25, 2025 • edited Loading

picnixz commented Apr 25, 2025 • edited Loading

skirpichev commented Apr 25, 2025

gvanrossum commented Apr 25, 2025 •

edited

Loading

picnixz commented Apr 25, 2025 •

edited

Loading