~2**32 byte tofile()/fromfile() limit in 64-bit Windows (Trac #1660) #2256


Closed
numpy-gitbot opened this issue Oct 19, 2012 · 17 comments

Original ticket http://projects.scipy.org/numpy/ticket/1660 on 2010-11-03 by trac user mspacek, assigned to @charris.

I'm using Christoph Gohlke's amd64 builds (http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy) on 64-bit Windows 7, with the official Python 2.6.6 amd64 install. tofile(), save(), and fromfile() fail on numpy arrays larger than roughly 2**32 bytes. For save() and tofile(), Python hangs, using 100% CPU on one core. For fromfile(), it raises an IOError. This doesn't happen in 64-bit Linux on the same machine. load() seems to work on any size of file. Here are some examples, and some caveats:

# if .tofile() hangs, it hangs immediately:
np.zeros(2**32+2**16, np.int8).tofile('test.npy') # hangs, file is 2**16 bytes
np.zeros(2**32+2**2,  np.int8).tofile('test.npy') # hangs, file is 0 bytes
np.zeros(2**32+2,     np.int8).tofile('test.npy') # hangs, file is 0 bytes
np.zeros(2**32+1,     np.int8).tofile('test.npy') # hangs, file is 0 bytes
np.zeros(2**32,       np.int8).tofile('test.npy') # hangs, file is 0 bytes
np.zeros(2**32-1,     np.int8).tofile('test.npy') # works

# if np.save() hangs, it takes a while for it to hang:
np.save('test.npy', np.zeros(2**33, np.int8)) # hangs, file is 2**32 bytes, npy header is intact
np.save('test.npy', np.zeros(2**33-1, np.int8)) # hangs, file is 2**32 bytes, npy header is intact
np.save('test.npy', np.zeros(2**32+2**31, np.int8)) # hangs, file is 2**31 bytes, npy header is intact
np.save('test.npy', np.zeros(2**32+2**16, np.int8)) # hangs, file is 2**16 bytes, npy header is intact
np.save('test.npy', np.zeros(2**32+2**12, np.int8)) # hangs, file is 2**12 bytes, npy header is intact
np.save('test.npy', np.zeros(2**32+2**11, np.int8)) # works
np.save('test.npy', np.zeros(2**32+2**10, np.int8)) # works
np.save('test.npy', np.zeros(2**32+2**8, np.int8)) # works
np.save('test.npy', np.zeros(2**32+2**4, np.int8)) # works
np.save('test.npy', np.zeros(2**32+2**2, np.int8)) # works
np.save('test.npy', np.zeros(2**32+2, np.int8)) # works
np.save('test.npy', np.zeros(2**32+1, np.int8)) # works
np.save('test.npy', np.zeros(2**32, np.int8)) # works

# I also generated these 3 large files successfully in 64-bit Linux...
np.zeros(2**33, np.int8).tofile('test1.npy') # 8 GB
np.zeros(2**32, np.int8).tofile('test2.npy') # 4 GB
np.save('test3.npy', np.zeros(2**33, np.int8)) # 8 GB

# ...and then tried reading them back in 64-bit Win7:
a = np.fromfile('test1.npy', dtype=np.int8) # IOError: could not seek in file
a = np.fromfile('test2.npy', dtype=np.int8) # IOError: could not seek in file
a = np.load('test3.npy') # strangely enough, this works fine, even though it's 8 GB!

I have this problem regardless of which of Christoph's amd64 numpy builds (1.4.1, 1.5.0, 1.5.1RC1, MKL or non-MKL) I use. This happens on both my 64-bit Win7 i7 12 GB machine, and on another machine running 64-bit WinXP. Christoph has also confirmed this on Python 2.5 through 3.1 and says there is nothing special about his builds, and that this will affect all 64-bit Windows distributions of numpy (EPD, ActiveState, etc).


@cgohlke wrote on 2010-11-03

The tofile() call hangs in numpy\core\src\multiarray\convert.c line 84:

            n = fwrite((const void *)self->data,
                    (size_t) self->descr->elsize,
                    (size_t) size, fp);

This seems to be the reason: "fwrite issues with large data write" (http://social.msdn.microsoft.com/Forums/en-US/vcgeneral/thread/7c913001-227e-439b-bf07-54369ba07994)


@cgohlke wrote on 2010-11-03

Probably the same issue as http://bugs.python.org/issue9015.


trac user mspacek wrote on 2010-11-03

Well, that's depressing. I should just switch to Linux already. It looks like fseek and ftell are 32-bit in MSVC, even on Win64. Perhaps _fseeki64 and _ftelli64 need to be used instead:

http://msdn.microsoft.com/en-us/library/75yw9bf3%28v=VS.90%29.aspx

http://msdn.microsoft.com/en-us/library/0ys3hc0b%28v=VS.90%29.aspx

http://www.firstobject.com/fseeki64-ftelli64-in-vc++.htm

This might explain why "IOError: could not seek in file" is thrown in /numpy/core/src/multiarray/ctors.c line 3037 when I call fromfile():

static PyArrayObject *
array_fromfile_binary(FILE *fp, PyArray_Descr *dtype, intp num, size_t *nread)
{
    PyArrayObject *r;
    intp start, numbytes;

    if (num < 0) {
        int fail = 0;

        start = (intp) ftell(fp);
        if (start < 0) {
            fail = 1;
        }
        if (fseek(fp, 0, SEEK_END) < 0) {
            fail = 1;
        }
        numbytes = (intp) ftell(fp);
        if (numbytes < 0) {
            fail = 1;
        }
        numbytes -= start;
        if (fseek(fp, start, SEEK_SET) < 0) {
            fail = 1;
        }
        if (fail) {
            PyErr_SetString(PyExc_IOError,
                            "could not seek in file");
            Py_DECREF(dtype);
            return NULL;
        }
        num = numbytes / dtype->elsize;
    }
    r = (PyArrayObject *)PyArray_NewFromDescr(&PyArray_Type,
                                              dtype,
                                              1, &num,
                                              NULL, NULL,
                                              0, NULL);
    if (r == NULL) {
        return NULL;
    }
    NPY_BEGIN_ALLOW_THREADS;
    *nread = fread(r->data, dtype->elsize, num, fp);
    NPY_END_ALLOW_THREADS;
    return r;
}

However, I don't understand why the error comes up when I call fromfile() directly, but not when I call it indirectly via np.load().


trac user mspacek wrote on 2010-11-03

As a workaround for the tofile() problem, perhaps it could check whether the platform is Windows, and if so, call fwrite multiple times in, say, 2 GB chunks until all the data is written out.

Unfortunately, I don't have the skills to compile on win64, let alone implement something like this.


@cgohlke wrote on 2010-11-03

Rather than getting depressed or dumping Windows, how about you write/adjust the unit tests and I'll try to fix the code to pass the tests.


@cgohlke wrote on 2010-11-03

Please consider the attached patch for numpy 1.5.x. The *i64 functions are not available in the default compiler used by Python 2.5 64 bit. This simple test now passes but may take several minutes to complete (not appropriate for a standard unit test):

import numpy as np

length = 2**29 + 1
fname = 'test.npy'
dtype = np.dtype('int64')

a = np.arange(length, dtype=dtype)
a.tofile(fname)
del a
a = np.fromfile(fname, dtype=dtype)
assert a[-1] == len(a)-1

np.save(fname, a)
del a
a = np.load(fname)
assert a[-1] == len(a)-1
del a
np.save(fname, [])  # todo: delete file


Milestone changed to 1.5.1 by @rgommers on 2010-11-04


Attachment added by @cgohlke on 2010-11-04: ticket1660.diff


trac user mspacek wrote on 2010-11-04

I've tried out Christoph's patch, as provided in his latest binary at http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy, and it all seems to work now! Here are the tests I ran. They do take a fair while to run. 6GB seems a reasonable trade-off between speed and ensuring there's no size limit:

SIXGB = 2**32 + 2**31

np.zeros(SIXGB, np.int8).tofile('test.npy')
a = np.fromfile('test.npy', dtype=np.int8)
assert len(a) == SIXGB

np.save('test.npy', np.ones(SIXGB, np.int8))
a = np.load('test.npy')
assert len(a) == SIXGB
assert (a == 1).all() # this doubles memory usage temporarily
assert a.sum(dtype=np.int64) == SIXGB

I've never written unit tests before. I'll see what I can learn and try and submit something to the ticket.


trac user mspacek wrote on 2010-11-04

Also, I think I've figured out why the IOError was being raised during the direct call to np.fromfile(), but not during the indirect call via np.load(). fromfile's count arg defaults to -1, meaning "read all entries", which drops it into the num < 0 branch in /numpy/core/src/multiarray/ctors.c around line 3037. There it does plain 32-bit ftell/fseek calls to determine the file size, and those were failing. np.load() calls fromfile with count set explicitly to something positive, so all the seeking is skipped and it goes straight through to the fread call without any problems.
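This analysis suggests a pure-Python workaround until a C-level fix lands: compute the element count from the file size and pass it explicitly, so fromfile() never enters the failing ftell/fseek path. A minimal sketch (file name and sizes here are illustrative):

```python
import os
import numpy as np

a = np.arange(100, dtype=np.int64)
a.tofile('count_test.bin')

# count=-1 (the default) makes fromfile() use ftell/fseek to find the file
# size, which is what fails on 64-bit Windows; passing count explicitly
# skips that code path and goes straight to fread.
dtype = np.dtype(np.int64)
count = os.path.getsize('count_test.bin') // dtype.itemsize
b = np.fromfile('count_test.bin', dtype=dtype, count=count)
assert (a == b).all()
```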


trac user mspacek wrote on 2010-11-04

Here's a unit test patch. With this, np.test('full') hangs without Christoph's patch, and completes successfully with his patch. Unfortunately, the test takes > 1 min to write slightly more than 4GB (the required threshold) to a non-SSD drive. Don't think there's much that can be done about that.


Attachment added by trac user mspacek on 2010-11-04: 0001-add-unittest-for-ticket-1660.patch


Attachment added by trac user mspacek on 2010-11-04: test_big_binary.py


@rgommers wrote on 2010-11-04

About the unit test: you applied the slow decorator; that's probably all you can do. Or maybe we need @dec.very_slow ...

A few other points:

  • tempfile.mktemp() is deprecated, use mkstemp() or NamedTemporaryFile().

  • don't use assert but assert_() from numpy.testing. The builtin assert gets stripped when byte-compiling with -O.

  • don't use capitalized variable names.

  • use separate lines for try/except blocks, like:

        try:
            ...
        except MemoryError:
            pass
        try:
            os.unlink(fname)
        except WindowsError:
            pass
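Putting those review points together, the test might look something like this sketch. The size is a tiny stand-in for the real >4 GB threshold, and OSError is used in place of WindowsError so the snippet also runs on non-Windows systems (where WindowsError is undefined); the slow decorator is omitted since it adds nothing at this size:

```python
import os
import tempfile
import numpy as np
from numpy.testing import assert_

def test_big_binary():
    # 'size' is a small stand-in; the real test must write > 2**32 bytes
    size = 2**16
    fd, fname = tempfile.mkstemp()  # mkstemp() instead of deprecated mktemp()
    os.close(fd)
    try:
        np.zeros(size, np.int8).tofile(fname)
        a = np.fromfile(fname, dtype=np.int8)
        assert_(len(a) == size)  # assert_() survives byte-compiling with -O
    except MemoryError:
        pass
    finally:
        try:
            os.unlink(fname)
        except OSError:  # WindowsError in the original suggestion
            pass
    return fname  # returned only so cleanup can be checked externally

test_big_binary()
```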


Attachment added by trac user mspacek on 2010-11-04: add-unittest-for-ticket-1660.patch


@charris wrote on 2011-01-23

Code style needs some fixes, which I'll do.


@charris wrote on 2011-01-23

Fixed in 0baf0ec and 54b7a0d.
