~2**32 byte tofile()/fromfile() limit in 64-bit Windows (Trac #1660) #2256


Closed
numpy-gitbot opened this issue Oct 19, 2012 · 17 comments

Original ticket http://projects.scipy.org/numpy/ticket/1660 on 2010-11-03 by trac user mspacek, assigned to @charris.

I'm using Christoph Gohlke's amd64 builds (http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy) on 64-bit Windows 7, with the official Python 2.6.6 amd64 install. tofile(), save(), and fromfile() fail on numpy arrays larger than roughly 2**32 bytes. For save() and tofile(), Python hangs, using 100% CPU on one core. For fromfile(), it raises an IOError. This doesn't happen in 64-bit Linux on the same machine. load() seems to work on any size of file. Here are some examples, and some caveats:

# if .tofile() hangs, it hangs immediately:
np.zeros(2**32+2**16, np.int8).tofile('test.npy') # hangs, file is 2**16 bytes
np.zeros(2**32+2**2,  np.int8).tofile('test.npy') # hangs, file is 0 bytes
np.zeros(2**32+2,     np.int8).tofile('test.npy') # hangs, file is 0 bytes
np.zeros(2**32+1,     np.int8).tofile('test.npy') # hangs, file is 0 bytes
np.zeros(2**32,       np.int8).tofile('test.npy') # hangs, file is 0 bytes
np.zeros(2**32-1,     np.int8).tofile('test.npy') # works

# if np.save() hangs, it takes a while for it to hang:
np.save('test.npy', np.zeros(2**33, np.int8)) # hangs, file is 2**32 bytes, npy header is intact
np.save('test.npy', np.zeros(2**33-1, np.int8)) # hangs, file is 2**32 bytes, npy header is intact
np.save('test.npy', np.zeros(2**32+2**31, np.int8)) # hangs, file is 2**31 bytes, npy header is intact
np.save('test.npy', np.zeros(2**32+2**16, np.int8)) # hangs, file is 2**16 bytes, npy header is intact
np.save('test.npy', np.zeros(2**32+2**12, np.int8)) # hangs, file is 2**12 bytes, npy header is intact
np.save('test.npy', np.zeros(2**32+2**11, np.int8)) # works
np.save('test.npy', np.zeros(2**32+2**10, np.int8)) # works
np.save('test.npy', np.zeros(2**32+2**8, np.int8)) # works
np.save('test.npy', np.zeros(2**32+2**4, np.int8)) # works
np.save('test.npy', np.zeros(2**32+2**2, np.int8)) # works
np.save('test.npy', np.zeros(2**32+2, np.int8)) # works
np.save('test.npy', np.zeros(2**32+1, np.int8)) # works
np.save('test.npy', np.zeros(2**32, np.int8)) # works

# I also generated these 3 large files successfully in 64-bit Linux...
np.zeros(2**33, np.int8).tofile('test1.npy') # 8 GB
np.zeros(2**32, np.int8).tofile('test2.npy') # 4 GB
np.save('test3.npy', np.zeros(2**33, np.int8)) # 8 GB

# ...and then tried reading them back in 64-bit Win7:
a = np.fromfile('test1.npy', dtype=np.int8) # IOError: could not seek in file
a = np.fromfile('test2.npy', dtype=np.int8) # IOError: could not seek in file
a = np.load('test3.npy') # strangely enough, this works fine, even though it's 8 GB!

I have this problem regardless of which of Christoph's amd64 numpy builds (1.4.1, 1.5.0, 1.5.1RC1, MKL or non-MKL) I use. This happens on both my 64-bit Win7 i7 12 GB machine, and on another machine running 64-bit WinXP. Christoph has also confirmed this on Python 2.5 through 3.1 and says there is nothing special about his builds, and that this will affect all 64-bit Windows distributions of numpy (EPD, ActiveState, etc).


@cgohlke wrote on 2010-11-03

The tofile() call hangs in numpy\core\src\multiarray\convert.c line 84:

            n = fwrite((const void *)self->data,
                    (size_t) self->descr->elsize,
                    (size_t) size, fp);

This seems to be the reason: "fwrite issues with large data write" (http://social.msdn.microsoft.com/Forums/en-US/vcgeneral/thread/7c913001-227e-439b-bf07-54369ba07994)


@cgohlke wrote on 2010-11-03

Probably the same issue as http://bugs.python.org/issue9015.


trac user mspacek wrote on 2010-11-03

Well, that's depressing. I should just switch to Linux already. It looks like fseek and ftell are 32-bit in MSVC, even on Win64. Perhaps _fseeki64 and _ftelli64 need to be used instead:

http://msdn.microsoft.com/en-us/library/75yw9bf3%28v=VS.90%29.aspx

http://msdn.microsoft.com/en-us/library/0ys3hc0b%28v=VS.90%29.aspx

http://www.firstobject.com/fseeki64-ftelli64-in-vc++.htm

This might explain why "IOError: could not seek in file" is thrown in /numpy/core/src/multiarray/ctors.c line 3037 when I call fromfile():

static PyArrayObject *
array_fromfile_binary(FILE *fp, PyArray_Descr *dtype, intp num, size_t *nread)
{
    PyArrayObject *r;
    intp start, numbytes;

    if (num < 0) {
        int fail = 0;

        start = (intp) ftell(fp);
        if (start < 0) {
            fail = 1;
        }
        if (fseek(fp, 0, SEEK_END) < 0) {
            fail = 1;
        }
        numbytes = (intp) ftell(fp);
        if (numbytes < 0) {
            fail = 1;
        }
        numbytes -= start;
        if (fseek(fp, start, SEEK_SET) < 0) {
            fail = 1;
        }
        if (fail) {
            PyErr_SetString(PyExc_IOError,
                            "could not seek in file");
            Py_DECREF(dtype);
            return NULL;
        }
        num = numbytes / dtype->elsize;
    }
    r = (PyArrayObject *)PyArray_NewFromDescr(&PyArray_Type,
                                              dtype,
                                              1, &num,
                                              NULL, NULL,
                                              0, NULL);
    if (r == NULL) {
        return NULL;
    }
    NPY_BEGIN_ALLOW_THREADS;
    *nread = fread(r->data, dtype->elsize, num, fp);
    NPY_END_ALLOW_THREADS;
    return r;
}

However, I don't understand why the error comes up when I call fromfile() directly, but not when I call it indirectly via np.load().


trac user mspacek wrote on 2010-11-03

As a workaround for the tofile() problem, perhaps it could check whether the platform is Windows, and if so, call fwrite multiple times in, say, 2 GB chunks until all the data is written out.

Unfortunately, I don't have the skills to compile on win64, let alone implement something like this.


@cgohlke wrote on 2010-11-03

Rather than getting depressed or dumping Windows, how about you write/adjust the unit tests and I'll try to fix the code to pass the tests.


@cgohlke wrote on 2010-11-03

Please consider the attached patch for numpy 1.5.x. The *i64 functions are not available in the default compiler used by Python 2.5 64 bit. This simple test now passes but may take several minutes to complete (not appropriate for a standard unit test):

import numpy as np

length = 2**29 + 1
fname = 'test.npy'
dtype = np.dtype('int64')

a = np.arange(length, dtype=dtype)
a.tofile(fname)
del a
a = np.fromfile(fname, dtype=dtype)
assert a[-1] == len(a)-1

np.save(fname, a)
del a
a = np.load(fname)
assert a[-1] == len(a)-1
del a
np.save(fname, [])  # todo: delete file


Milestone changed to 1.5.1 by @rgommers on 2010-11-04


Attachment added by @cgohlke on 2010-11-04: ticket1660.diff


trac user mspacek wrote on 2010-11-04

I've tried out Christoph's patch, as provided in his latest binary at http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy, and it all seems to work now! Here are the tests I ran. They do take a fair while to run. 6GB seems a reasonable trade-off between speed and ensuring there's no size limit:

SIXGB = 2**32 + 2**31

np.zeros(SIXGB, np.int8).tofile('test.npy')
a = np.fromfile('test.npy', dtype=np.int8)
assert len(a) == SIXGB

np.save('test.npy', np.ones(SIXGB, np.int8))
a = np.load('test.npy')
assert len(a) == SIXGB
assert (a == 1).all() # this doubles memory usage temporarily
assert a.sum(dtype=np.int64) == SIXGB

I've never written unit tests before. I'll see what I can learn and try and submit something to the ticket.


trac user mspacek wrote on 2010-11-04

Also, I think I've figured out why the IOError was being raised during the direct call to np.fromfile(), but not during the indirect call via np.load(). fromfile's count arg defaults to -1, meaning "read all entries", which drops it into the num < 0 branch in /numpy/core/src/multiarray/ctors.c around line 3037. There it does plain 32-bit ftell/fseek calls to determine the file size, and those were failing. np.load() calls fromfile with count set explicitly to something positive, so all the seeking is skipped and it goes straight through to the fread call without any problems.
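This analysis suggests a pure-Python workaround until a C-level fix lands: compute the element count from the file size and pass it explicitly, so fromfile() never enters the failing ftell/fseek path. A minimal sketch (file name and sizes here are illustrative):

```python
import os
import numpy as np

a = np.arange(100, dtype=np.int64)
a.tofile('count_test.bin')

# count=-1 (the default) makes fromfile() use ftell/fseek to find the file
# size, which is what fails on 64-bit Windows; passing count explicitly
# skips that code path and goes straight to fread.
dtype = np.dtype(np.int64)
count = os.path.getsize('count_test.bin') // dtype.itemsize
b = np.fromfile('count_test.bin', dtype=dtype, count=count)
assert (a == b).all()
```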


trac user mspacek wrote on 2010-11-04

Here's a unit test patch. With this, np.test('full') hangs without Christoph's patch, and completes successfully with his patch. Unfortunately, the test takes > 1 min to write slightly more than 4GB (the required threshold) to a non-SSD drive. Don't think there's much that can be done about that.


Attachment added by trac user mspacek on 2010-11-04: 0001-add-unittest-for-ticket-1660.patch


Attachment added by trac user mspacek on 2010-11-04: test_big_binary.py


@rgommers wrote on 2010-11-04

About the unit test: you applied the slow decorator; that's probably all you can do. Or maybe we need @dec.very_slow ...

A few other points:

  • tempfile.mktemp() is deprecated, use mkstemp() or NamedTemporaryFile().

  • don't use assert but assert_() from numpy.testing. The builtin assert gets stripped when byte-compiling with -O.

  • don't use capitalized variable names.

  • use separate lines for try/except blocks, like:

        try:
            ...
        except MemoryError:
            pass
        try:
            os.unlink(fname)
        except WindowsError:
            pass
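Putting those review points together, the test might look something like this sketch. The size is a tiny stand-in for the real >4 GB threshold, and OSError is used in place of WindowsError so the snippet also runs on non-Windows systems (where WindowsError is undefined); the slow decorator is omitted since it adds nothing at this size:

```python
import os
import tempfile
import numpy as np
from numpy.testing import assert_

def test_big_binary():
    # 'size' is a small stand-in; the real test must write > 2**32 bytes
    size = 2**16
    fd, fname = tempfile.mkstemp()  # mkstemp() instead of deprecated mktemp()
    os.close(fd)
    try:
        np.zeros(size, np.int8).tofile(fname)
        a = np.fromfile(fname, dtype=np.int8)
        assert_(len(a) == size)  # assert_() survives byte-compiling with -O
    except MemoryError:
        pass
    finally:
        try:
            os.unlink(fname)
        except OSError:  # WindowsError in the original suggestion
            pass
    return fname  # returned only so cleanup can be checked externally

test_big_binary()
```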


Attachment added by trac user mspacek on 2010-11-04: add-unittest-for-ticket-1660.patch


@charris wrote on 2011-01-23

Code style needs some fixes, which I'll do.


@charris wrote on 2011-01-23

Fixed in 0baf0ec and 54b7a0d.
