save/load and tofile/fromfile fail silently for large arrays on Mac OS X #2806
I think this is local to Mac OS X and is related to numpy.fromfile failing for large arrays. See this Stack Overflow question: I can verify the bug described in the Stack Overflow case:

>>> import numpy
>>> a = numpy.random.randn(300000000)
>>> a.tofile('a.tofile')
>>> b = numpy.fromfile('a.tofile', count=int(8e7))
>>> b
array([-0.57060504, 0.32796127, -1.23472672, ..., 0.28363057,
-1.69623226, 2.36057118])
>>> b = numpy.fromfile('a.tofile', count=int(8e8))
>>> b
array([-0.57060504, 0.32796127, -1.23472672, ..., 0. ,
0. , 0. ])
>>> b = numpy.fromfile('a.tofile', count=int(8e9))
>>> b
array([ 0., 0., 0., ..., 0., 0., 0.])
>>> numpy.fromfile('a.tofile')
array([ 0., 0., 0., ..., 0., 0., 0.]) It looks like it is a bug with Mac os X fread and might need a similar workaround as provided for issue #2256 |
I think I found a way to fix both of the issues. The end of array_fromfile_binary in numpy/core/src/multiarray/ctors.c currently looks like this:

    NPY_BEGIN_ALLOW_THREADS;
    *nread = fread(PyArray_DATA(r), dtype->elsize, num, fp);
    NPY_END_ALLOW_THREADS;

Changing it into the following fixes both the issues with save/load and the issues with fromfile/tofile:

    NPY_BEGIN_ALLOW_THREADS;
    #if defined(__APPLE__)
        /* Workaround for read failures on OS X. Issue #2806 */
        {
            npy_intp maxsize = 2147483647 / dtype->elsize;
            npy_intp chunksize;
            size_t n = 0;
            size_t n2;

            while (num > 0) {
                chunksize = (num > maxsize) ? maxsize : num;
                n2 = fread((void *)
                           ((char *)PyArray_DATA(r) + (n * dtype->elsize)),
                           dtype->elsize,
                           (size_t) chunksize, fp);
                if (n2 < (size_t) chunksize) {
                    break;
                }
                n += n2;
                num -= chunksize;
            }
            *nread = n;
        }
    #else
        *nread = fread(PyArray_DATA(r), dtype->elsize, num, fp);
    #endif
    NPY_END_ALLOW_THREADS;

Now about the number 2147483647: I originally copy-pasted the code from the workaround for issue #2256, which uses 2147483648 (== 2^31) as the maxsize threshold. I found that threshold, 2^31, to work around the issue with np.save, but not the issues with fromfile/tofile; 2^31 - 1 fixes both bugs, though. This worries me, as I feel like I am missing something here: from what I understand, np.load calls fromfile at some point, so the fix should either work for both of them or for neither, not just for one. I would like to hear some thoughts on this.
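As a quick sanity check on the chunking arithmetic above, a small sketch (plain C, with values taken from the repro earlier in the thread) of the fread requests the workaround would issue for 300000000 float64 items:

    #include <stdio.h>

    int main(void)
    {
        long long elsize = 8;                        /* float64 */
        long long num = 300000000;                   /* items requested */
        long long maxsize = 2147483647 / elsize;     /* 268435455 items */

        while (num > 0) {
            long long chunk = (num > maxsize) ? maxsize : num;
            printf("fread of %lld items (%lld bytes)\n", chunk, chunk * elsize);
            num -= chunk;
        }
        return 0;
    }
    /* Prints two requests: 268435455 items (2147483640 bytes), then
     * 31564545 items -- each one below the 2^31 - 1 byte threshold. */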
I have put a proposed fix, along with unit tests, into my fork:
@sauliusl: it would be cleaner to add new helper functions
multiarray/convert.c also has an inline copy of this same loop for fwrite, in an #ifdef _WIN64. This kind of duplication is silly, as @pv says; we should pull these out into wrapper functions, and the wrapper should just use a sensible buffer size that's well clear of these magic limits -- like 2**25 or something (= 32 megabytes per call). It just needs to be large enough that reading/writing that much data is more expensive than a syscall, and syscalls are pretty cheap. And the loop should be used unconditionally on all systems instead of messing about with #ifdefs: there's no downside to calling fread/fwrite 10 times instead of once when reading/writing hundreds of megabytes, and it's always better to have one well-tested code path than many poorly tested ones.

(Also, please use more descriptive commit messages -- "attempt at fixing the bug" is clear now, but will be pretty puzzling in a few years when someone is looking at 'git log' ;-).)

Are you testing

Go ahead and submit a pull request and we can sort all this stuff out there.
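A rough sketch of what such a wrapper might look like. The name NumPyOS_fread is taken from the commit message below; the 2**25-byte chunk size and the unconditional loop follow the suggestion above, but the body is an illustration, not the merged NumPy implementation:

    /* Chunked fread wrapper, used unconditionally on all platforms.
     * Every underlying fread() request stays well below the 2**31-byte
     * magic limits discussed in this thread. */
    #include <stdio.h>

    #define IO_CHUNK_BYTES (1 << 25)    /* 32 MiB per fread() call (assumed) */

    static size_t
    NumPyOS_fread(void *buf, size_t elsize, size_t num, FILE *fp)
    {
        size_t maxitems = IO_CHUNK_BYTES / elsize;   /* items per chunk */
        size_t nread = 0;

        if (maxitems == 0) {
            maxitems = 1;               /* pathological elsize > 32 MiB */
        }
        while (num > 0) {
            size_t chunk = (num > maxitems) ? maxitems : num;
            size_t n = fread((char *)buf + nread * elsize, elsize, chunk, fp);
            nread += n;
            if (n < chunk) {
                break;                  /* short read: EOF or I/O error */
            }
            num -= chunk;
        }
        return nread;
    }

An analogous NumPyOS_fwrite would mirror the same loop around fwrite, replacing both the #ifdef __APPLE__ copy in ctors.c and the #ifdef _WIN64 copy in convert.c.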
as NumPyOS_fwrite to unify the fwrites. Also made multiarray/ctors.c and multiarray/convert.c use these new functions. See discussion in issue numpy#2806.
I believe this is fixed in Mavericks; closing, as the best fix is an OS X upgrade. Please reopen if the problem is something you think needs to be dealt with.
@charris - just to note that I also just ran into this. Personally, I think a workaround should be provided in NumPy, as an OS upgrade is a pretty extreme solution.
Does the bug apply on 10.6? If so, and there is a patch, I think it would be worth merging, because 10.6 is still very much with us; current figures from https://www.adium.im/sparkle/#osVersion give 10.6 as 14.5 percent of OS X, and this for a messaging app which might well be run on machines with a less conservative upgrade scheme than Macs used for science.
@astrofrog It's the Apple way ;) However, chunking is not a bad idea in any case, so if someone is motivated to make a PR I'll take it.
Chuck - how about #2931?
@matthew-brett The old PR languished; it lacked follow-through.
Would you be prepared to accept a de-languished version of the PR? @astrofrog - do you have any interest in working on this?
Sure.
@matthew-brett - what remained to be done with respect to #2931?
IIRC, chunk size. A chunk size of around 256 meg seemed to be optimal for the zip file fixes. See http://nbviewer.ipython.org/gist/btel/5729671.
Testing was also a problem, but with the smaller chunk size I think one could get away with a smaller test file. IIRC, the Travis environment gives us 2 GiB.
@charris - thanks! I won't have time to look into this over the next week, but hopefully will have some time after.
@charris Depends what you are testing -- the bug only occurred for file sizes over 2 GB, hence the test doing the massive I/O operation to verify that all is fine.
A test that files larger than the chunk size work would cover 99% of the cases.
It seems that saving large numpy arrays to disk using numpy.save fails silently for very large arrays. For example:

Note the zeros at the end of the loaded matrix that do not exist in a.

I'm running numpy version 1.6.2 on Python 2.7.3 on Mac OS X 10.8.2, if that helps.