ENH: allow NumPy created .npy files to be appended in-place #20321
Conversation
Thanks for the contribution!
Some feedback on the code itself, without answering the question of whether this is likely to end up in numpy (for which the mailing list is the better place for discussion); even if it doesn't end up in numpy, hopefully it's useful feedback you can apply to https://github.com/xor2k/npy-append-array
numpy/lib/format.py
Outdated
def has_fortran_order(arr):
    return not arr.flags.c_contiguous and arr.flags.f_contiguous
This is just arr.flags.fnc
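For illustration, a quick check of that equivalence (a minimal sketch; only meaningful for ndim >= 2, since 1-D arrays are both C- and F-contiguous):

```python
import numpy as np

a = np.zeros((3, 4), order="F")
b = np.zeros((3, 4), order="C")
# fnc is shorthand for "F-contiguous and not C-contiguous"
assert a.flags.fnc == (not a.flags.c_contiguous and a.flags.f_contiguous)  # True
assert b.flags.fnc == (not b.flags.c_contiguous and b.flags.f_contiguous)  # False
```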
That's cool, update on the way.
numpy/lib/format.py
Outdated
    return self

def __exit__(self, exc_type, exc_val, exc_tb):
    del self
This doesn't do anything; del just deletes the local reference to self, and won't actually call __del__ here.
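A minimal sketch of what __exit__ could do instead (assuming the class keeps its open file handle in self.fp):

```python
def __exit__(self, exc_type, exc_val, exc_tb):
    # close the underlying file explicitly instead of relying on __del__
    self.close()

def close(self):
    if self.fp is not None:
        self.fp.close()
        self.fp = None
```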
That's bad, I'll also fix this in the npy-append-array module, thanks a lot!
numpy/lib/format.py
Outdated
if has_fortran_order(arr):
    raise NotImplementedError("fortran_order not implemented")
This can never fire, since the check above already checks for this.
Indeed, I've missed that, will also update the npy-append-array module.
numpy/lib/format.py
Outdated
self.is_version_1 = magic[0] == 1 and magic[1] == 0
self.is_version_2 = magic[0] == 2 and magic[1] == 0

if not self.is_version_1 and not self.is_version_2:
    raise NotImplementedError(
        "version (%d, %d) not implemented" % magic
    )

self.header_length, = unpack("<H", peek(fp, 2)) if self.is_version_1 \
    else unpack("<I", peek(fp, 4))

self.header = read_array_header_1_0(fp) if \
    self.is_version_1 else read_array_header_2_0(fp)
Instead of doing this dance, could you just call _read_array_header(fp, version), and then use fp.tell() to work out the header length?
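Something along these lines (a sketch using the existing helpers; _read_array_header is a private function of numpy.lib.format):

```python
from numpy.lib.format import read_magic, _read_array_header

# fp: file object positioned at the start of the .npy file
version = read_magic(fp)                           # e.g. (1, 0) or (2, 0)
shape, fortran_order, dtype = _read_array_header(fp, version)
header_length = fp.tell()                          # header ends where data begins
```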
Indeed, since I now have access to all the np.lib.format functions, I can do this directly, very nice!
numpy/lib/format.py
Outdated
new_header_map = header_tuple_dict(self.header)

new_header_bytes = self.__create_header_bytes(new_header_map, True)
header_length = len(self.header_bytes)
Why not just use self.header_length?
Indeed, that length should never change.
I've just double checked: header_length should probably not be an attribute of NpyAppendArray but rather a local variable of __init__, and I should also rename it; update on the way.
numpy/lib/format.py
Outdated
self.header_bytes = fp.read(self.header_length + (
    10 if self.is_version_1 else 12
))
What are 10 and 12 here?
This is the byte offset. The header length field is little-endian: an unsigned short (2 bytes) in version 1 and an unsigned int (4 bytes) in version 2. Added to the 8 bytes of magic string and version, that gives 10 and 12. I can replace this with a calculation as in
https://numpy.org/devdocs/reference/generated/numpy.lib.format.html#format-version-1-0
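Spelled out as a small calculation (constant names are illustrative):

```python
MAGIC_LEN = 6 + 2          # b"\x93NUMPY" plus the two version bytes
offset_v1 = MAGIC_LEN + 2  # version 1 stores the header length as "<H" -> 10
offset_v2 = MAGIC_LEN + 4  # version 2 stores the header length as "<I" -> 12
```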
Should disappear in the new version.
@eric-wieser Can I just push new code and update the pull request or will this break the references in our conversation?
Just added a new version, heavily simplified, many thanks @eric-wieser!
Renamed NpyAppendArray to just AppendArray. Thanks @seberg for some inspiration on whether to keep the AppendArray class approach or extend numpy.save with an append argument (compare #11939): keeping the class approach is indeed better, since it allows the header to be persisted during the append process, so that it can be written once the array is finished instead of every time data is appended.

AppendArray is now also more restrictive: it only works if there is spare space in the .npy file header and throws an exception otherwise.

Considering multithreaded read/write: it is not supposed to work. It should not work with numpy.load/numpy.save either. Or am I missing something?

Thanks everybody so far! I would highly appreciate more feedback on next steps for integration in NumPy, since some quality standards are most likely not met yet.
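For context, a rough sketch of how the class approach reads in use (file name and arrays are illustrative, not from the PR):

```python
import numpy as np
from numpy.lib.format import AppendArray  # the class proposed in this PR

# the header is kept in memory while appending and written out once at the end
with AppendArray("data.npy") as aa:
    aa.append(np.arange(10).reshape(2, 5))
    aa.append(np.arange(10).reshape(2, 5))
# data.npy now contains a (4, 5) array
```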
numpy/lib/format.py
Outdated
write_array_header_2_0(io, header_map)

# create array header with 64 bytes of spare space for the shape to grow
io.getbuffer()[8:12] = pack("<I", int(io.getbuffer().nbytes-12+64))
io.getbuffer()[-1] = 32
io.write(b" "*64)
io.getbuffer()[-1] = 10
I think you should add an argument to write_array_header_2_0 to encapsulate this, either as:
- min_header_size, which sets the minimum header padding size, or
- extra_header_padding, which augments the header with extra padding.
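A sketch of the second variant (the signature and the helper name are illustrative, not from the patch):

```python
def write_array_header_2_0(fp, d, extra_header_padding=0):
    # build the header as before, then insert spare spaces before the
    # final newline so the shape can later be rewritten in place
    header = _build_header_bytes(d)  # hypothetical helper for the existing logic
    header = header[:-1] + b" " * extra_header_padding + b"\n"
    fp.write(header)
```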
I think extra_header_padding is what we need. The question is whether to make it an argument or just add 64 bytes by default, since, as mentioned, more energy would be needed to populate such an array than to boil earth's oceans. In a distant future, when our descendants eventually build a Dyson swarm, this value may be increased to 128 bytes to account for all particles in the known universe.

Adding 64 bytes by default will increase the size of every .npy file by 64 bytes, but I cannot really judge how much of a problem that would be. Maybe none at all, maybe filesystem overhead adds 64 bytes anyway, maybe this would pose some issues for certain users who have millions/billions of .npy files lying around. What do you think?
> more energy would be needed to populate such an array than would be necessary to boil earth's oceans. In a distant future, when our descendants eventually build a dyson swarm, this value may be increased to 128 byte to account for all particles in the known universe.
I think you're thinking about this the wrong way. The length of an array is stored in an int64, so we're already limited to 2**63-1. The question is what the longest string is that is less than that size; for instance, (9223372036854775807,) is shorter than (10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10). On the other hand, I think the way you have appending set up means that only one dimension of the array can grow, in which case len(str(2**63-1)) - len(str(len(arr))) is the needed amount of padding.
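For example, with an illustrative array length of 1000:

```python
>>> arr_len = 1000
>>> len(str(2**63 - 1)) - len(str(arr_len))   # 19 digits maximum, 4 used
15
```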
So perhaps write_array_header_2_0(..., extra_padding=True), and have write_array_header_2_0 deal with calculating how much padding is needed.
I have chosen to try the -1 option, see below. Then we would not need to handle possible growing headers in the first place.
That's true: the more and the larger dimensions the array has, the more bytes are covered by advancing along the first axis. So the worst case is an array with only one dimension, of dtype int8 or so. So len(str(2**63-1)) (=19) would be an upper bound. That means we could also just take 19 bytes as a buffer instead of 64 bytes. I would propose to make the code as simple as possible, so one could just add 19 bytes and not try to reduce that byte count by taking the influence of the other axes into account. For most arrays, 19 spare bytes would probably not even make a difference in the file size, and even if they did, the difference would probably be consumed by filesystem overhead.

I'll do it in a follow-up pull request; first this pull request here (which does not affect anything else) should go through (as far as I understand it).
Actually, shouldn't it be 2^64-1? I mean, size_t is a uint64_t in C (for 64 bit machines). This would increase the buffer from 19 to 20 bytes.
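Counting the digits directly:

```python
>>> len(str(2**63 - 1))   # largest int64
19
>>> len(str(2**64 - 1))   # largest uint64
20
```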
numpy/lib/format.py
Outdated
self.shape[0] += arr.shape[0]

arr.tofile(self.fp)
This doesn't quite feel safe to me; you can't guarantee in Python that the __del__ or close method is actually called, and you risk leaving behind a silently truncated file without any record that it got truncated. Should we have a "dirty" flag in the header, which gets cleared by close(), such that the user can at least detect when they open a file that's broken in this way?
This issue used to exist before as well, also in numpy.save: if for some reason the program crashes before the file is finished, its header and content will be in an inconsistent state. This was also an issue in all previous versions of AppendArray, but may be more pronounced here.

Did I get you right that you propose to add a dirty flag to the .npy file format? I was trying to add some functionality without introducing a new numpy file format version. However, I already had some ideas for a follow-up pull request, which would allow -1 to be specified as shape[0], so that the size of the array could be inferred from the file size. This would be consistent with e.g. numpy.reshape. Would that eliminate the need for a dirty flag, what do you think?
So, here are the options as I see them:
1. Rewrite the header after every append call. This probably slows things down a bit as you end up jumping back and forth in the file. This at least means that as long as append itself is not interrupted, the user knows their data is safe. Perhaps we should profile it.
2. Write -1 to the header while the AppendArray is open, then replace it with the final size when .close() runs normally. This is essentially the dirty flag I was suggesting, without needing any new header space. Old versions of numpy will not be able to open a file that is dirty in this way.
3. Add "dirty": True to the header dictionary while the AppendArray is open, and then remove it again.
Option 2, the -1 as shape[0] of the array, I have meant in a different way: the -1 would basically stay there forever; the header would not need to be touched anymore and would also never grow. Having the -1 there indicates that the size of the array is determined by the file size. So when the array is loaded and the actual ndarray is constructed, the shape of the ndarray object would not contain -1 but the actual size, while the .npy file header stays the same. This would allow .npy to be the digital version of CSV (just kidding, I mean "binary version of CSV" of course), where data is also simply appended to the file's end. Perfectly suited for binary log files.

There would not be any indication whether a write has succeeded or failed, though. A partial, unsuccessful write could however potentially be identified if the file size (minus header size) is not divisible by the product of the remaining shape entries (times the entry byte size). So potential errors need to be caught elsewhere in the file creation process.

All .npy files could theoretically have a shape[0] of -1, since it should be possible to derive the array size from the file size anyway.

What do you think about this interpretation of option 2?

I cannot speak for fortran_order though. Does anybody have a clue?
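A sketch of both the size inference and the divisibility check described above (the helper name and signature are hypothetical):

```python
import os

def inferred_shape0(filename, header_size, itemsize, trailing_shape):
    # derive shape[0] from the file size when the header stores -1
    row_bytes = itemsize
    for dim in trailing_shape:
        row_bytes *= dim
    data_bytes = os.path.getsize(filename) - header_size
    if data_bytes % row_bytes != 0:
        raise ValueError("partial write: data size is not a multiple of row size")
    return data_bytes // row_bytes
```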
I have implemented option 1, "Rewrite the header after every append call", and then option 2 as described in a later commit. What do you think? I can revert to the option 1 commit if option 2 goes too far, as it slightly changes the .npy format.

Also, I made some assumptions about how the Fortran part might work. If someone had a Fortran array to actually test AppendArray, that would be great.
numpy/lib/format.pyi
Outdated
def __init__(self, filename): None
def append(self, arr): None
def func(self): None is not valid in a stub file, so this will have to be updated:

- class AppendArray:
-     def __init__(self, filename): None
-     def append(self, arr): None
+ import os
+ from typing import Any
+ from numpy.typing import NDArray
+
+ class AppendArray:
+     def __init__(self, filename: str | bytes | os.PathLike[str | bytes]) -> None: ...
+     def append(self, arr: NDArray[Any]) -> None: ...
Secondly, a number of (public) methods and attributes are still absent from the stub file (__enter__, fp, etc.); can you add those as well?
Yes, that would be great, I had simply forgotten it in a hurry. Looks much nicer (and more useful) with type annotations as well. Does it make sense to make fp public? Does that fit with the numpy style? Otherwise I'd simply make everything private except the special methods, append and close. How does the modification work technically? Will it change the pull request? Can I pull it back into my forked repository?
I think I've figured it out: this is just a comment with diff syntax highlighting. Just changed it in my repo.
Just sent a new mail on the mailing list, see https://mail.python.org/archives/list/[email protected]/thread/A4CJ2DZCAKPMD2MYGVMDV5UI7FN4SBVI/
Thanks @xor2k - would you be willing to potentially split this into multiple PRs? There seems to be general support for bumping the format, but adding the AppendArray API is more contentious. If you were to encapsulate the changes related to updating the .npy format version into its own PR, that would decouple the decision-making from expanding the API, and I suspect that PR would garner more review/go in quicker.
I have cleaned up my commits and reverted AppendArray to a state where it offers minimal functionality and tests. It should not interfere with anything else at this point and does not introduce a new .npy version.
Disadvantages:
What do you think?
Also, if we go for the
So how do we continue? Right now I have a version of AppendArray which does not affect anything else. I think follow-up pull requests should be based on this one (and modify it), so can this be handled with multiple (concurrent) pull requests, or am I getting something wrong?
I am a bit missing a strong enough opinion myself to actually figure out how to push this forward. The current state (from my point of view) is:
@rkern I don't want to snipe you, but do you have any quick opinion on evolving the .npy format?
I am mostly -1 on extending the format to allow for implicit lengths. The format has been used in the past as a component in ad hoc file formats that concatenate multiple NPY blocks, so any kind of inference would be wrong in those cases. The input file-like object might well be a non-seekable stream that prevents us from actually inferring the full size without going through an application-specific side-channel. I'd be happy for the extra header padding to be folded into _write_array_header().
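For example, the ad hoc multi-block usage described here, which file-size inference would break (the file name is hypothetical):

```python
import numpy as np

# two NPY blocks concatenated back-to-back in one container file
with open("blocks.npy", "rb") as fp:
    first = np.lib.format.read_array(fp)   # reads exactly one NPY block
    second = np.lib.format.read_array(fp)  # the next block starts right here
```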
Thanks Robert! It sounds like there is a clear next step now: fold in the extra padding, because we can just do that.
BTW, for the extra padding, just adding 64 bytes in _write_array_header seems fine. A more advanced approach would be to check the current number of digits in the append-axis size's string representation, and only add enough padding to ensure that 20 digits (enough for 2**64-1) can fit.
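A sketch of that more advanced approach (constant and helper names are illustrative, not the final implementation):

```python
GROWTH_AXIS_MAX_DIGITS = 21  # enough even for 8 * 2**64 - 1, a hypothetical 1-bit dtype

def growth_axis_padding(d):
    # pad only as much as needed for the growth axis to reach 21 digits;
    # for Fortran order the last axis grows, otherwise the first
    shape = d["shape"]
    if not shape:
        return 0
    growth = shape[-1] if d["fortran_order"] else shape[0]
    return max(GROWTH_AXIS_MAX_DIGITS - len(repr(growth)), 0)
```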
numpy/lib/format.py
Outdated
@@ -431,6 +433,19 @@ def _write_array_header(fp, d, version=None):
        header.append("'%s': %s, " % (key, repr(value)))
    header.append("}")
    header = "".join(header)

    # Check if d is actual array header data or just some random dict as e.g.
    # in test_large_header. In the first case, add spare space for growth
I'd rather fix test_large_header() instead.
Yes, I thought this might be an option too; this is why I mentioned it in the comment. I'll have a look into it (the next push won't fix it, just minor cleanup).
Just so that I understand correctly: fixing test_large_header() implies that _write_array_header() can assume that the array is correct enough, which implies it must do the same sanity checks as e.g. _read_array_header(), which would then imply that shape and fortran_order are set (otherwise it would not make sense to modify test_large_header() and _write_array_header() in the first place).

Enabling sanity checks for _write_array_header() would require the tests test_large_header(), test_bad_header() and test_metadata_dtype() to be modified as well. While the first one is straightforward, the second one is a little ugly, slightly limits the readability of the test case, and could make the test case useless in case of future format modifications; the third one is a little challenging because _write_array_header() cannot produce certain kinds of invalid arrays anymore.

I've pushed with test_metadata_dtype() failing.

I see the following options to resolve the situation:

1. Revert to the old _write_array_header() variant that only applies spare space if the header dict is nice enough.
2. Fix test_metadata_dtype() in a way that assumes that if a test fails, either writing or reading an array can fail, as they share a portion of their code. However, if this is not the case anymore in the future, or there is an issue in np.save(), this could render the test useless.
3. Introduce a context from contextlib for _write_array_header() to indicate that sanity checks should be skipped. However, I don't know if this is a thing in NumPy or how to properly implement it.
4. Introduce a boolean flag that would allow writing invalid NumPy headers.

Probably 3 is less ugly. What do you think?
I think I might just have found a middle ground by skipping the sanity test in _write_header, only assuming that shape and fortran_order are set, and modifying the tests only so that this works out.
I think this could be it, what do you think?
numpy/lib/tests/test_format.py
Outdated
@@ -934,6 +936,26 @@ def test_unicode_field_names(tmpdir):
    with assert_warns(UserWarning):
        format.write_array(f, arr, version=None)

def test_header_growth_axis():
    import io
There's already a from io import BytesIO import at the top level. No need for a local import.
Done
numpy/lib/tests/test_format.py
Outdated
fp = io.BytesIO()
format.write_array_header_1_0(
    fp, format.header_data_from_array_1_0(arr)
I would recommend constructing test dicts directly instead of grabbing them from a real array. Then you will be more able to test arbitrary lengths of the growth axis, and thus test the important property of the feature introduced here: the invariance of the header length under such changes.
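A sketch of such a test (the dict values are illustrative; the header-length invariance it asserts only holds once the padding change is in place):

```python
def test_header_growth_axis():
    lengths = set()
    for size in (0, 1, 10, 10**10, 2**64 - 1):
        fp = BytesIO()
        format.write_array_header_1_0(fp, {
            'shape': (size, 34), 'fortran_order': False, 'descr': '<i8',
        })
        lengths.add(len(fp.getvalue()))
    # the header length must not depend on the growth-axis size
    assert len(lengths) == 1
```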
That's a great idea, just implemented it.
Help!!! I did not want to close the pull request. Did a force push with the last commit missing 🤦
Ah, got it opened again 😌
Thanks, I think we can add this. Just some smaller comments. The other question is whether the .npy format NEP text should include a mention of this possibility. However, I don't think it needs to.
It sounds like we have settled and the change is now simple enough.
I do not have an opinion on how much space we add (I don't think it really matters). In practice 19 seems enough, 20 if you look at it unsigned, 21 if you want to be overly conservative :).
If nobody else voices an opinion, I am planning to merge it as-is soon.
Thanks everyone, the solution now looks more thought-through and cleaner than ever before!
Thanks @xor2k, let's put this in. If anyone thinks there should be anything followed up on here, please don't hesitate to ask for a change.
Dear all, I've added this pull request to add NpyAppendArray and have written to the mailing list, see
https://mail.python.org/archives/list/[email protected]/thread/57TRDY3ZVMX3DYFHHPQOGWKDVFALSCPQ/
For a summary what NpyAppendArray is, compare
https://github.com/xor2k/npy-append-array
What I have done so far:
If you like the concept of NpyAppendArray, I will dedicate enough time to meet all numpy quality standards (which I have probably not met yet); otherwise, if there is not so much interest, I'm prepared that this pull request may be deleted ;)
Anyway, let's try this & best from Berlin, Michael