ENH: allow NumPy created .npy files to be appended in-place #20321


Merged: 1 commit, merged on Oct 5, 2022

Conversation

@xor2k (Contributor) commented Nov 7, 2021

Dear all, I've opened this pull request to add NpyAppendArray and have written to the mailing list; see

https://mail.python.org/archives/list/[email protected]/thread/57TRDY3ZVMX3DYFHHPQOGWKDVFALSCPQ/

For a summary of what NpyAppendArray is, see

https://github.com/xor2k/npy-append-array

What I have done so far:

  1. Added NpyAppendArray to lib.format
  2. Added a test case to test_format.py
  3. Verified that the test case is actually used by running np.lib.test(verbose=3)
  4. Sent an email to the mailing list (see above)

If you like the concept of NpyAppendArray, I will dedicate enough time to meet all NumPy quality standards (which I have probably not met yet); otherwise, if there is not much interest, I'm prepared for this pull request to be closed ;)

Anyway, let's try this & best from Berlin, Michael

@eric-wieser (Member) left a comment

Thanks for the contribution!

Some feedback on the code itself, without answering the question of whether this is likely to end up in numpy (for which the mailing list is the better place for discussion); even if it doesn't end up in numpy, hopefully it's useful feedback you can apply to https://github.com/xor2k/npy-append-array

Comment on lines 930 to 931
def has_fortran_order(arr):
return not arr.flags.c_contiguous and arr.flags.f_contiguous
Member

This is just arr.flags.fnc
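For illustration (not part of the review, just a quick check of the equivalence):

import numpy as np

# fnc is True only for arrays that are Fortran-contiguous but not C-contiguous,
# e.g. a 2-D array created with order='F'; a 1-D array is both, so fnc is False.
a = np.zeros((3, 4), order='F')
assert a.flags.fnc == (not a.flags.c_contiguous and a.flags.f_contiguous)  # True
b = np.zeros(5)
assert b.flags.fnc == (not b.flags.c_contiguous and b.flags.f_contiguous)  # False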

Contributor Author

That's cool, update on the way.

return self

def __exit__(self, exc_type, exc_val, exc_tb):
del self
Member

This doesn't do anything; del just deletes the local reference to self, and won't actually call __del__ here.
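A minimal illustration of that point (hypothetical class, only to show the behavior):

class Demo:
    def __del__(self):
        print("__del__ called")

    def __exit__(self, exc_type, exc_val, exc_tb):
        del self  # only removes the local name 'self'; it does not call __del__
        print("still inside __exit__")

d = Demo()
d.__exit__(None, None, None)  # prints "still inside __exit__" only
# __del__ runs later, once the last reference to the object actually goes away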

Contributor Author

That's bad, I'll also fix this in the npy-append-array module, thanks a lot!

Comment on lines 997 to 1017
if has_fortran_order(arr):
raise NotImplementedError("fortran_order not implemented")

Member

This can never fire, since the check above already covers this.

Contributor Author

Indeed, I've missed that, will also update the npy-append-array module.

Comment on lines 952 to 964
self.is_version_1 = magic[0] == 1 and magic[1] == 0
self.is_version_2 = magic[0] == 2 and magic[1] == 0

if not self.is_version_1 and not self.is_version_2:
raise NotImplementedError(
"version (%d, %d) not implemented" % magic
)

self.header_length, = unpack("<H", peek(fp, 2)) if self.is_version_1 \
else unpack("<I", peek(fp, 4))

self.header = read_array_header_1_0(fp) if \
self.is_version_1 else read_array_header_2_0(fp)
Member

Instead of doing this dance, could you just call _read_array_header(fp, version), and then use fp.tell() to work out the header length?
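Roughly along these lines, for example (a sketch of the suggestion; read_magic is public, _read_array_header is the existing private helper, and the file object is assumed to start at position 0):

from numpy.lib import format as npformat

def _read_header_and_length(fp):
    # Parse the magic string and header with the existing helpers, then derive
    # the total header length (including magic and size field) from fp.tell().
    version = npformat.read_magic(fp)
    shape, fortran_order, dtype = npformat._read_array_header(fp, version)
    header_length = fp.tell()  # bytes from the start of the file up to the data
    return shape, fortran_order, dtype, header_length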

Contributor Author

Indeed, since I now have access to all the np.lib.format functions, I can do this directly; very nice!

new_header_map = header_tuple_dict(self.header)

new_header_bytes = self.__create_header_bytes(new_header_map, True)
header_length = len(self.header_bytes)
Member

Why not just use self.header_length?

Contributor Author

Indeed, that length should never change.

Contributor Author

I've just double-checked: header_length should probably not be an attribute of NpyAppendArray but rather a local variable of __init__, and I should also rename it; update on the way.

Comment on lines 971 to 973
self.header_bytes = fp.read(self.header_length + (
10 if self.is_version_1 else 12
))
Member

What are 10 and 12 here?

Contributor Author

This is the byte offset. The header length is stored as a little-endian unsigned integer: two bytes (an unsigned short) in version 1 and four bytes (an unsigned int) in version 2; that is where the 10 and 12 come from. I can replace this with a calculation as in
https://numpy.org/devdocs/reference/generated/numpy.lib.format.html#format-version-1-0
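In other words (the constants come straight from the .npy format description):

MAGIC_LEN = 8                  # b'\x93NUMPY' (6 bytes) plus major and minor version bytes
prefix_len_v1 = MAGIC_LEN + 2  # 2-byte little-endian header-length field -> 10
prefix_len_v2 = MAGIC_LEN + 4  # 4-byte little-endian header-length field -> 12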

Contributor Author

Should disappear in the new version.

@xor2k (Contributor Author) commented Nov 7, 2021

@eric-wieser Can I just push new code and update the pull request or will this break the references in our conversation?

@xor2k (Contributor Author) commented Nov 7, 2021

Just added a new version, heavily simplified, many thanks @eric-wieser!

@xor2k (Contributor Author) commented Nov 21, 2021

Renamed NpyAppendArray to just AppendArray. Thanks @seberg for some inspiration on whether to keep the AppendArray class approach or to extend numpy.save with an append argument (compare #11939): keeping the class approach is indeed better, since it allows the header to be kept around during the append process and written once the array is finished, instead of being rewritten every time data is appended. AppendArray is now also more restrictive: it only works if there is spare space in the .npy file header and throws an exception otherwise.

Considering multithreaded read/write: it is not supposed to work. Should not work with numpy.load/numpy.save either. Or am I missing something?

Thanks everybody so far! I would highly appreciate more feedback on next steps for integration in Numpy, since some quality standards most likely are not met yet.

Comment on lines 938 to 944
write_array_header_2_0(io, header_map)

# create array header with 64 bytes of spare space for the shape to grow
io.getbuffer()[8:12] = pack("<I", int(io.getbuffer().nbytes-12+64))
io.getbuffer()[-1] = 32
io.write(b" "*64)
io.getbuffer()[-1] = 10
Member

I think you should add an argument to write_array_header_2_0 to encapsulate this, either as:

  • min_header_size, which sets the minimum header padding size
  • extra_header_padding, which augments the header with extra padding.

@xor2k (Contributor Author) Nov 21, 2021

I think extra_header_padding is what we need. The question is whether to make it an argument or just add 64 bytes by default, since, as mentioned, more energy would be needed to populate such an array than would be necessary to boil earth's oceans. In a distant future, when our descendants eventually build a Dyson swarm, this value may be increased to 128 bytes to account for all particles in the known universe.
Adding 64 bytes by default will increase the size of every .npy file by 64 bytes, but I cannot really judge how much of a problem that would be. Maybe none at all, maybe filesystem overhead adds 64 bytes anyway, maybe it would pose some issues to certain users who have millions/billions of .npy files lying around. What do you think?

Member

more energy would be needed to populate such an array than would be necessary to boil earth's oceans. In a distant future, when our descendants eventually build a dyson swarm, this value may be increased to 128 byte to account for all particles in the known universe.

I think you're thinking about this the wrong way. The length of an array is stored in an int64, so we're already limited to 2**63-1. The question is what the longest string is that is less than that size; for instance, (9223372036854775807,) is shorter than (10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10). On the other hand, I think the way you have appending set up means that only one dimension of the array can grow, in which case len(str(2**63-1)) - len(str(len(arr))) is the needed amount of padding
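For instance, the padding needed when only the first axis can grow would roughly be (a sketch of this suggestion; the helper name is illustrative):

def growth_padding(shape):
    # Reserve enough spare header bytes for the first (growth) axis to reach
    # the largest value an int64 can hold without the header string growing.
    return len(str(2**63 - 1)) - len(str(shape[0]))

growth_padding((100, 3))  # 19 - 3 = 16 extra bytes would suffice here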

Member

So perhaps write_array_header_2_0(..., extra_padding=True) and have write_array_header_2_0 deal with calculating how much padding is needed.

Contributor Author

I have chosen to try the -1 option, see below. Then we would not need to handle possible growing headers in the first place.

Contributor Author

That's true: the more dimensions the array has and the larger they are, the more bytes are covered by advancing one step along the first axis. So the worst case is an array with only one dimension of dtype int8 or so, and len(str(2**63-1)) (=19) would be an upper bound. That means we could also just take 19 bytes as a buffer instead of 64 bytes. I would propose to keep the code as simple as possible, so one could just add 19 bytes and not try to reduce that byte count by taking the influence of the other axes into account. For most arrays 19 spare bytes would probably not even make a difference in the file size, and even if they did, they would probably be absorbed by the filesystem overhead.
I'll do it in a follow-up pull request; first this pull request (which does not affect anything else) should go through (as far as I understand it).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, shouldn't it be 2^64-1? I mean size_t is a uint64_t in C (for 64 bit machines). This would increase the buffer from 19 to 20 bytes.


self.shape[0] += arr.shape[0]

arr.tofile(self.fp)
Member

This doesn't quite feel safe to me; you can't guarantee in python that the __del__ or close method is actually called, and you can risk leaving behind a silently truncated file without any record that it got truncated.

Should we have a "dirty" flag in the header, which gets cleared by close() such that the user can at least detect when they open a file that's broken in this way?

@xor2k (Contributor Author) Nov 21, 2021

This issue used to exist before as well, also in numpy.save: if for some reason the program crashes before the file is finished, its header and content will be in an inconsistent state. This was also an issue in all previous versions of AppendArray but may be more pronounced here.

Did I get you right that you propose to add a dirty flag to the .npy file format? I was trying to add this functionality without introducing a new NumPy file format version. However, I already had some ideas for a follow-up pull request, which would allow -1 to be specified as shape[0], so that the size of the array could be inferred from the file size. This would be consistent with e.g. numpy.reshape. Would that eliminate the need for a dirty flag? What do you think?

Member

So, here are the options as I see them:

  • Rewrite the header after every append call. This probably slows things down a bit as you end up jumping back and forth in the file. This at least means that as long as append itself is not interrupted, the user knows their data is safe. Perhaps we should profile it.
  • Write -1 to the header while the AppendArray is open, then replace it with the final size when .close() runs normally. This is essentially the dirty flag I was suggesting, without needing any new header space. Old versions of numpy will not be able to open a file that is dirty in this way.
  • Add "dirty": True to the header dictionary while the AppendArray is open, and then remove it again

Contributor Author

Option 2, the -1 as shape[0] of the array, I meant in a different way: the -1 would basically stay there forever, the header would not need to be touched anymore and would also never grow. Having the -1 there indicates that the size of the array is determined by the file size. So when the array is loaded and the actual ndarray is constructed, the shape of the ndarray would not contain -1 but the actual size, while the .npy file header stays the same. This would allow .npy to be the digital version of CSV (just kidding, I mean "binary version of CSV" of course), where data is also simply appended to the file's end. Perfectly suited for binary log files.
There would not be any indication whether a write succeeded or failed, though. A partial, unsuccessful write could, however, potentially be identified if the file size (minus the header size) is not divisible by the product of the remaining shape entries times the item size. So potential errors need to be caught elsewhere in the file creation process.
All .npy files could theoretically have a shape[0] of -1, since it should be possible to derive the array size from the file size anyway.
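A sketch of that inference (the helper name and signature are illustrative, not part of this PR):

import numpy as np

def infer_growth_axis(file_size, header_length, dtype, shape_tail):
    # shape_tail holds the fixed trailing dimensions; shape[0] would be stored as -1.
    data_bytes = file_size - header_length
    row_bytes = np.dtype(dtype).itemsize * int(np.prod(shape_tail, dtype=np.int64))
    if data_bytes % row_bytes != 0:
        # leftover bytes: possibly a partial, interrupted write
        raise ValueError("trailing data is not a whole number of rows")
    return data_bytes // row_bytes

# e.g. for a float64 file with shape (-1, 3), each appended row occupies 24 bytes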
What do you think about this interpretation of option 2?

I cannot speak for fortran_order though. Does anybody have a clue?

Contributor Author

I have implemented option 1, "Rewrite the header after every append call" and then option 2 as described in a later commit. What do you think? I can revert to the option 1 commit if option 2 goes too far, as it slightly changes the .npy format.

Also, I made some assumptions about how the fortran part might work. If someone had a fortran array to actually test AppendArray, that would be great.

Comment on lines 24 to 25
def __init__(self, filename): None
def append(self, arr): None
Member

def func(self): None is not valid stub-file style (the convention is ...), so this will have to be updated:

- class AppendArray:
-     def __init__(self, filename): None
-     def append(self, arr): None
+ import os
+ from numpy.typing import NDArray
+
+ class AppendArray:
+     def __init__(self, filename: str | bytes | os.PathLike[str | bytes]) -> None: ...
+     def append(self, arr: NDArray[Any]) -> None: ...

Member

Secondly, a number of (public) methods and attributes are still absent from the stub file (__enter__, fp, etc.); could you add those as well?

Contributor Author

Yes, that would be great; I had simply forgotten it in a hurry. It looks much nicer (and more useful) with type annotations as well. Does it make sense to make fp public? Does that fit the NumPy style? Otherwise I'd simply make everything private except the special methods, append and close. How does the modification work technically? Will it change the pull request? Can I pull it back into my forked repository?

Contributor Author

I think I've figured it out: this is just a comment with diff syntax highlighting. Just changed it in my repo.

@xor2k (Contributor Author) commented Jan 9, 2022

@InessaPawson added the triaged label (Issue/PR that was discussed in a triage meeting) on Aug 24, 2022
@mattip (Member) commented Sep 8, 2022

@bsipocz pointed out in the latest triage meeting that #4987 (closed) had an alternative that is not complete and has no tests.

@rossbar (Contributor) commented Sep 8, 2022

Thanks @xor2k - would you be willing to potentially split this into multiple PRs? There seems to be general support for bumping the array standard, but adding the AppendArray class to the public API will require a more thorough look.

If you were to encapsulate the changes related to updating the .npy format version into its own PR, that would decouple the decision-making from expanding the API, and I suspect that PR would garner more review and go in quicker.

@xor2k force-pushed the main branch 2 times, most recently from 453e71f to 6a3f9a4 on September 8, 2022 20:24
@xor2k (Contributor Author) commented Sep 8, 2022

I have cleaned up my commits and reverted AppendArray to a state where it offers minimal functionality and tests. It should not interfere with anything else at this point and does not introduce a new .npy version.
Concerning a new .npy file format, a new idea came to my mind: maybe we don't need a new format after all, even if we want to append without rewriting the header. What one could do is simply (and generally) write the header only once the AppendArray is finished. If this does not happen, the file will have trailing data. Then an argument like recover_trailing_content could be added to numpy.load; in that case, the size of the array would again be inferred from the file size. This is similar to what @eric-wieser proposed with "dirty": True, just without the flag itself. Advantages:

  1. No new .npy file format necessary
  2. The file shape could be used for validation purposes, e.g. to make sure a file is complete. Incomplete files regularly occur, e.g. when transferring data over a network, so it is a pretty common issue.
  3. If an incomplete file is expected (like a binary log file from a (crashed) program), the data can still be used.
  4. The interface would be straightforward and simple: no need to distinguish whether or not the data byte count is divisible by the array shape. Also, one would not need to specify whether header writes should happen after every append or not.
  5. Since recover_trailing_content only checks the file size, it would be just as efficient as the shape=(-1, ...) solution I proposed at some point. I would however still call it recover_trailing_content instead of infer_array_size_from_filesize or so, as the latter is more difficult to understand and also wrong (since some sort of recovery actually happens).

Disadvantages:

  1. No indication whether the array-creating program crashed (which would be covered by "dirty": True; however, this can still be added at a later stage).
  2. Users would explicitly need to specify they want to recover trailing content. However, a warning can be issued if the user tries to load a file with trailing content and did not specify recover_trailing_content.

What do you think?

@xor2k (Contributor Author) commented Sep 8, 2022

Also, if we go for the recover_trailing_content solution mentioned above, I would suggest adding the 64 bytes of spare space in the header to every .npy array by default in the future, which would remove one case distinction in the code (whether or not to add spare space). 64 bytes should not be an issue anyway, as the filesystem overhead is probably larger.

@xor2k requested review from BvB93 and eric-wieser and removed the request for eric-wieser and BvB93 on September 10, 2022 13:00
@xor2k (Contributor Author) commented Sep 10, 2022

So how do we continue? Right now I have a version of AppendArray which does not affect anything else. I think follow-up pull requests should be based on this one (and modify it), so can this be handled with multiple (concurrent) pull requests, or am I getting something wrong?

@seberg (Member) commented Sep 21, 2022

I am myself a bit short of a strong enough opinion to actually figure out how to push this forward.

The current state (from my point of view), is:

  1. It would be nice to decide that we want this feature in NumPy proper. If that is certain that clarifies things.
  2. There seem to be two approaches:
    • Insert padding (requires no version increment, but means anticipating the array growing).
    • Use -1 and infer the size from the file length; this requires adjusting the .npy format, but that is easy. You might not realize it if you did not copy a file fully, but maybe you should be using a checksum in that case anyway.
      Do we have a clear preference for one yet? Both seem fairly fine, I guess the second makes the code a bit simpler.
  3. Did we decide on whether we want the object-oriented approach, or whether a functional approach with np.save(..., mode="a") might also work?

@rkern I don't want to snipe you, but do you have any quick opinion on evolving npy or having the feature in NumPy?

@rkern (Member) commented Sep 21, 2022

I am mostly -1 on extending the format to allow for implicit lengths. The format has been used in the past as a component in ad hoc file formats that concatenate multiple NPY blocks, so any kind of inference would be wrong in those cases. The input filelike object might well be a non-seekable stream that prevents us from actually inferring the full size without going through an application-specific side-channel.

I'd be happy for the extra header padding to be folded into _wrap_header(), though, for all NPY files. That will allow one to append to NPY files after they are initially created by append-agnostic code that is just using np.save(). I'm comfortable leaving AppendArray to third-party libraries in that case.

@seberg (Member) commented Sep 22, 2022

Thanks Robert!

It sounds like there is a clear next step now: Fold in the extra padding, because we can just do that.
If there is a bigger desire to add an incremental writer in NumPy itself that can still be done. Right now it feels like that momentum is missing (although maybe the parties who need it most are just not taking part in the discussion).

@rkern (Member) commented Sep 22, 2022

BTW, for the extra padding, just adding 64 bytes in _wrap_header() is going to cause some issues later in the appending process as the append-axis size grows larger. Applied naively, it would just mean that 64 bytes are added to the larger header size, creating a new header that's larger than the old one. You'll have to handle that case explicitly.

A more advanced approach would be to check the current number of digits in the append-axis size's string representation, and only add enough padding to ensure that 20 digits (len(str(1<<64))) can fit. That will take a little refactoring of _write_array_header() and _wrap_header() since the padding is computed in the latter, but it only gets the string representation of the header. But quite doable.
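Roughly, that could look like this (a sketch only; the actual change has to live in _write_array_header()/_wrap_header(), and GROWTH_AXIS_MAX_DIGITS is an illustrative name):

GROWTH_AXIS_MAX_DIGITS = 21  # enough digits for 1 << 64, with one to spare

def _growth_padding(shape):
    # Extra spaces so the header keeps its length while the first axis grows
    # up to the largest representable size.
    if not shape:
        return 0
    return max(0, GROWTH_AXIS_MAX_DIGITS - len(repr(shape[0])))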

@xor2k force-pushed the main branch 2 times, most recently from 20add74 to f22a6ca on September 24, 2022 19:36
@@ -431,6 +433,19 @@ def _write_array_header(fp, d, version=None):
header.append("'%s': %s, " % (key, repr(value)))
header.append("}")
header = "".join(header)

# Check if d is actual array header data or just some random dict as e.g.
# in test_large_header. In the first case, add spare space for growth
Member

I'd rather fix test_large_header() instead.

Contributor Author

Yes, I thought this might be an option too; this is why I've mentioned it in the comment. I'll have a look into it (the next push won't fix it, just minor cleanup).

@xor2k (Contributor Author) Sep 24, 2022

Just so I understand correctly: fixing test_large_header() implies that _write_array_header() can assume the header dict is well-formed, which implies it must do the same sanity checks as e.g. _read_array_header(), which in turn implies that shape and fortran_order are set (otherwise it would not make sense to modify test_large_header() and _write_array_header() in the first place).
Enabling sanity checks for _write_array_header() would require the tests test_large_header(), test_bad_header() and test_metadata_dtype() to be modified as well. While the first one is straightforward, the second is a little ugly, slightly limits the readability of the test case and could make it useless in case of future format modifications, and the third is a little challenging because _write_array_header() can then no longer produce certain kinds of invalid headers.
I've pushed with test_metadata_dtype() failing.
I see a few options to resolve the situation:

  1. Revert to the old _write_array_header() variant that only applies spare space if the header dict is nice enough
  2. Fix test_metadata_dtype() in a way that assumes that if a test fails, either writing or reading an array can fail, as they share a portion of their code. However, if this is no longer the case in the future, or there is an issue in np.save(), this could render the test useless.
  3. Introduce a context from contextlib for _write_array_header() to indicate that sanity checks should be skipped. However, I don't know if this is a thing in NumPy or how to properly implement it.
  4. Introduce a boolean flag that would allow writing invalid NumPy headers. Probably option 3 is less ugly.

What do you think?

Contributor Author

I think I might just have found a middle ground by skipping the sanity check in _write_array_header(), only assuming that shape and fortran_order are set, and modifying the tests so that this works out.

Contributor Author

I think this could be it, what do you think?

@xor2k force-pushed the main branch 5 times, most recently from 77111db to 1a8bec9 on September 24, 2022 22:56
@@ -934,6 +936,26 @@ def test_unicode_field_names(tmpdir):
with assert_warns(UserWarning):
format.write_array(f, arr, version=None)

def test_header_growth_axis():
import io
Member

There's already a from io import BytesIO import at the top level. No need for a local import.

Contributor Author

Done


fp = io.BytesIO()
format.write_array_header_1_0(
fp, format.header_data_from_array_1_0(arr)
Member

I would recommend constructing test dicts directly instead of grabbing them from a real array. Then you will be more able to test arbitrary lengths of the growth axis, and thus test the important property of the feature introduced here: the invariance of the header length under such changes.
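For example, something in this direction (a sketch; it assumes the growth padding introduced in this PR is in place, and the test name is illustrative):

from io import BytesIO
from numpy.lib import format

def test_header_length_invariant_under_growth():
    # Construct header dicts directly so arbitrary growth-axis lengths can be
    # tested without allocating huge arrays.
    lengths = set()
    for n in (1, 1000, 10**9, 2**63 - 1):
        fp = BytesIO()
        format.write_array_header_1_0(
            fp, {'shape': (n, 3), 'fortran_order': False, 'descr': '<f8'}
        )
        lengths.add(len(fp.getvalue()))
    assert len(lengths) == 1  # header size must not depend on the growth-axis length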

Contributor Author

That's a great idea, just implemented this.

@xor2k (Contributor Author) commented Sep 25, 2022

Help!!! I did not want to close the pull request. Did a force push with the last commit missing 🤦

@xor2k xor2k reopened this Sep 25, 2022
@xor2k (Contributor Author) commented Sep 25, 2022

Ah, got it opened again 😌

@seberg (Member) left a comment

Thanks, I think we can add this. Just some smaller comments. The other question is whether the .npy format NEP text should include a mention of this possibility. However, I don't think it needs to.

@xor2k force-pushed the main branch 2 times, most recently from 2c7f73f to d0e89d2 on September 26, 2022 20:24
@xor2k changed the title from "ENH: add functionality NpyAppendArray to numpy.format." to "ENH: make .npy files appendable" on Sep 26, 2022
@seberg (Member) left a comment

It sounds like we have settled and the change is now simple enough.

I do not have an opinion on how much space we add (I don't think it really matters). In practice 19 seems enough, 20 if you look at it unsigned, 21 if you want to be overly conservative :).

If nobody else voices an opinion, I am planning to merge it as-is soon.

@seberg changed the title from "ENH: make .npy files appendable" to "ENH: allow NumPy created .npy files to be appended in-place" on Sep 29, 2022
@seberg changed the title from "ENH: allow NumPy created .npy files to be appended in-place" to "ENH: enable others to append to NumPy .npy files in-place" on Sep 29, 2022
@xor2k (Contributor Author) commented Sep 29, 2022

Thanks everyone, the solution now looks more thought-through and cleaner than ever before!
I must admit, though, that I've just figured out that I need to access some underscore functions to append to arrays properly with an external library. So maybe I will make some follow-up pull requests on this topic at some point in the future; just wanted to warn in advance. I think this pull request is complete though. However, can we rename it back to "ENH: allow NumPy created .npy files to be appended in-place"? @seberg

@seberg changed the title from "ENH: enable others to append to NumPy .npy files in-place" back to "ENH: allow NumPy created .npy files to be appended in-place" on Oct 5, 2022
@seberg (Member) commented Oct 5, 2022

Thanks @xor2k, let's put this in. If anyone thinks there should be anything followed up on here, please don't hesitate to ask for a change.

Labels: 01 - Enhancement, triaged (Issue/PR that was discussed in a triage meeting)