Description
Currently, dump_npz either destroys all existing arrays in an npz file or appends arrays whose names already exist in the file as duplicated entries. These are rather strange semantics, especially since loading individual arrays with load_npz doesn't follow them.
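Since an npz file is just a zip archive of .npy members, the duplicate-entry behavior can be reproduced with Python's zipfile alone. This is a minimal sketch (the byte strings stand in for real .npy payloads):

```python
import io
import zipfile

# Appending a member whose name already exists in the archive adds a
# duplicate entry instead of replacing the old one.
buf = io.BytesIO()
with zipfile.ZipFile(buf, mode="w") as zf:
    zf.writestr("a.npy", b"old payload")
with zipfile.ZipFile(buf, mode="a") as zf:
    zf.writestr("a.npy", b"new payload")  # duplicate, not a replacement

with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    data = zf.read("a.npy")  # Python's zipfile happens to resolve to the last entry

print(names)  # ['a.npy', 'a.npy'] -- both entries survive in the archive
```

At the zip-format level, which copy a reader returns for the duplicated name is unspecified, which is exactly why the semantics are confusing for load_npz.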
More reasonable semantics would be to replace existing arrays. This requires a context object, with a role similar to HighFive::File.
NumPy doesn't support this either; it only overwrites all arrays at once.
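For comparison, NumPy's overwrite-everything behavior is easy to see with np.savez (a small sketch):

```python
import os
import tempfile
import numpy as np

# np.savez has no update mode: every call rewrites the whole archive,
# so the second call below discards the array saved by the first.
path = os.path.join(tempfile.mkdtemp(), "data.npz")
np.savez(path, a=np.arange(3))
np.savez(path, b=np.ones(2))

with np.load(path) as npz:
    files = sorted(npz.files)

print(files)  # ['b'] -- 'a' was lost by the second savez
```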
I checked out libzippp and libzip++; neither works with streams. This is primarily because dump_npy_stream and libzip are both "push" interfaces (thus both need to drive the main loop), so they cannot work together without writing a special stream class that serves as a pipe, letting libzip pull data from it...
I think there is a rather simple way to support replace semantics given a context object. When an npz_file is opened for update, append as usual while keeping the central directory as a data structure in memory. When closing, write a temporary file containing only the up-to-date arrays, finish it with the central directory, and atomically move it over the old file. (It's possible to shrink a file in place, but I guess that would invalidate too many I/O buffers.)