Serialization of sparse arrays to zarr, based on a codec.
Unlike binsparse-python, the different 1D arrays (data, coordinate arrays, compressed coordinate arrays) are stored in a shard-like structure, per chunk.
This does make reading specific parts (e.g. the coordinates) in a single request a bit harder, but having a single logical array map to a on-disk zarr array does have its advantages.
zarr-sparse currently requires a special version of zarr. To install it, use:
pip install \
"zarr @ git+https://github.com/keewis/zarr-python.git@zarr-sparse-patch" \
"zarr-sparse @ git+https://github.com/keewis/zarr-sparse.git@main"from zarr_sparse import SparseArrayCodec
import numpy as np
import sparse
import zarr
def generate_random_coo(nnz, shape, dtype, fill_value):
rng = np.random.default_rng(seed=0)
data = rng.random(size=nnz).astype(dtype)
coords = np.stack([rng.integers(dim_size, size=nnz) for dim_size in shape], axis=0)
return sparse.COO(data=data, coords=coords, shape=shape, fill_value=fill_value)
x = generate_random_coo(
nnz=4500, shape=(4500, 6500), dtype="float64", fill_value=np.nan
)
chunks = (500, 500)
with zarr.storage.MemoryStore() as store:
root = zarr.api.synchronous.create_group(store=store, zarr_format=3)
z = root.create_array(
"a",
data=x,
fill_value=x.fill_value,
write_data=True,
chunks=chunks,
serializer=SparseArrayCodec(),
filters=None,
compressors=None,
dimension_names=["x", "y"],
)
print(z[:])