zeitcache

Stupid-fast functional-flavored caching for xarray pipelines

Introduction

zeitcache is a wrapper for xarray computations that automatically stores their results and restores them on later calls, saving compute. It is especially useful for the following workflows:

Reproducible scientific computing code, so you can write idiomatic code without having to worry about performance
Rapid development, where there's no time for something more complicated like snakemake or prefect
Improving performance of preexisting code, as zeitcache fits in transparently
Situations where expensive reductions are commonplace in the code

If you have a DataArray, a pure (side-effect-free) function to apply to it, and want to cut down on compute in the simplest way possible, zeitcache might be the library for you. It is similar to joblib, but more heavily optimized for scientific computing workflows, and it relies on properties specific to xarray to make that possible.

Utilization

Simply take a call like this:

dataset = dataset.mean(dim=('lat', 'lon', 'time'))

And rewrite it as this:

from zeitcache import zeitcache

@zeitcache("my_dataset")
def reduction_simple(ds):
    return ds.mean(dim=('lat', 'lon', 'time'))

dataset = reduction_simple(dataset)

Just like that, you now have automatic caching. You can also do something more imperative, if that's your style:

from zeitcache import zeitforce  # assuming it is exported alongside zeitcache

def reduction_simple(ds):
    return ds.mean(dim=('lat', 'lon', 'time'))

dataset = zeitforce("my_dataset", dataset, reduction_simple)
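For intuition, here is a rough sketch of the idea behind a name-keyed cache decorator. It is not zeitcache's implementation: the `cached` decorator, the `CACHE_DIR` location, and the use of pickle are all assumptions made for illustration, and it works on plain Python lists rather than xarray objects so that it runs with the standard library alone.

```python
import pickle
import tempfile
from functools import wraps
from pathlib import Path

# Hypothetical cache location for this sketch; zeitcache's real layout may differ.
CACHE_DIR = Path(tempfile.mkdtemp(prefix="cache_sketch_"))


def cached(name):
    """Cache a function's result on disk under `name` (illustrative only)."""
    def decorator(func):
        @wraps(func)
        def wrapper(data):
            path = CACHE_DIR / f"{name}.pkl"
            if path.exists():
                # Cache hit: restore the stored result and skip the computation.
                with path.open("rb") as fh:
                    return pickle.load(fh)
            # Cache miss: compute, store, then return.
            result = func(data)
            with path.open("wb") as fh:
                pickle.dump(result, fh)
            return result
        return wrapper
    return decorator


calls = []  # records every real (uncached) computation


@cached("my_dataset")
def reduction_simple(values):
    calls.append(1)
    return sum(values) / len(values)


first = reduction_simple([1.0, 2.0, 3.0])   # computed and written to disk
second = reduction_simple([1.0, 2.0, 3.0])  # restored from disk
```

Because this sketch looks results up by name alone, reusing a name for different data would silently return the wrong result, which is the kind of collision that unique names are meant to prevent.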

Alternatively, if you'd rather defer the caching, or want to map functions onto thunks later on (maybe functional programming is more your style), use zeitdelay:

from zeitcache import zeitdelay  # assuming it is exported alongside zeitcache

dataset_thunk = zeitdelay("my_dataset", dataset)
# some time later
def some_expensive_function(ds):
    ...
result = dataset_thunk(some_expensive_function)

Do note that this makes your code harder to read.
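If thunks are unfamiliar, the following minimal sketch shows the deferred-evaluation pattern on its own. The `delay` helper is hypothetical and leaves out caching entirely; it only illustrates that nothing runs until the thunk is handed a function.

```python
def delay(name, data):
    """Return a thunk that applies a function to `data` only when called."""
    def thunk(func):
        # Nothing has been computed before this point. A real implementation
        # would presumably consult a cache keyed on `name` here; this sketch
        # simply calls the function.
        return func(data)
    return thunk


log = []  # records when the expensive function actually runs


def expensive(values):
    log.append("ran")
    return max(values)


values_thunk = delay("my_values", [3, 1, 4])
ran_before_call = list(log)       # still empty: creating the thunk did no work
result = values_thunk(expensive)  # the work happens here
```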

Important: you must give each dataset a unique name, otherwise you risk collisions! zeitcache's hashing algorithm does not check the data itself, only its structure, so two different datasets with the same structure produce the same hash. Caching is correct if and only if every name is unique!
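To see why names matter, consider this sketch of a structure-only hash. It is an assumption about the general approach, not zeitcache's actual algorithm: `structural_key` is a hypothetical helper, and the choice of name, dims, shape, and dtype as inputs is a guess at what "structure" could include. The stand-in objects carry the same `dims`, `shape`, and `dtype` attributes that an xarray DataArray exposes.

```python
import hashlib
from types import SimpleNamespace


def structural_key(name, array):
    """Hash the name, dims, shape, and dtype of `array`, never its values."""
    description = repr(
        (name, tuple(array.dims), tuple(array.shape), str(array.dtype))
    )
    return hashlib.sha256(description.encode("utf-8")).hexdigest()


# Two stand-ins with identical structure but different values.
january = SimpleNamespace(
    dims=("lat", "lon"), shape=(2, 2), dtype="float64", values=[[1, 2], [3, 4]]
)
february = SimpleNamespace(
    dims=("lat", "lon"), shape=(2, 2), dtype="float64", values=[[9, 9], [9, 9]]
)

# Same name and same structure: the keys collide even though the data differ.
same_name_collides = (
    structural_key("temps", january) == structural_key("temps", february)
)

# Unique names keep the keys apart.
unique_names_differ = (
    structural_key("temps_jan", january) != structural_key("temps_feb", february)
)
```

A hash like this never reads the array values, so its cost does not grow with the size of the data, but the name is the only thing that distinguishes same-shaped datasets.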

Please see the docstrings for more information on how to use each function.

Future Work

These are roughly ordered from most to least important.

Add type hints throughout the code (will look ugly, but useful)
Allow users to pass an alternative hashing function
Ship a not-O(1) hashing function as an alternative
Warn users if there are two datasets with the same name but different hashes in the directory (so they aren't accidentally duping data)
Make the code even lazier internally
Decompress into RAM instead of onto disk, doing the ZSTD decompression in memory to decrease disk writes (need to check the amount of free memory first, though)
Add a small command line tool to automatically manage the global cache directory (we should pick one by default and stash different caches per "program", identified by their invocation path)
Statistics on cache hits / misses
Use Trio to asynchronously do compression so that we don't slow down running programs any more than needed
Make Zarr the default internal representation so that partial reads are possible, and then actually add partial read from RAM support
Support more types of compression algorithms for different needs

The Name

In German, "zeit" means time, and "cache" is the same thing as in English. That's what this software usually does: it caches stuff to save you time. A native speaker could also read it as "Zeitkasse", which means something like "time checkout" or "time cash register", and that's fitting too, since the cached data are things you can withdraw from later to save on time (besides, time is money, after all).

License

This code is MIT licensed. Please follow the terms of that license. Also, if you end up using this in published work, please cite it. Even though it's small, attribution helps justify continued development. See CITATION.cff for details.
