A file utility library that provides a unified, simple interface for accessing both local and remote files. It can be used behind other APIs that need to access files regardless of where they are located.
cached-path requires Python 3.7 or later.
cached-path is available on PyPI. Just run:
pip install cached-path
To install cached-path from source, first clone the repository:
git clone https://github.com/allenai/cached_path.git
cd cached_path
Then run:
pip install -e .
from cached_path import cached_path
Given something that might be a URL or local path, cached_path() determines which.
If it's a remote resource, it downloads the file and caches it to the cache directory, and
then returns the path to the cached file. If it's already a local path,
it makes sure the file exists and returns the path.
For URLs, http://, https://, s3:// (AWS S3), gs:// (Google Cloud Storage), and hf:// (HuggingFace Hub) are all supported out-of-the-box.
Optionally, beaker:// URLs of the form beaker://{user_name}/{dataset_name}/{file_path} are also supported; this requires beaker-py to be installed.
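To make the URL-versus-local-path distinction concrete, here is a minimal sketch of how such a check could be written with the standard library. It is not cached-path's actual implementation; the helper name is_remote and the REMOTE_SCHEMES set are hypothetical and simply mirror the schemes listed above.

```python
from urllib.parse import urlparse

# Schemes listed above as supported; beaker:// additionally needs beaker-py.
# This set is an illustrative assumption, not taken from the library's source.
REMOTE_SCHEMES = {"http", "https", "s3", "gs", "hf", "beaker"}


def is_remote(url_or_filename: str) -> bool:
    """Return True if the string starts with one of the known remote schemes."""
    # Matching against a fixed set avoids treating a Windows drive letter
    # such as "C:" as a URL scheme.
    return urlparse(url_or_filename).scheme in REMOTE_SCHEMES


for candidate in ("https://example.com/data.txt", "/tmp/data.txt"):
    kind = "remote" if is_remote(candidate) else "local"
    print(f"{candidate} -> {kind}")
```

Anything that does not match a known scheme is treated as a local path in this sketch, which is the case where cached_path() only checks that the file exists.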
For example, to download the PyTorch weights for the model epwalsh/bert-xsmall-dummy
on HuggingFace, you could do:
cached_path("hf://epwalsh/bert-xsmall-dummy/pytorch_model.bin")
For paths or URLs that point to a tarfile or zipfile, you can also add a path
to a specific file to the url_or_filename preceded by a "!", and the archive will
be automatically extracted (provided you set extract_archive to True),
returning the local path to the specific file. For example:
cached_path("model.tar.gz!weights.th", extract_archive=True)
You can provide custom headers for HTTP requests, which is useful for accessing private resources that require authentication:
# Using an API token for private resources (e.g. Hugging Face)
headers = {"Authorization": f"Bearer {hf_token}"}
cached_path("https://huggingface.co/api/models/private-model/resolve/main/model.bin", headers=headers)
This is particularly useful for downloading private files from services like Hugging Face, GitHub, or any other API that uses Bearer token authentication.
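The snippet above assumes an hf_token variable already exists. One way to supply it without hard-coding a secret is to read it from the environment. The following is a minimal sketch; the helper name bearer_headers and the choice of the HF_TOKEN environment variable are assumptions for illustration, not part of cached-path.

```python
import os
from typing import Dict, Mapping, Optional


def bearer_headers(
    env_var: str = "HF_TOKEN", env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Build an Authorization header from a token stored in the environment."""
    env = os.environ if env is None else env
    token = env.get(env_var)
    if not token:
        # Fail early instead of sending a request with an empty token.
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return {"Authorization": f"Bearer {token}"}
```

The resulting dictionary can then be passed as the headers argument, for example cached_path(url, headers=bearer_headers()).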
By default the cache directory is ~/.cache/cached_path/. However, there are several ways to override this setting:
- set the environment variable CACHED_PATH_CACHE_ROOT,
- call set_cache_dir(), or
- set the cache_dir argument each time you call cached_path().
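The sketch below illustrates how these three overrides could interact. It is not the library's code: the function resolve_cache_dir and its global_override parameter (standing in for a value set via set_cache_dir()) are hypothetical, and the precedence shown (per-call argument, then global setting, then environment variable, then default) is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Union

# Default location described above.
DEFAULT_CACHE_DIR = Path.home() / ".cache" / "cached_path"


def resolve_cache_dir(
    cache_dir: Optional[Union[str, Path]] = None,
    global_override: Optional[Union[str, Path]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Pick a cache directory, with the most specific setting winning (assumed order)."""
    env = os.environ if env is None else env
    if cache_dir is not None:          # per-call cache_dir argument
        return Path(cache_dir)
    if global_override is not None:    # stand-in for a set_cache_dir() value
        return Path(global_override)
    if "CACHED_PATH_CACHE_ROOT" in env:  # environment variable
        return Path(env["CACHED_PATH_CACHE_ROOT"])
    return DEFAULT_CACHE_DIR
```

With no overrides, resolve_cache_dir(env={}) returns the default directory; passing cache_dir takes priority over everything else in this sketch.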
cached-path is developed and maintained by the AllenNLP team, backed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering. To learn more about who specifically contributed to this codebase, see our contributors page.
cached-path is licensed under Apache 2.0. A full copy of the license can be found on GitHub.