Pooch -- on the fly download of datasets from github #3945
Hello @hmaarrfk! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2020-03-12 12:46:07 UTC
pooch requires requests. Do we really want to depend on the whole HTTP stack?
Is the plan to have pooch as an optional dependency or a required one?
Not too sure, I think an optional dependency is nice for this case. I think it really depends on how complicated we want to make "installing scikit-image". Our optional requirements are all very heavy. Would we start to recommend everybody install
skimage/data/__init__.py
Outdated
@@ -32,18 +37,110 @@
'horse',
'hubble_deep_field',
'immunohistochemistry',
'lbp_frontal_face_cascade_filename',
Was this ever meant to be a public image? I kind of removed it from the API. It seemed to be important for tests only.
Does it appear in the `data` docs? If so, we should definitely keep it. Even if not, we probably want to keep it since it's been public for a while...
skimage/data/__init__.py
Outdated
'lfw_subset',
'logo',
'microaneurysms',
'moon',
'page',
'prefetch',
New function added to the API. Probably needs thinking and discussion.
Yep, `prefetch` is a bit jargony; my suggestion would be `download_all`.
skimage/data/__init__.py
Outdated
'lfw_subset',
'logo',
'microaneurysms',
'moon',
'page',
'prefetch',
'quantitative_phase_cells',
2 MB dataset.
skimage/data/__init__.py
Outdated
def quantitative_phase_cells():
    """Image of two cells retrieved from a digital hologram.
This docstring definitely needs work, but I wanted to get the framework down and find the issue with the framework, rather than format docstrings for something that may never happen.
version = __version__

# Create a new friend to manage your sample data storage
image_fetcher = pooch.create(
This should probably be prefixed with an `_` to indicate it is internal?
skimage/data/__init__.py
Outdated
    path=pooch.os_cache("scikit-image"),
    base_url=base_url,
    version=version,
    env="SCIKIT_IMAGE_DATA_DIR",
This will allow developers to use:
    export SCIKIT_IMAGE_DATA_DIR=/home/mark2/git/scikit-image/data
to avoid fetching online.
Definitely `SKIMAGE_DATADIR` or similar; the current env value is way too verbose (imho).
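The override behaviour discussed in this thread can be sketched without Pooch at all. This is a minimal illustration, assuming the environment variable simply takes precedence over the default cache location; `resolve_data_dir` and its parameters are hypothetical names, not part of scikit-image or Pooch.

```python
import os
from pathlib import Path


def resolve_data_dir(default_dir, env_var="SCIKIT_IMAGE_DATA_DIR", environ=None):
    """Return the directory that sample data should be read from.

    A non-empty value of the environment variable takes precedence over
    ``default_dir``. ``environ`` can be passed explicitly for testing.
    """
    environ = os.environ if environ is None else environ
    override = environ.get(env_var)
    if override:
        return Path(override).expanduser()
    return Path(default_dir).expanduser()
```

A developer who exports the variable to point at a local checkout would then read files from that checkout instead of the cache.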
skimage/data/__init__.py
Outdated
}
)

fetch = image_fetcher.fetch
This shortcut was really useful in migrating all the tests.
👍
skimage/data/__init__.py
Outdated
from ..io import imread, use_plugin
from .._shared._warnings import expected_warnings, warn
from ..util.dtype import img_as_bool
from ._binary_blobs import binary_blobs
from ._detect import lbp_frontal_face_cascade_filename

from .. import __version__

import os.path as osp
data_dir = osp.abspath(osp.dirname(__file__))
data_dir is basically useless now.
The only migration path I can see is to create some really hacky class, that implements the
    class _data_dir:
        def __init__(self, image_fetcher):
            self.image_fetcher = image_fetcher

        def __str__(self):
            warn(....)
            prefetch()
            return str(self.image_fetcher.path)

        def __add__(self, filename):
            if filename in self.image_fetcher.registry:
                warn(.....)
                return self.image_fetcher.fetch(filename)
            else:  # Globbing???
                prefetch()
                return str(self.image_fetcher.path) + filename

    data_dir = _data_dir(image_fetcher)
Why is it useless? You think we shouldn't write to that directory ever?
the usefulness of datadir has been restored
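For readers curious what a deprecation shim for `data_dir` could look like in runnable form, here is a small sketch using a stand-in fetcher. `FakeFetcher` and `DeprecatedDataDir` are hypothetical names, the warning text is invented for illustration, and the download-everything step from the pseudocode in this thread is left out.

```python
import warnings


class FakeFetcher:
    """Stand-in for a real fetcher; every name here is hypothetical."""

    def __init__(self, path, registry):
        self.path = path
        self.registry = registry

    def fetch(self, filename):
        # A real fetcher would download and verify the file here.
        return f"{self.path}/{filename}"


class DeprecatedDataDir:
    """String-like shim that warns whenever the old ``data_dir`` is used."""

    def __init__(self, fetcher):
        self._fetcher = fetcher

    def _warn(self):
        # stacklevel=3 points the warning at the code that used data_dir.
        warnings.warn(
            "data_dir is deprecated; use the functions in skimage.data",
            DeprecationWarning,
            stacklevel=3,
        )

    def __str__(self):
        self._warn()
        return str(self._fetcher.path)

    def __add__(self, suffix):
        self._warn()
        name = suffix.lstrip("/")
        if name in self._fetcher.registry:
            return self._fetcher.fetch(name)
        # Unknown names (for example glob patterns) fall back to plain
        # string concatenation.
        return str(self._fetcher.path) + suffix


data_dir = DeprecatedDataDir(FakeFetcher("/cache", {"coins.png": "abc123"}))
```

Existing code such as `data_dir + '/coins.png'` keeps working under this sketch, but each use emits a `DeprecationWarning`.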
skimage/io/tests/test_tifffile.py
Outdated
-expected = np.load(fetch('test/chessboard_GRAY_U8.npy'))
-with open(fetch('test/chessboard_GRAY_U16.tif'), 'rb') as fh:
+expected = np.load(fetch('tests/chessboard_GRAY_U8.npy'))
+with open(fetch('tests/chessboard_GRAY_U16.tif'), 'rb') as fh:
Why the renamed directory? So much line churn for an `s`...
Because the discrepancy with the regular tree would have driven me mad.
Especially since I would have been the one that introduced it.
@hmaarrfk thanks for making this happen! I agree that it's much easier to comment on a PR than in the abstract.
My main concern is with the existing datadir. I think probably the best thing to do is to deprecate it over several releases. Whether we do this with a hacky class or an import hook is a different question.
One thing I'm worried about is having an os-wide data dir, which means that multiple versions of skimage would share a data directory. I guess pooch can handle it if the files change between versions (re-download if the hashes don't match), but we should worry about putting something in the user guide to avoid user confusion here. OR, we can put the version string in the default directory to avoid such conflicts.
With regard to your comment about now depending on requests: it's indeed concerning, and probably the most concerning part of this approach. However, my feeling is that if we don't use pooch we will end up reengineering something much the same, but with the base http libraries, which would be even more painful. But, I'm certainly open to being convinced that it is a better approach for us. @scikit-image/core thoughts on this?
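The two safeguards mentioned in this comment (re-download when a hash does not match, and a version string in the cache path) can be sketched with the standard library only. This is an illustration under stated assumptions, not Pooch's implementation: `fetch_cached`, its parameters, and the `download` callable are all hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of the file at ``path``."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def fetch_cached(name, expected_hash, cache_root, version, download):
    """Return a local path for ``name``, downloading only when needed.

    ``download(name, target)`` is a caller-supplied callable (hypothetical)
    that writes the file to ``target``.
    """
    # Including the version in the path keeps installs of different
    # releases from sharing one directory.
    target = Path(cache_root) / version / name
    if target.exists() and sha256_of(target) == expected_hash:
        return target  # cache hit: nothing to download
    target.parent.mkdir(parents=True, exist_ok=True)
    download(name, target)
    if sha256_of(target) != expected_hash:
        raise ValueError(f"hash mismatch for {name}")
    return target
```

With this layout, a stale or modified file is replaced on the next fetch, and two releases never overwrite each other's data.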
requirements/default.txt
Outdated
@@ -5,3 +5,4 @@ networkx>=2.0
pillow>=4.3.0
imageio>=2.0.1
PyWavelets>=0.4.0
pooch>=0.2.1
This should go into `optional` or a new category, `recommended`...
skimage/data/__init__.py
Outdated
|
||
from ..io import imread, use_plugin | ||
from .._shared._warnings import expected_warnings, warn | ||
from ..util.dtype import img_as_bool | ||
from ._binary_blobs import binary_blobs | ||
from ._detect import lbp_frontal_face_cascade_filename | ||
|
||
from .. import __version__ | ||
|
||
import os.path as osp | ||
data_dir = osp.abspath(osp.dirname(__file__)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it useless? You think we shouldn't write to that directory ever?
skimage/data/__init__.py
Outdated
@@ -32,18 +37,110 @@ | |||
'horse', | |||
'hubble_deep_field', | |||
'immunohistochemistry', | |||
'lbp_frontal_face_cascade_filename', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it appear in the data
docs? If so we should definitely keep it. Even if not we probably want to keep it since it's been public for a while...
skimage/data/__init__.py
Outdated
'lfw_subset', | ||
'logo', | ||
'microaneurysms', | ||
'moon', | ||
'page', | ||
'prefetch', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, prefetch
is a bit jargony, my suggestion would be download_all
skimage/data/__init__.py
Outdated
'text',
'retina',
'rocket',
'stereo_motorcycle']


# Pooch expects a `+` to exist in development versions.
# Since scikit-image doesn't follow that convetion, we have to manually provide
# it with the URL and set the version to None
... Should we start to use that convention?
Not in this PR.
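To make the version convention in this thread concrete, here is a hedged sketch of how a package might decide which Git ref to download data from. `data_branch` is a hypothetical helper; the `"dev"` check and the `v`-prefixed tag name are assumptions for illustration, not behaviour confirmed by the PR.

```python
def data_branch(version, fallback="master"):
    """Pick the Git ref to download sample data from.

    Release versions map to a ``v``-prefixed tag (assumed naming); anything
    that looks like a development version falls back to ``fallback``.
    """
    # "+" marks a local/development version in the convention Pooch
    # expects; the "dev" check covers version strings without a "+".
    is_dev = "+" in version or "dev" in version
    return fallback if is_dev else f"v{version}"
```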
skimage/data/__init__.py
Outdated
fetch = image_fetcher.fetch


def prefetch():
I suggest `download_all`, and also a `directory=SKIMAGE_DATA_DIR` keyword argument, so that anyone can use this function without knowing about the magical environment variable.
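A `download_all` with an explicit `directory` argument, as suggested here, might look roughly like the following. This is a sketch only: the `registry` and `fetch` parameters are hypothetical stand-ins for whatever the real fetcher provides, and the function merged in scikit-image may differ.

```python
import shutil
from pathlib import Path


def download_all(registry, fetch, directory=None):
    """Fetch every file in ``registry``; optionally copy them to ``directory``.

    ``fetch(name)`` is a caller-supplied callable (hypothetical) that
    returns the local path of a cached file.
    """
    results = []
    for name in sorted(registry):
        cached = Path(fetch(name))
        if directory is None:
            results.append(cached)
            continue
        target = Path(directory).expanduser() / name
        # Registry names may contain subdirectories such as "tests/".
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(cached, target)
        results.append(target)
    return results
```

Passing the directory as an argument means a caller gets ordinary files in a known location without having to set any environment variable.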
# Create a new friend to manage your sample data storage
image_fetcher = pooch.create(
    path=pooch.os_cache("scikit-image"),
What is `pooch.os_cache` by default? Could you add it as a comment?
@@ -564,7 +564,7 @@ def test_is_low_contrast():
# =======================

def test_dask_histogram():
-    pytest.importorskip('dask', reason="dask python library is not installed")
+    pytest.importorskip('dask')
Can you explain this change?
Same change here.
skimage/io/_plugins/fits_plugin.py
Outdated
@@ -41,7 +41,7 @@ def imread(fname, dtype=None):
lazy loading) to get all the extensions at once.

"""
-if 'dtype' is not None:
+if dtype is not None:
Wow. LOL
Thanks for your fixup in a separate PR, just saw that.
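The bug fixed in this hunk is easy to reproduce in isolation: a string literal is never `None`, so the original condition was always true. The two functions below are illustrative only (their names are made up), and the literal is bound to a variable first so the snippet runs without Python's "is not with a literal" syntax warning.

```python
def buggy_check(dtype=None):
    # The string 'dtype' is a str object, never None, so this test
    # passes regardless of the argument that was passed in.
    literal = 'dtype'
    return literal is not None


def fixed_check(dtype=None):
    # Compare the argument itself.
    return dtype is not None
```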
skimage/io/collection.py
Outdated
->>> coll = io.ImageCollection(data_dir + '/chess*.png')
+>>> coll = io.ImageCollection([fetch('chessboard_GRAY.png'), fetch('coins.png')])
I feel like this is a problem, and maybe the hacky class is needed, if pooch can't handle glob expansions. Which would make sense.
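One way to keep glob-style selection without a real directory is to match the pattern against the registry's file names and fetch each hit. The sketch below uses the standard-library `fnmatch` module; `fetch_matching` and the `fetch` callable are hypothetical and not part of Pooch or scikit-image.

```python
import fnmatch


def fetch_matching(pattern, registry, fetch):
    """Fetch every registry entry whose name matches a shell-style pattern.

    ``registry`` is any iterable of file names and ``fetch(name)`` is a
    caller-supplied callable (hypothetical) returning a local path.
    """
    names = sorted(fnmatch.filter(registry, pattern))
    return [fetch(name) for name in names]
```

Note that `fnmatch` lets `*` match path separators as well, which differs from how `glob` treats directories.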
Yeah. Originally, I wanted to keep this PR small, as a way to stop the data from growing, but then I figured it wouldn't be hard to follow it through and see all the ramifications of the switch. I'm fine with reverting to a slower release too. We should have a mechanism to experiment with different technologies.
In relation to datadir: by using pooch, the user is no longer interacting with the directory itself and operating on specific files; they have to use the pooch object. I'm not sure it behaves like a full-fledged directory. If it did, it would be really cool; I'm thinking of something like Google Drive stream. By using pooch, we are really making the data available and taking away the directory structure.
This is worth investigating in-depth. My hope is that using |
On June 2, 2019 23:39:41 Juan Nunez-Iglesias ***@***.***> wrote:
> > In relation to datadir, by using pooch, the user is no longer interacting with the directory itself and operating on specific files, but they have to use the pooch object. I'm not sure it behaves like a full fledged directory,
> This is worth investigating in-depth. My hope is that using download_all will mean that users can interact with some directory on their hard drive as normal, if they so desire. (This also could minimize the impact on our tests, as we could run download-all before any testing, and then be fine as long as we redirect data_dir.)
I strongly support this point: files, real files with known locations. No weird objects, just files read as NumPy arrays.
Pooch claims to do this on a per-version level. Should be nice.
Our tests should be able to utilize the environment variable to avoid the need for any downloading. If we do want to keep the ability to interact with datadir as before, it might be worth developing a hacky class ourselves, or working with pooch on this issue. The one thing I will warn about is that pooch dropped 3.5 support. So we would either have to version them for a bit, at least until we drop 3.5
@leouieda How hard would it be to patch Pooch to be 3.5 compatible again?
@stefanv not too hard. There isn't really any patching needed to the code itself. I just checked on my machine and all tests still pass on 3.5. It's mainly that conda-forge stopped building 3.5 packages so we figured we could drop it. Do you have a timeline for scikit-image dropping 3.5? We could add it back temporarily and drop 3.5 support later on when scikit-image is ready to do so. The main challenge is keeping the CI running 3.5 but we could probably manage that without much hassle. Nowhere near the pain that is keeping 2.7 happy.
This should be possible without any changes. Pooch maintains a directory with the downloaded files (see the
By providing |
Cool thanks!
Unfortunately, we don't include a I think the current plan is to ship the datasets in the repo. It would be nice to also provide a secondary local cache in case users cloned the whole git repo. |
For sure. Our current code is very simplistic and could use some upgrading: https://github.com/fatiando/pooch/blob/master/pooch/utils.py#L78
Do you mean to be able to have released versions use one local cache folder (e.g. If so, this would probably be a bit more complicated refactor since the current implementation doesn't really know much about versions beyond appending them to the directory. I see two options that would work right now:
|
Cool, we can discuss it in Pooch's issue tracker. I think the current logic will work for us, for now, but it would be nice to be able to move it into Pooch.
It wouldn't even need to be that complicated:
|
@stefanv @jni I don't see why this is important. We want to teach people to analyze their data, not load our sample images with fancy reg-exps and file manipulations. None of the examples used the Data storage and organization seems beyond the goals of scikit-image. I think we should encourage people to use the functions for our sample data, enabling the addition of docstrings, online storage, and caching of data in a straightforward manner. I will very likely submit another PR cleaning up the data used in the examples to add functions with docstrings for all the used data. Once that is in, the examples will no longer even mention |
@hmaarrfk You wrote: "they have to use the pooch object". I don't want users to even have to know about Pooch. They should just know that they can either download the images to disk, or get them through our |
@@ -191,7 +190,7 @@ def plot_comparison(original, filtered, filter_name):
#*single-pixel wide skeleton*. It is important to note that this is
#performed on binary images only.

-horse = io.imread(os.path.join(data_dir, "horse.png"), as_gray=True)
+horse = horse()
Wouldn't it be preferable to have variable names != functions?
Most likely, but we should establish the bigger picture problems first.
@alexdesiqueira I briefly suggested the Open Science Framework (osf.io), but @hmaarrfk wasn't too happy about their URLs. Another option is to make a separate repo + gh-pages service with Cloudflare CDN. I have this set up for my blog (don't ask me how 😂), and it seems to work pretty well. That is at least as reliable as using our main GitHub repo, but it requires a bit of extra coordination.
Anyway, the failure seems to be rare enough, so I'm happy to merge and cross that bridge when we need to.
Why were the last builds cancelled?
@hmaarrfk I've noticed entire builds disappearing in other PRs. I think there must have been a glitch with Travis. At any rate, this passes locally. Given approvals by @stefanv and @emmanuelle, I'm inclined to squash and merge. Thoughts?
At some point, we must jump and trust the paragliding wing :)
As someone who's built parts of the paragliding wing, this makes me slightly nervous 🙂
The rebase messed up some Travis build stuff. Hopefully this one will work.
Lights are all green, let's pull the trigger.
LOL that merge hash 👆 is ominous! 😅 At any rate, this is amazing! We can start using real scientific datasets in our docs! 🎉
That's crazy! Super happy that this is in!
The project switched to pooch as the backend for data management in scikit-image/scikit-image#3945
* Replace os.path with pathlib
* Create data_dir only on actual download

  Even though skimage.data uses lazy loading, its submodule `_fetchers.py` is executed when `skimage` is imported, because its attribute `data_dir` is imported in multiple places [1]. Previously, this led to `_init_pooch()` always being executed, which in turn tried to create the data directory preemptively. This led to problems when `data_dir` wasn't writeable, e.g. when scikit-image is used in read-only containers. This refactoring alleviates this by postponing the directory creation until it is actually needed and data is being downloaded.

  Calling `download_all()` also ensures that legacy files are copied to `data_dir`; this use case was requested in [2] and should be preserved this way.

  With the previous and current state, the behavior is somewhat difficult to test with regard to self-contained tests and multi-processing / multi-threading. `_fetchers.py` should probably be refactored into a class whose API is less intertwined with global state; that might be easier to test.

  [1] https://github.com/scikit-image/scikit-image/blob/51225d16ddebffacd0ccd9a54d06a673e3caff98/skimage/__init__.py#L141
  [2] #3945 (comment)
* Fix typo in docstring
* Fix test errors due to pathlib refactoring

  Also, examples in `ImageCollection`'s docstring [1] indicate that this is part of our API, so revert `data_dir` back to being a string.
* Fix return types
* Revert refactoring to pathlib

  Long-term, I would think of an update from `os.path` to `pathlib` as removing technical debt, but it was a bad call in the context of this fix, especially because it had unintended side effects on our API.
* Ensure cache subdir exists
* Debug failure on macos-cp3.11
* Debug: Use absolute paths
* Debug: ignore legacy path

  multipage_rgb.tif is not in our distribution archives.
* Debug: remove codecov dependency

  Super strange, but suddenly it seems that codecov has disappeared from PyPI [1]...

  [1] https://pypi.org/project/codecov/
* Debug: use --showlocal for pytest
* Debug: test hashes and pure imread
* Remove missed codecov in pyproject.toml
* Debug check sorting
* Always try cache first when fetching datasets
* Remove debug test
* Remove copy_legacy_to_cache flag in _fetch

  Instead, it is now the task of `download_all(directory=...)` to place a copy of every data file in `directory` or, if not given, the default cache directory. This also addresses another previously undiscovered bug. Running

      import skimage as ski
      ski.data.download_all()
      ski.data.download_all(directory="example_dir")

  would not create anything in "example_dir" because `_fetch` would always return the cached entry before ever invoking Pooch's cache mechanism to place it in "example_dir".
* Use proper cache_dir in _fetch without pooch too
* Expand user in download_all

  Previously, running

      import skimage as ski
      ski.data.download_all("~/skimage-data")

  would place files at two locations: files in the distribution were placed in "[working_dir]/~/skimage-data" while files downloaded with pooch were placed in /home/[user]/skimage-data. I think this was because our old os.path machinery doesn't resolve ~ while pooch uses pathlib which does. To address this we make download_all explicitly expand the user if directory is given.

---------
Co-authored-by: Stefan van der Walt <[email protected]>
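The `~` problem described in the last commit item can be reproduced with the standard library alone. The sketch below assumes a POSIX system where `~` expands from the `HOME` environment variable; `normalize_directory` is a hypothetical helper, not the function used in scikit-image.

```python
import os


def normalize_directory(directory):
    """Expand a leading ``~`` before making the path absolute.

    Without the ``expanduser`` step, ``os.path.abspath`` treats ``~`` as an
    ordinary directory name below the current working directory.
    """
    return os.path.abspath(os.path.expanduser(directory))
```

Normalising the argument once, at the entry point, means every later code path writes to the same location.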
Description
This PR builds up the infrastructure for Pooch. It would allow users not to have to download the datasets through PyPI.
The main advantage would be to allow contributors to contribute larger datasets without massively increasing the size of the downloaded wheel.
xref: #3605 #3323 #3324
Proposed solution:
`data.cell`. `data_dir` remains this string-like object that points to a path on the user's local computer. It contains at least the minimum data that we choose to include.
The most important aspect is that it allows us to add new, larger datasets in the future, something that is impossible now.
Future enhancements:
`data` to avoid creating and hashing datasets when the user doesn't want to use sample data. This is currently being proposed for a particularly costly numpy import: "MAINT: Lazy import testing on python >=3.7", numpy/numpy#14097
Challenges encountered:
`data_dir`. The `data_dir` now contains a minimal subset of images that we can control.
Results
`python setup.py bdist_wheel` creates a 41 MB wheel archive.
The 27 MB breaks down into 2 important contributions:
From this, we may shave off somewhere close to 27 MB of the wheel, which would create archives on the order of 10-15 MB instead of 30-35 MB.
.@meeseeksdev backport to v0.14.x