Pooch -- on the fly download of datasets from github #3945

hmaarrfk · 2019-06-02T19:00:23Z

Description

This PR builds up the infrastructure for Pooch. It would allow users not to have to download the datasets through pypi.
The main advantage would be to allow contributors to contribute larger datasets without massively increasing the size of the downloaded wheel.

xref: #3605 #3323 #3324

Proposed solution:

The old datasets are considered legacy; the ability to load them is unaffected by this change.
Pooch is used to download future user facing datasets. The first one of these is data.cell.
Pooch is used to download test datasets.
Testing from a git development build will not require pooch.
Pooch is an optional dependency, not a core dependency.
Tests that need datasets are skipped, and don't fail.
Data is shipped in the sdist
Chosen design does not require any particular organization in the git repo.
data_dir remains this string-like object that points to a path on the user's local computer. It contains at least the minimum data that we choose to include.
If GitHub sets a throttling limit, we can publish a new patch release that fixes the issue for users.

The most important aspect is that it allows us to add new, larger datasets in the future, something that is impossible now.
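The "optional dependency" and "tests are skipped, not failed" points above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `has_pooch`, `requires_pooch`, and `TestDownloadedData` are hypothetical names, and the sketch uses the standard library's `unittest` for skipping rather than pytest.

```python
import importlib.util
import unittest


def has_pooch():
    """Return True if the optional dependency pooch can be imported."""
    return importlib.util.find_spec("pooch") is not None


# Hypothetical decorator: tests that need downloadable data are skipped,
# not failed, when pooch is missing.
requires_pooch = unittest.skipUnless(has_pooch(), "pooch is not installed")


class TestDownloadedData(unittest.TestCase):
    @requires_pooch
    def test_needs_remote_dataset(self):
        # Imported lazily, so merely collecting the test never needs pooch.
        import pooch

        self.assertTrue(hasattr(pooch, "create"))
```

With this shape, a development build without pooch still runs the rest of the test suite and reports the data-dependent tests as skipped.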

Future enhancements:

Lazy load data to avoid creating and hashing datasets when the user doesn't want to use sample data. This is currently being proposed for a particularly costly numpy import: "MAINT: Lazy import testing on python >=3.7" (numpy/numpy#14097).
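One way such lazy loading might look is a module-level `__getattr__` (PEP 562, Python >= 3.7), which defers expensive setup until an attribute is first accessed. The sketch below is an assumption about the general technique, not what this PR or the numpy PR implements; `make_lazy_module`, `build_registry`, and the placeholder registry entry are made up for illustration.

```python
import types


def make_lazy_module(name, loaders):
    """Build a module whose attributes are computed on first access.

    `loaders` maps attribute names to zero-argument callables.
    """
    module = types.ModuleType(name)

    def __getattr__(attr):
        # Called only when normal attribute lookup fails (PEP 562).
        try:
            loader = loaders[attr]
        except KeyError:
            raise AttributeError(
                f"module {name!r} has no attribute {attr!r}"
            ) from None
        value = loader()
        setattr(module, attr, value)  # cache, so the loader runs only once
        return value

    module.__getattr__ = __getattr__
    return module


calls = []


def build_registry():
    """Stand-in for the costly step (creating and hashing the registry)."""
    calls.append("built")
    return {"example.png": "<sha256 placeholder>"}


lazy_data = make_lazy_module("lazy_data", {"registry": build_registry})
```

Nothing is built when `lazy_data` is created; `build_registry` runs on the first `lazy_data.registry` access and its result is cached afterwards.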

Challenges encountered:

Somewhat breaks data_dir. The data_dir now contains a minimal subset of images, that we can control.
We had to hack in the use of the git directory as a secondary cache for pooch.
Depending on pooch is hard because it depends on a lot of the web stack. That is not really fun to depend on and may negate the benefits for many users.

Results

On master, python setup.py bdist_wheel creates a 41 MB wheel archive
On this branch, it creates a 27 MB archive.

The 27 MB breaks down into 2 important contributions:

The legacy datasets, approximately 7MB
Cython stuff that gets stripped away in our distributions by the wheel builder.

From this, we may shave off somewhere close to 27MB of the wheel, which would create archives on the order of 10-15MB instead of 30-35 MB.

Checklist

Docstrings for all functions
Gallery example in ./doc/examples (new features only)
Benchmark in ./benchmarks, if your changes aren't covered by an
existing benchmark
Unit tests
Clean style in the spirit of PEP8

For reviewers

Check that the PR title is short, concise, and will make sense 1 year
later.
Check that new functions are imported in corresponding __init__.py.
Check that new features, API changes, and deprecations are mentioned in
doc/release/release_dev.rst.
Consider backporting the PR with @meeseeksdev backport to v0.14.x

pep8speaks · 2019-06-02T19:00:26Z

Hello @hmaarrfk! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file skimage/data/__init__.py:

Line 61:1: E402 module level import not at top of file
Line 62:1: E402 module level import not at top of file
Line 139:5: E303 too many blank lines (2)

In the file skimage/data/_registry.py:

Line 43:80: E501 line too long (109 > 79 characters)
Line 44:80: E501 line too long (112 > 79 characters)
Line 45:80: E501 line too long (111 > 79 characters)
Line 46:80: E501 line too long (112 > 79 characters)
Line 47:80: E501 line too long (111 > 79 characters)
Line 48:80: E501 line too long (112 > 79 characters)
Line 49:80: E501 line too long (111 > 79 characters)
Line 50:80: E501 line too long (112 > 79 characters)
Line 51:80: E501 line too long (111 > 79 characters)
Line 52:80: E501 line too long (109 > 79 characters)
Line 53:80: E501 line too long (109 > 79 characters)
Line 54:80: E501 line too long (112 > 79 characters)
Line 55:80: E501 line too long (111 > 79 characters)
Line 56:80: E501 line too long (112 > 79 characters)
Line 57:80: E501 line too long (111 > 79 characters)
Line 58:80: E501 line too long (112 > 79 characters)
Line 59:80: E501 line too long (111 > 79 characters)
Line 60:80: E501 line too long (112 > 79 characters)
Line 61:80: E501 line too long (111 > 79 characters)
Line 62:80: E501 line too long (109 > 79 characters)
Line 63:80: E501 line too long (110 > 79 characters)
Line 64:80: E501 line too long (93 > 79 characters)
Line 65:80: E501 line too long (89 > 79 characters)
Line 66:80: E501 line too long (88 > 79 characters)
Line 67:80: E501 line too long (90 > 79 characters)
Line 68:80: E501 line too long (99 > 79 characters)
Line 69:80: E501 line too long (98 > 79 characters)
Line 70:80: E501 line too long (91 > 79 characters)
Line 71:80: E501 line too long (96 > 79 characters)
Line 72:80: E501 line too long (90 > 79 characters)
Line 73:80: E501 line too long (89 > 79 characters)
Line 74:80: E501 line too long (89 > 79 characters)
Line 75:80: E501 line too long (89 > 79 characters)
Line 76:80: E501 line too long (89 > 79 characters)
Line 77:80: E501 line too long (101 > 79 characters)
Line 78:80: E501 line too long (87 > 79 characters)
Line 79:80: E501 line too long (88 > 79 characters)
Line 80:80: E501 line too long (94 > 79 characters)
Line 81:80: E501 line too long (98 > 79 characters)
Line 82:80: E501 line too long (88 > 79 characters)
Line 83:80: E501 line too long (99 > 79 characters)
Line 84:80: E501 line too long (100 > 79 characters)
Line 85:80: E501 line too long (99 > 79 characters)
Line 86:80: E501 line too long (103 > 79 characters)
Line 87:80: E501 line too long (88 > 79 characters)
Line 88:80: E501 line too long (91 > 79 characters)
Line 89:80: E501 line too long (90 > 79 characters)
Line 90:80: E501 line too long (90 > 79 characters)
Line 91:80: E501 line too long (90 > 79 characters)
Line 92:80: E501 line too long (88 > 79 characters)
Line 93:80: E501 line too long (103 > 79 characters)
Line 94:80: E501 line too long (104 > 79 characters)
Line 95:80: E501 line too long (102 > 79 characters)
Line 96:80: E501 line too long (113 > 79 characters)
Line 97:80: E501 line too long (105 > 79 characters)
Line 98:80: E501 line too long (109 > 79 characters)
Line 99:80: E501 line too long (101 > 79 characters)
Line 100:80: E501 line too long (90 > 79 characters)
Line 101:80: E501 line too long (91 > 79 characters)
Line 102:80: E501 line too long (97 > 79 characters)
Line 103:80: E501 line too long (96 > 79 characters)
Line 104:80: E501 line too long (97 > 79 characters)
Line 105:80: E501 line too long (93 > 79 characters)
Line 106:80: E501 line too long (93 > 79 characters)
Line 107:80: E501 line too long (97 > 79 characters)
Line 108:80: E501 line too long (105 > 79 characters)
Line 109:80: E501 line too long (99 > 79 characters)
Line 110:80: E501 line too long (103 > 79 characters)
Line 111:80: E501 line too long (101 > 79 characters)
Line 112:80: E501 line too long (102 > 79 characters)
Line 113:80: E501 line too long (105 > 79 characters)
Line 114:80: E501 line too long (91 > 79 characters)
Line 115:80: E501 line too long (100 > 79 characters)
Line 116:80: E501 line too long (107 > 79 characters)
Line 117:80: E501 line too long (99 > 79 characters)
Line 118:80: E501 line too long (106 > 79 characters)
Line 119:80: E501 line too long (109 > 79 characters)
Line 120:80: E501 line too long (110 > 79 characters)
Line 121:80: E501 line too long (110 > 79 characters)
Line 122:80: E501 line too long (115 > 79 characters)
Line 123:80: E501 line too long (114 > 79 characters)
Line 124:80: E501 line too long (112 > 79 characters)
Line 125:80: E501 line too long (118 > 79 characters)
Line 126:80: E501 line too long (117 > 79 characters)
Line 127:80: E501 line too long (115 > 79 characters)
Line 128:80: E501 line too long (89 > 79 characters)
Line 132:80: E501 line too long (102 > 79 characters)

Comment last updated at 2020-03-12 12:46:07 UTC

hmaarrfk · 2019-06-02T20:07:10Z

pooch requires requests. Do we really want to depend on the whole HTTP stack?

emmanuelle · 2019-06-02T20:09:49Z

is the plan to have pooch as an optional dependency or a required one?

hmaarrfk · 2019-06-02T20:21:30Z

Not too sure, I think optional dependency is nice for this case. I think it really depends on how complicated we want to make "installing scikit-image"

Our optional requirements are all very heavy.

Would we start to recommend everybody install

pip install scikit-image[optional]

hmaarrfk · 2019-06-02T23:58:23Z

skimage/data/__init__.py

@@ -32,18 +37,110 @@
           'horse',
           'hubble_deep_field',
           'immunohistochemistry',
-           'lbp_frontal_face_cascade_filename',


was this ever meant to be a public image? I kinda removed this from the API. Seemed to be important for tests only.

Does it appear in the data docs? If so we should definitely keep it. Even if not we probably want to keep it since it's been public for a while...

hmaarrfk · 2019-06-02T23:58:44Z

skimage/data/__init__.py

           'lfw_subset',
           'logo',
           'microaneurysms',
           'moon',
           'page',
+           'prefetch',


new function added to the API. Probably needs thinking and discussion.

Yep, prefetch is a bit jargony, my suggestion would be download_all

hmaarrfk · 2019-06-02T23:58:52Z

skimage/data/__init__.py

           'lfw_subset',
           'logo',
           'microaneurysms',
           'moon',
           'page',
+           'prefetch',
+           'quantitative_phase_cells',


2MB data set.

hmaarrfk · 2019-06-02T23:59:35Z

skimage/data/__init__.py

+
+
+def quantitative_phase_cells():
+    """Image of two cells retrieved from a digital hologram.


This docstring definitely needs work, but I wanted to get the framework down and find the issue with the framework, rather than format docstrings for something that may never happen.

hmaarrfk · 2019-06-03T00:00:56Z

skimage/data/__init__.py

+    version = __version__
+
+# Create a new friend to manage your sample data storage
+image_fetcher = pooch.create(


this should probably be prefixed with an _ to indicate it is internal???

hmaarrfk · 2019-06-03T00:01:36Z

skimage/data/__init__.py

+    path=pooch.os_cache("scikit-image"),
+    base_url=base_url,
+    version=version,
+    env="SCIKIT_IMAGE_DATA_DIR",


This will allow developers to use:

export SCIKIT_IMAGE_DATA_DIR=/home/mark2/git/scikit-image/data

not to fetch online.

Definitely SKIMAGE_DATADIR or similar, the current env value is way too verbose (imho)
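The effect of the `env` argument discussed here (an environment variable that overrides the default cache location) can be approximated in a few lines. This is a sketch of the idea only, not pooch's implementation; `resolve_data_dir` is a hypothetical helper, and the variable name is the one from the diff above, which the review suggests shortening.

```python
import os
from pathlib import Path


def resolve_data_dir(default_dir, env_var="SCIKIT_IMAGE_DATA_DIR", environ=None):
    """Return the directory to use for sample data.

    If `env_var` is set to a non-empty value, it takes precedence over
    `default_dir`; otherwise `default_dir` is used.
    """
    environ = os.environ if environ is None else environ
    override = environ.get(env_var)
    chosen = override if override else default_dir
    return Path(chosen).expanduser()
```

A developer who exports the variable to point at a git checkout's data directory would then read files from there, while everyone else keeps the default cache.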

hmaarrfk · 2019-06-03T00:01:54Z

skimage/data/__init__.py

+    }
+)
+
+fetch = image_fetcher.fetch


This shortcut was really useful in migrating all the tests.

hmaarrfk · 2019-06-03T00:06:41Z

skimage/data/__init__.py


 from ..io import imread, use_plugin
 from .._shared._warnings import expected_warnings, warn
 from ..util.dtype import img_as_bool
 from ._binary_blobs import binary_blobs
 from ._detect import lbp_frontal_face_cascade_filename

+from .. import __version__
+
 import os.path as osp
 data_dir = osp.abspath(osp.dirname(__file__))


data_dir is basically useless now.

The only migration path I can see, is to create some really hacky class, that implements the

class _data_dir:
    def __init__(self, image_fetcher):
        self.image_fetcher = image_fetcher

    def __str__(self):
        warn(....)
        prefetch()
        return str(self.image_fetcher.path)

    def __add__(self, filename):
        if filename in self.image_fetcher.registry:
            warn(.....)
            return self.image_fetcher.fetch(filename)
        else:
            # Globbing???
            prefetch()
            return str(self.image_fetcher.path) + filename

data_dir = _data_dir(image_fetcher)

Why is it useless? You think we shouldn't write to that directory ever?

the usefulness of datadir has been restored

jni · 2019-06-03T01:26:39Z

skimage/io/tests/test_tifffile.py

-    expected = np.load(fetch('test/chessboard_GRAY_U8.npy'))
-    with open(fetch('test/chessboard_GRAY_U16.tif'), 'rb') as fh:
+    expected = np.load(fetch('tests/chessboard_GRAY_U8.npy'))
+    with open(fetch('tests/chessboard_GRAY_U16.tif'), 'rb') as fh:


Why the renamed directory? So much line churn for an s...

Because the discrepancy with the regular tree would have driven me mad.

Especially since I would have been the one that introduced it.

jni

@hmaarrfk thanks for making this happen! I agree that it's much easier to comment on a PR than in the abstract.

My main concern is with the existing datadir. I think probably the best thing to do is to deprecate it over several releases. Whether we do this with a hacky class or an import hook is a different question.

One thing I'm worried about is having an os-wide data dir, which means that multiple versions of skimage would share a data directory. I guess pooch can handle it if the files change between versions (re-download if the hashes don't match), but we should worry about putting something in the user guide to avoid user confusion here. OR, we can put the version string in the default directory to avoid such conflicts.

With regard to your comment about now depending on requests: it's indeed concerning, and probably the most concerning part of this approach. However, my feeling is that if we don't use pooch we will end up reengineering something much the same, but with the base http libraries, which would be even more painful. But, I'm certainly open to being convinced that it is a better approach for us. @scikit-image/core thoughts on this?
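The "re-download if the hashes don't match" behaviour mentioned above boils down to comparing a file's digest with a registry entry. The following standard-library sketch shows that check in isolation; `sha256_of` and `needs_download` are hypothetical names, and SHA-256 is an assumption about the hash algorithm rather than something stated in this thread.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path, expected_sha256):
    """True if the file is missing or its content differs from the registry."""
    path = Path(path)
    return not path.exists() or sha256_of(path) != expected_sha256
```

Two installed versions sharing one directory would then simply overwrite each other's stale files, which is the source of the possible user confusion raised here.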

jni · 2019-06-03T01:28:11Z

requirements/default.txt

@@ -5,3 +5,4 @@ networkx>=2.0
 pillow>=4.3.0
 imageio>=2.0.1
 PyWavelets>=0.4.0
+pooch>=0.2.1


This should go into optional or a new category, recommended...

jni · 2019-06-03T01:29:38Z

skimage/data/__init__.py


 from ..io import imread, use_plugin
 from .._shared._warnings import expected_warnings, warn
 from ..util.dtype import img_as_bool
 from ._binary_blobs import binary_blobs
 from ._detect import lbp_frontal_face_cascade_filename

+from .. import __version__
+
 import os.path as osp
 data_dir = osp.abspath(osp.dirname(__file__))


Why is it useless? You think we shouldn't write to that directory ever?

jni · 2019-06-03T01:30:20Z

skimage/data/__init__.py

@@ -32,18 +37,110 @@
           'horse',
           'hubble_deep_field',
           'immunohistochemistry',
-           'lbp_frontal_face_cascade_filename',


Does it appear in the data docs? If so we should definitely keep it. Even if not we probably want to keep it since it's been public for a while...

jni · 2019-06-03T01:31:10Z

skimage/data/__init__.py

           'lfw_subset',
           'logo',
           'microaneurysms',
           'moon',
           'page',
+           'prefetch',


Yep, prefetch is a bit jargony, my suggestion would be download_all

jni · 2019-06-03T01:31:32Z

skimage/data/__init__.py

           'text',
           'retina',
           'rocket',
           'stereo_motorcycle']


+# Pooch expects a `+` to exist in development versions.
+# Since scikit-image doesn't follow that convetion, we have to manually provide
+# it with the URL and set the version to None


... Should we start to use that convention?

Not in this PR.

jni · 2019-06-03T03:42:21Z

skimage/data/__init__.py

+fetch = image_fetcher.fetch
+
+
+def prefetch():


I suggest download_all, and also a directory=SKIMAGE_DATA_DIR keyword argument, so that anyone can use this function without knowing about the magical environment variable.

jni · 2019-06-03T03:43:30Z

skimage/data/__init__.py

+
+# Create a new friend to manage your sample data storage
+image_fetcher = pooch.create(
+    path=pooch.os_cache("scikit-image"),


What is pooch.os_cache by default? Could you add it as a comment?

jni · 2019-06-03T03:46:42Z

skimage/exposure/tests/test_exposure.py

@@ -564,7 +564,7 @@ def test_is_low_contrast():
 # =======================

 def test_dask_histogram():
-    pytest.importorskip('dask', reason="dask python library is not installed")
+    pytest.importorskip('dask')


Can you explain this change?

Same change here.

jni · 2019-06-03T03:53:51Z

skimage/io/_plugins/fits_plugin.py

@@ -41,7 +41,7 @@ def imread(fname, dtype=None):
    lazy loading) to get all the extensions at once.

    """
-    if 'dtype' is not None:
+    if dtype is not None:


Thanks for your fixup in a separate PR, just saw that.

jni · 2019-06-03T04:12:51Z

skimage/io/collection.py


-    >>> coll = io.ImageCollection(data_dir + '/chess*.png')
+    >>> coll = io.ImageCollection([fetch('chessboard_GRAY.png'), fetch('coins.png')])


I feel like this is a problem, and maybe the hacky class is needed, if pooch can't handle glob expansions. Which would make sense.
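For context on the glob concern: if the files exist in a real directory on disk (for example after a bulk download), a pattern such as `chess*.png` can still be expanded with ordinary globbing, with no special object needed. The helper below is a hypothetical illustration using only the standard library; it does not touch pooch or `ImageCollection`.

```python
import glob
import os


def matching_files(data_dir, pattern):
    """Expand a glob pattern inside data_dir and return sorted paths."""
    return sorted(glob.glob(os.path.join(data_dir, pattern)))
```

The pattern only matches files that have already been downloaded, which is why a download-everything step would have to run first.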

hmaarrfk · 2019-06-03T04:33:36Z

@hmaarrfk thanks for making this happen! I agree that it's much easier to comment on a PR than in the abstract.

Yeah. Originally, I wanted to keep this PR small and a way to stop the data from growing, but then I figured it wouldn't be hard to see where it goes to see all the ramifications of the switch.

I'm good to reverting to a slower release too. We should have a mechanism to experiment with different technologies.

Why is it useless? You think we shouldn't write to that directory ever?

In relation to datadir, by using pooch, the user is no longer interacting with the directory itself and operating on specific files, but they have to use the pooch object. I'm not sure it behaves like a full-fledged directory; if it did, it would be really cool, something like Google Drive stream. By using pooch, we are really making the data available, and taking away the directory structure.

jni · 2019-06-03T06:39:31Z

In relation to datadir, by using pooch, the user is no longer interacting with the directory itself and operating on specific files, but they have to use the pooch object. I'm not sure it behaves like a full-fledged directory,

This is worth investigating in-depth. My hope is that using download_all will mean that users can interact with some directory on their hard drive as normal, if they so desire. (This also could minimize the impact on our tests, as we could run download-all before any testing, and then be fine as long as we redirect data_dir.)

stefanv · 2019-06-03T07:20:00Z

On June 2, 2019 23:39:41 Juan Nunez-Iglesias wrote:

In relation to datadir, by using pooch, the user is no longer interacting with the directory itself and operating on specific files, but they have to use the pooch object. I'm not sure it behaves like a full-fledged directory,

This is worth investigating in-depth. My hope is that using download_all will mean that users can interact with some directory on their hard drive as normal, if they so desire. (This also could minimize the impact on our tests, as we could run download-all before any testing, and then be fine as long as we redirect data_dir.)

I strongly support this point: files, real files with known locations. No weird objects, just files read as NumPy arrays.

hmaarrfk · 2019-06-03T11:47:58Z

My hope is that using download_all will mean that users can interact with some directory on their hard drive as normal, if they so desire.

Pooch claims to do this at a per-version level. Should be nice.

This also could minimize the impact on our tests, as we could run download-all before any testing, and then be fine as long as we redirect data_dir.

Our tests should be able to utilize the environment variable to avoid the need for any downloading.

If we do want to keep the ability to interact with datadir as before, it might be worth developing a hacky class ourselves, or working with pooch on this issue.

The one thing I will warn about is that pooch dropped 3.5 support. So we would either have to version them for a bit, at least until we drop 3.5.

stefanv · 2019-06-03T20:36:30Z

@leouieda How hard would it be to patch Pooch to be 3.5 compatible again?

leouieda · 2019-06-05T00:41:19Z

@stefanv not too hard. There isn't really any patching needed in the code itself. I just checked on my machine and all tests still pass on 3.5. It's mainly that conda-forge stopped building 3.5 packages so we figured we could drop it. Do you have a timeline for scikit-image dropping 3.5?

We could add it back temporarily and drop 3.5 support later on when scikit-image is ready to do so. The main challenge is keeping the CI running 3.5 but we could probably manage that without much hassle. Nowhere near the pain that is keeping 2.7 happy.

leouieda · 2019-06-05T00:51:16Z

My hope is that using download_all will mean that users can interact with some directory on their hard drive as normal, if they so desire.

This should be possible without any changes. Pooch maintains a directory with the downloaded files (see the Pooch.abspath attribute). This can be mapped to datadir and all the data should be there after running a download_all function.

OR, we can put the version string in the default directory to avoid such conflicts.

By providing version to pooch.create the data directory already has the version in it. So different versions of scikit-image can coexist on the OS without clashing data files.

hmaarrfk · 2019-06-05T02:04:08Z

see the Pooch.abspath attribute

Cool thanks!

By providing version to pooch.create the data directory already has the version in it. So different versions of scikit-image can coexist on the OS without clashing data files.

Unfortunately, we don't include a + sign, but rather use the .dev[0-9]{0, 5} convention to denote development version numbers. Any way we can modify Pooch to add a custom regex to detect development versions?
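A custom development-version check of the kind asked for here might look like the sketch below. It treats a version as a development version if it contains either a `+` (the convention pooch expects, per the diff comment earlier in this thread) or a `.dev` suffix with up to five digits. The names `is_dev_version` and `data_subdir` are hypothetical, and `"master"` as the fallback directory name is an assumption. The quantifier is written `{0,5}` without a space, which is the form Python's `re` module expects.

```python
import re

# "+" marks a local/dev build in some schemes; ".devN" is the convention
# described in this comment.
_DEV_PATTERN = re.compile(r"\+|\.dev[0-9]{0,5}")


def is_dev_version(version):
    """Return True if the version string looks like a development version."""
    return _DEV_PATTERN.search(version) is not None


def data_subdir(version, dev_name="master"):
    """Pick the per-version data subdirectory name.

    Released versions get their own directory; all development versions
    share `dev_name`.
    """
    return dev_name if is_dev_version(version) else version
```

Under this scheme, two released versions never share a data directory, while every development build maps to the same one.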

I think the current plan is to ship the datasets in the repo. It would be nice to also provide a secondary local cache in case users cloned the whole git repo.

leouieda · 2019-06-05T02:16:10Z

Any way we can modify Pooch to add a custom regex to detect development versions?

For sure. Our current code is very simplistic and could use some upgrading: https://github.com/fatiando/pooch/blob/master/pooch/utils.py#L78

It would be nice to also provide a secondary local cache in case users cloned the whole git repo.

Do you mean to be able to have released versions use one local cache folder (e.g. ~/.cache/scikit-image/1.4.5) and the master branch use another (e.g. ~/src/scikit-image/data)?

If so, this would probably be a bit more complicated refactor since the current implementation doesn't really know much about versions beyond appending them to the directory. I see two options that would work right now:

Link or copy the cloned repository data to the local cache so it won't be downloaded
In the library, check if the version relates to the master branch and set a different data directory in that case.

hmaarrfk · 2019-06-05T02:25:31Z

For sure. Our current code is very simplistic and could use some upgrading: https://github.com/fatiando/pooch/blob/master/pooch/utils.py#L78

Cool, we can discuss it in Pooch's issue tracker. I think the current logic will work for us, for now, but it would be nice to be able to move it into Pooch.

Do you mean to be able to have released versions use one local cache folder (e.g. ~/.cache/scikit-image/1.4.5) and the master branch use another (e.g. ~/src/scikit-image/data)?

It wouldn't even need to be that complicated:

Check in the cache
Check in some hard coded directory. For scikit-image, this would be something like: (Path(__file__) / '../../data').resolve()
Fetch from the URL online like the regular logic does.
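The three-step lookup above can be sketched as a small function. This illustrates the proposed order only and is not pooch's API: `locate` is a hypothetical name, and `download` stands for any callable taking `(filename, destination)` and returning the path of the downloaded file.

```python
from pathlib import Path


def locate(filename, cache_dir, repo_data_dir, download):
    """Find a data file, preferring local copies over the network."""
    # 1. Check in the cache.
    cached = Path(cache_dir) / filename
    if cached.is_file():
        return cached
    # 2. Check in a hard-coded directory, e.g. the data folder of a git clone.
    in_repo = Path(repo_data_dir) / filename
    if in_repo.is_file():
        return in_repo
    # 3. Fall back to fetching from the URL, storing the result in the cache.
    return Path(download(filename, cached))
```

Because the network is only touched in step 3, a developer working from a full clone would never trigger a download.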

hmaarrfk · 2019-06-05T02:33:55Z

I strongly support this point: files, real files with known locations

@stefanv @jni I don't see why this is important

We want to teach people to analyze their data, not load our sample images with fancy reg-exps and file manipulations.

None of the examples used the directory functionality you both allude to.

Data storage and organization seems beyond the goals of scikit-image.

I think we should encourage people to use the functions for our sample data, enabling the addition of docstrings, online storage, and caching of data in a straightforward manner.

I will very likely submit another PR cleaning up the data used in the examples to add functions with docstrings for all the used data. Once that is in, the examples will no longer even mention data_dir. This won't be because we want to create obscure objects, but more because we want to make our data start as numpy arrays, as opposed to "png" or "jpg" files.

skimage/data/__init__.py

stefanv · 2019-06-05T05:37:47Z

@hmaarrfk You wrote: "they have to use the pooch object". I don't want users to even have to know about Pooch. They should just know that they can either download the images to disk, or get them through our data.* methods. Why complicate things by using intermediate objects?

sciunto · 2019-06-06T05:53:43Z

doc/examples/applications/plot_morphology.py

@@ -191,7 +190,7 @@ def plot_comparison(original, filtered, filter_name):
 #*single-pixel wide skeleton*. It is important to note that this is
 #performed on binary images only.

-horse = io.imread(os.path.join(data_dir, "horse.png"), as_gray=True)
+horse = horse()


Wouldn't it be preferable to have variable names != functions?

Most likely, but we should establish the bigger picture problems first.

jni · 2020-03-10T08:00:28Z

@alexdesiqueira I briefly suggested the Open Science Framework (osf.io), but @hmaarrfk wasn't too happy about their URLs. Another option is to make a separate repo + gh-pages service with cloudflare CDN. I have this set up for my blog (don't ask me how 😂), and it seems to work pretty well. That is at least as reliable as using our main repo github, but it requires a bit of extra coordination.

jni · 2020-03-10T08:01:11Z

Anyway, the failure seems to be rare enough, so I'm happy to merge and cross that bridge when we need to.

hmaarrfk · 2020-03-10T11:54:08Z

Why were the last builds cancelled?

jni · 2020-03-12T07:19:57Z

@hmaarrfk I've noticed entire builds disappearing in other PRs. I think there must have been a glitch with Travis. At any rate, this passes locally. Given approvals by @stefanv and @emmanuelle, I'm inclined to squash and merge. Thoughts?

sciunto · 2020-03-12T12:30:21Z

At some point, we must jump and trust the paragliding wing :)

leouieda · 2020-03-12T12:36:48Z

At some point, we must jump and trust the paragliding wing :)

As someone who's built parts of the paragliding wing, this makes me slightly nervous 🙂

hmaarrfk · 2020-03-12T12:46:18Z

the rebase messed up some travis build stuff. Hopefully this one will work.

sciunto · 2020-03-12T21:14:04Z

Lights are all green, let's pull the trigger.

jni · 2020-03-12T22:34:12Z

LOL that merge hash 👆 is ominous! 😅

At any rate, this is amazing! We can start using real scientific datasets in our docs! 🎉

hmaarrfk · 2020-03-13T05:10:10Z

That's crazy!

Super happy that this is in!

The project switched to pooch as the backend for data management in scikit-image/scikit-image#3945

Even though skimage.data uses lazy loading, its submodule _fetchers.py is executed when skimage is imported, because its attribute `data_dir` is imported in multiple places [1]. Previously, this led to _init_pooch() always being executed, which in turn tried to create the data directory preemptively. This led to problems when data_dir isn't writeable, e.g. when scikit-image is used in read-only containers. This refactoring alleviates this by postponing the directory creation until it is actually needed and data is being downloaded. Calling `download_all()` also ensures that legacy files are copied to `data_dir`; this use case was requested in [2] and should be preserved this way. With the previous and current state, the behavior is somewhat difficult to test with regard to self-contained tests and multi-processing / multi-threading. _fetchers.py should probably be refactored into a class whose API is less intertwined with global state; that should be easier to properly test.

[1] https://github.com/scikit-image/scikit-image/blob/51225d16ddebffacd0ccd9a54d06a673e3caff98/skimage/__init__.py#L141
[2] scikit-image#3945 (comment)
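The idea of postponing directory creation until data is actually written can be sketched as a small class, in the spirit of the refactoring suggested at the end of this message. This is a hypothetical illustration, not the code in `skimage.data`; `LazyDataDir` and its methods are made-up names.

```python
from pathlib import Path


class LazyDataDir:
    """Remember where data should live without touching the filesystem."""

    def __init__(self, path):
        self.path = Path(path)  # no directory is created here

    def __str__(self):
        return str(self.path)

    def ensure_exists(self):
        # exist_ok=True tolerates another process creating it first.
        self.path.mkdir(parents=True, exist_ok=True)
        return self.path

    def store(self, filename, payload):
        """Write downloaded bytes, creating the directory only now."""
        target = self.ensure_exists() / filename
        target.write_bytes(payload)
        return target
```

Importing code that only constructs `LazyDataDir` would then work in a read-only container, since a write is attempted only when something is actually stored.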

* Replace os.path with pathlib

* Create data_dir only on actual download

  Even though skimage.data uses lazy loading, its submodule `_fetchers.py` is executed when `skimage` is imported, because its attribute `data_dir` is imported in multiple places [1]. Previously, this led to `_init_pooch()` always being executed, which in turn tried to create the data directory preemptively. This led to problems when `data_dir` wasn't writeable, e.g. when scikit-image is used in read-only containers. This refactoring alleviates this by postponing the directory creation until it is actually needed and data is being downloaded. Calling `download_all()` also ensures that legacy files are copied to `data_dir`; this use case was requested in [2] and should be preserved this way. With the previous and current state, the behavior is somewhat difficult to test with regard to self-contained tests and multi-processing / multi-threading. `_fetchers.py` should probably be refactored into a class whose API is less intertwined with global state; that might be easier to test.

  [1] https://github.com/scikit-image/scikit-image/blob/51225d16ddebffacd0ccd9a54d06a673e3caff98/skimage/__init__.py#L141
  [2] #3945 (comment)

* Fix typo in docstring

* Fix test errors due to pathlib refactoring

  Also, examples in `ImageCollection`'s docstring [1] indicate that this is part of our API, so revert `data_dir` back to being a string.

* Fix return types

* Revert refactoring to pathlib

  Long-term, I would think of an update from `os.path` to `pathlib` as removing technical debt, but it was a bad call in the context of this fix, especially because it had unintended side effects on our API.

* Ensure cache subdir exists

* Debug failure on macos-cp3.11

* Debug: Use absolute paths

* Debug: ignore legacy path

  multipage_rgb.tif is not in our distribution archives.

* Debug: remove codecov dependency

  Super strange, but suddenly it seems that codecov has disappeared from PyPI [1]...

  [1] https://pypi.org/project/codecov/

* Debug: use --showlocal for pytest

* Debug: test hashes and pure imread

* Remove missed codecov in pyproject.toml

* Debug check sorting

* Always try cache first when fetching datasets

* Remove debug test

* Remove copy_legacy_to_cache flag in _fetch

  Instead, it is now the task of `download_all(directory=...)` to place a copy of every data_file in `directory` or, if not given, the default cache directory. This also addresses another previously undiscovered bug. Running

      import skimage as ski
      ski.data.download_all()
      ski.data.download_all(directory="example_dir")

  would not create anything in "example_dir" because `_fetch` would always return the cached entry before ever invoking pooch's cache mechanism to place it in "example_dir".

* Use proper cache_dir in _fetch without pooch too

* Expand user in download_all

  Previously, running

      import skimage as ski
      ski.data.download_all("~/skimage-data")

  would place files at two locations: files in the distribution were placed in "[working_dir]/~/skimage-data" while files downloaded with pooch were placed in /home/[user]/skimage-data. I think this was because our old os.path machinery doesn't resolve ~ while pooch uses pathlib which does. To address this we make download_all explicitly expand the user if directory is given.

---------

Co-authored-by: Stefan van der Walt
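The `~` problem described in the last item is easy to reproduce with the standard library: `os.path.abspath` treats `~` as an ordinary directory name, so the path ends up under the working directory unless the user is expanded first. The sketch below shows the difference; `normalize_directory` is a hypothetical helper and not the function used in scikit-image.

```python
import os


def normalize_directory(directory):
    """Expand a leading ~ and return an absolute path."""
    return os.path.abspath(os.path.expanduser(directory))


# Without expansion, "~" survives as a literal path component.
unexpanded = os.path.abspath("~/skimage-data")
expanded = normalize_directory("~/skimage-data")
```

Normalizing the directory once, before handing it to both code paths, keeps the distributed files and the downloaded files in the same place.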

hmaarrfk mentioned this pull request Jun 2, 2019

should data_dir.load be a public function #3947

Closed

hmaarrfk force-pushed the pooch branch from f3683e5 to acb80b5 Compare June 2, 2019 23:55

hmaarrfk commented Jun 2, 2019

hmaarrfk commented Jun 3, 2019

jni reviewed Jun 3, 2019

stefanv reviewed Jun 5, 2019

skimage/data/__init__.py (outdated, resolved)

sciunto reviewed Jun 6, 2019

Add xenial for 3.7 build

d1ac11a

jni mentioned this pull request Mar 10, 2020

Rolling Ball/Sliding Paraboloid Algorithm for background estimation #3538

Closed

3.7 travis

028b47e

sciunto merged commit bad916d into scikit-image:master Mar 12, 2020

hmaarrfk deleted the pooch branch March 13, 2020 05:10

sciunto mentioned this pull request Mar 23, 2020

Document pooch for dev and users #4522

Open

This was referenced Mar 29, 2020

remove scikit-image dependency napari/napari#1061

Merged

Contributing guide: add instructions for adding a new demo dataset using pooch #4539

Closed

mkcor mentioned this pull request Apr 20, 2020

Discuss gallery structure to bring out bioimaging examples. #4601

Open

soupault mentioned this pull request May 7, 2020

Added skimage to README fatiando/pooch#168

Merged

leouieda pushed a commit to fatiando/pooch that referenced this pull request May 7, 2020

Added skimage to README list of users (#168)

193e046

The project switched to pooch as the backend for data management in scikit-image/scikit-image#3945

This was referenced May 9, 2020

When running in multi-process Pooch tries to create a directory that already exist #4660

Closed

Pooch, usability, and the future of examples #4719

Closed

grlee77 mentioned this pull request Apr 5, 2021

Discussion: solution for large data files #3323

Closed

7 tasks

grlee77 added 🙏 Feature request and removed type: new feature labels Feb 22, 2022

lagru mentioned this pull request Apr 11, 2023

Use legacy datasets without creating a data_dir #6886

Merged



def quantitative_phase_cells():
    """Image of two cells retrieved from a digital hologram.


>>> coll = io.ImageCollection(data_dir + '/chess*.png')
>>> coll = io.ImageCollection([fetch('chessboard_GRAY.png'), fetch('coins.png')])


Conversation

hmaarrfk commented Jun 2, 2019 (edited by stefanv)

pep8speaks commented Jun 2, 2019 (edited)

Comment last updated at 2020-03-12 12:46:07 UTC

hmaarrfk commented Jun 2, 2019

emmanuelle commented Jun 2, 2019

hmaarrfk commented Jun 2, 2019


jni left a comment


leouieda commented Jun 5, 2019 (edited)

sciunto Jun 6, 2019 (edited)