Pooch -- on the fly download of datasets from github #3945


Merged (48 commits, Mar 12, 2020)

Conversation

hmaarrfk
Member

@hmaarrfk hmaarrfk commented Jun 2, 2019

Description

This PR builds up the infrastructure for Pooch. It lets users avoid downloading the datasets through PyPI.
The main advantage is that contributors can contribute larger datasets without massively increasing the size of the downloaded wheel.

xref: #3605 #3323 #3324

Proposed solution:

  1. The old datasets are considered legacy; the ability to load them is unaffected by this change.
  2. Pooch is used to download future user facing datasets. The first one of these is data.cell.
  3. Pooch is used to download test datasets.
  4. Testing from a git development build will not require pooch.
  5. Pooch is an optional dependency, not a core dependency.
  6. Tests that need datasets are skipped, and don't fail.
  7. Data is shipped in the sdist
  8. Chosen design does not require any particular organization in the git repo.
  9. data_dir remains this string-like object that points to a path on the user's local computer. It contains at least the minimum data that we choose to include.
  10. If GitHub imposes a throttling limit, we can release a patch release that fixes the issue for users.
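The fetch-on-demand flow in points 2–3 can be sketched without Pooch itself. This is a minimal cache-or-download lookup with hypothetical names (`REGISTRY`, `fetch`, the `downloader` callable), not scikit-image's actual API:

```python
import tempfile
from pathlib import Path

# Hypothetical registry of known dataset file names.
REGISTRY = {"cell.png"}

def fetch(name, cache_dir, downloader):
    """Return a local path for *name*, downloading only on a cache miss."""
    if name not in REGISTRY:
        raise ValueError(f"{name!r} is not a registered dataset")
    target = Path(cache_dir) / name
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(downloader(name))  # network hit happens only here
    return target

with tempfile.TemporaryDirectory() as cache:
    calls = []
    def fake_download(name):
        calls.append(name)
        return b"image bytes"
    p1 = fetch("cell.png", cache, fake_download)
    p2 = fetch("cell.png", cache, fake_download)
    assert p1 == p2 and calls == ["cell.png"]  # downloaded exactly once
```

This is the property that keeps the wheel small: the bytes live online, and a user only pays the download cost for the datasets they actually touch.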

The most important aspect is that it allows us to add new, larger datasets in the future, something that is impossible now.

Future enhancements:

  1. Lazy load data to avoid creating and hashing datasets when the user doesn't want to use sample data. This is currently being proposed for a particularly costly numpy import: MAINT: Lazy import testing on python >=3.7 numpy/numpy#14097

Challenges encountered:

  1. Somewhat breaks data_dir. The data_dir now contains a minimal subset of images that we can control.
  2. We had to hack around Pooch, using the git directory as a secondary cache.
  3. Depending on pooch is hard because it depends on a lot of the web stack. Not really fun to depend on, and it may negate the benefits for many users.

Results

  • On master, python setup.py bdist_wheel creates a 41 MB wheel archive
  • On this branch, it creates a 27 MB archive.

The 27 MB breaks down into 2 important contributions:

  1. The legacy datasets, approximately 7MB
  2. Cython stuff that gets stripped away in our distributions by the wheel builder.

From this, we may shave off somewhere close to 27 MB of the wheel, which would create archives on the order of 10-15 MB instead of 30-35 MB.

Checklist

For reviewers

  • Check that the PR title is short, concise, and will make sense 1 year
    later.
  • Check that new functions are imported in corresponding __init__.py.
  • Check that new features, API changes, and deprecations are mentioned in
    doc/release/release_dev.rst.
  • Consider backporting the PR with @meeseeksdev backport to v0.14.x

@pep8speaks

pep8speaks commented Jun 2, 2019

Hello @hmaarrfk! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 61:1: E402 module level import not at top of file
Line 62:1: E402 module level import not at top of file
Line 139:5: E303 too many blank lines (2)

Line 43:80: E501 line too long (109 > 79 characters)
Line 44:80: E501 line too long (112 > 79 characters)
[... 85 further E501 "line too long" warnings on lines 45-128 and 132 (87-118 > 79 characters) ...]

Comment last updated at 2020-03-12 12:46:07 UTC

@hmaarrfk
Member Author

hmaarrfk commented Jun 2, 2019

pooch requires requests; do we really want to depend on the whole HTTP stack?

@emmanuelle
Member

is the plan to have pooch as an optional dependency or a required one?

@hmaarrfk
Member Author

hmaarrfk commented Jun 2, 2019

Not too sure, I think optional dependency is nice for this case. I think it really depends on how complicated we want to make "installing scikit-image"

Our optional requirements are all very heavy.

Would we start to recommend everybody install

pip install scikit-image[optional]

@@ -32,18 +37,110 @@
'horse',
'hubble_deep_field',
'immunohistochemistry',
'lbp_frontal_face_cascade_filename',
Member Author

was this ever meant to be a public image? I kinda removed this from the API. Seemed to be important for tests only.

Member

Does it appear in the data docs? If so we should definitely keep it. Even if not we probably want to keep it since it's been public for a while...

'lfw_subset',
'logo',
'microaneurysms',
'moon',
'page',
'prefetch',
Member Author

new function added to the API. Probably needs thinking and discussion.

Member

Yep, prefetch is a bit jargony, my suggestion would be download_all

'lfw_subset',
'logo',
'microaneurysms',
'moon',
'page',
'prefetch',
'quantitative_phase_cells',
Member Author

2MB data set.



def quantitative_phase_cells():
"""Image of two cells retrieved from a digital hologram.
Member Author

This docstring definitely needs work, but I wanted to get the framework down and find the issues with the framework, rather than format docstrings for something that may never happen.

version = __version__

# Create a new friend to manage your sample data storage
image_fetcher = pooch.create(
Member Author

this should probably be prefixed with an _ to indicate it is internal???

path=pooch.os_cache("scikit-image"),
base_url=base_url,
version=version,
env="SCIKIT_IMAGE_DATA_DIR",
Member Author

This will allow developers to use:

export SCIKIT_IMAGE_DATA_DIR=/home/mark2/git/scikit-image/data

so as not to fetch online.

Member

Definitely SKIMAGE_DATADIR or similar, the current env value is way too verbose (imho)

}
)

fetch = image_fetcher.fetch
Member Author

This shortcut was really useful in migrating all the tests.

Member

👍


from ..io import imread, use_plugin
from .._shared._warnings import expected_warnings, warn
from ..util.dtype import img_as_bool
from ._binary_blobs import binary_blobs
from ._detect import lbp_frontal_face_cascade_filename

from .. import __version__

import os.path as osp
data_dir = osp.abspath(osp.dirname(__file__))
Member Author

data_dir is basically useless now.

The only migration path I can see, is to create some really hacky class, that implements the

class _data_dir:
    """String-like shim that keeps old ``data_dir`` usage working."""

    def __init__(self, image_fetcher):
        self.image_fetcher = image_fetcher

    def __str__(self):
        warn("data_dir is deprecated; use fetch() instead")
        prefetch()  # make sure the directory is fully populated
        return str(self.image_fetcher.path)

    def __add__(self, filename):
        if filename in self.image_fetcher.registry:
            warn("data_dir is deprecated; use fetch() instead")
            return self.image_fetcher.fetch(filename)
        else:  # Globbing???
            prefetch()
            return str(self.image_fetcher.path) + filename

data_dir = _data_dir(image_fetcher)

Member

Why is it useless? You think we shouldn't write to that directory ever?

Member Author

the usefulness of data_dir has been restored

expected = np.load(fetch('test/chessboard_GRAY_U8.npy'))
with open(fetch('test/chessboard_GRAY_U16.tif'), 'rb') as fh:
expected = np.load(fetch('tests/chessboard_GRAY_U8.npy'))
with open(fetch('tests/chessboard_GRAY_U16.tif'), 'rb') as fh:
Member

Why the renamed directory? So much line churn for an s...

Member Author

Because the discrepancy with the regular tree would have driven me mad

Member Author

Especially since I would have been the one that introduced it.

Member

@jni jni left a comment

@hmaarrfk thanks for making this happen! I agree that it's much easier to comment on a PR than in the abstract.

My main concern is with the existing data_dir. I think probably the best thing to do is to deprecate it over several releases. Whether we do this with a hacky class or an import hook is a different question.

One thing I'm worried about is having an os-wide data dir, which means that multiple versions of skimage would share a data directory. I guess pooch can handle it if the files change between versions (re-download if the hashes don't match), but we should worry about putting something in the user guide to avoid user confusion here. OR, we can put the version string in the default directory to avoid such conflicts.

With regard to your comment about now depending on requests: it's indeed concerning, and probably the most concerning part of this approach. However, my feeling is that if we don't use pooch we will end up reengineering something much the same, but with the base http libraries, which would be even more painful. But, I'm certainly open to being convinced that it is a better approach for us. @scikit-image/core thoughts on this?

@@ -5,3 +5,4 @@ networkx>=2.0
pillow>=4.3.0
imageio>=2.0.1
PyWavelets>=0.4.0
pooch>=0.2.1
Member

This should go into optional or a new category, recommended...


from ..io import imread, use_plugin
from .._shared._warnings import expected_warnings, warn
from ..util.dtype import img_as_bool
from ._binary_blobs import binary_blobs
from ._detect import lbp_frontal_face_cascade_filename

from .. import __version__

import os.path as osp
data_dir = osp.abspath(osp.dirname(__file__))
Member

Why is it useless? You think we shouldn't write to that directory ever?

@@ -32,18 +37,110 @@
'horse',
'hubble_deep_field',
'immunohistochemistry',
'lbp_frontal_face_cascade_filename',
Member

Does it appear in the data docs? If so we should definitely keep it. Even if not we probably want to keep it since it's been public for a while...

'lfw_subset',
'logo',
'microaneurysms',
'moon',
'page',
'prefetch',
Member

Yep, prefetch is a bit jargony, my suggestion would be download_all

'text',
'retina',
'rocket',
'stereo_motorcycle']


# Pooch expects a `+` to exist in development versions.
# Since scikit-image doesn't follow that convention, we have to manually provide
# it with the URL and set the version to None
Member

... Should we start to use that convention?

Member Author

Not in this PR.

fetch = image_fetcher.fetch


def prefetch():
Member

I suggest download_all, and also a directory=SKIMAGE_DATA_DIR keyword argument, so that anyone can use this function without knowing about the magical environment variable.
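A sketch of what that signature could look like. The names here (`REGISTRY`, `SKIMAGE_DATADIR`, the `fetch_one` callable) are illustrative, not the merged API:

```python
import os
import tempfile
from pathlib import Path

REGISTRY = ["cell.png", "coins.png"]  # hypothetical dataset registry

def download_all(directory=None, fetch_one=None):
    """Fetch every registered file into *directory*.

    Target resolution order: explicit argument, then the
    SKIMAGE_DATADIR environment variable, then an OS cache default.
    """
    if directory is None:
        directory = os.environ.get(
            "SKIMAGE_DATADIR", str(Path.home() / ".cache" / "scikit-image")
        )
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for name in REGISTRY:
        (directory / name).write_bytes(fetch_one(name))
    return directory

with tempfile.TemporaryDirectory() as tmp:
    out = download_all(directory=tmp, fetch_one=lambda name: b"data")
    assert sorted(p.name for p in out.iterdir()) == ["cell.png", "coins.png"]
```

With the explicit `directory=` argument, nobody needs to know about the environment variable to put the data somewhere specific.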


# Create a new friend to manage your sample data storage
image_fetcher = pooch.create(
path=pooch.os_cache("scikit-image"),
Member

What is pooch.os_cache by default? Could you add it as a comment?

@@ -564,7 +564,7 @@ def test_is_low_contrast():
# =======================

def test_dask_histogram():
pytest.importorskip('dask', reason="dask python library is not installed")
pytest.importorskip('dask')
Member

Can you explain this change?

Member

Same change here.

@@ -41,7 +41,7 @@ def imread(fname, dtype=None):
lazy loading) to get all the extensions at once.

"""
if 'dtype' is not None:
if dtype is not None:
Member

Wow. LOL

Member

Thanks for your fixup in a separate PR, just saw that.
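For the record, the bug in the diff above is that the guard tested the string literal `'dtype'` rather than the argument; a non-empty string is always truthy and never None, so the conversion branch ran unconditionally:

```python
def should_convert(dtype=None):
    """Corrected guard: convert only when a dtype was actually passed."""
    return dtype is not None

# What the buggy `if 'dtype' is not None:` effectively evaluated:
assert bool('dtype') is True       # the literal is always truthy
# The fix checks the argument itself:
assert should_convert() is False
assert should_convert('uint8') is True
```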


>>> coll = io.ImageCollection(data_dir + '/chess*.png')
>>> coll = io.ImageCollection([fetch('chessboard_GRAY.png'), fetch('coins.png')])
Member

I feel like this is a problem, and maybe the hacky class is needed, if pooch can't handle glob expansions. Which would make sense.

@hmaarrfk
Member Author

hmaarrfk commented Jun 3, 2019

@hmaarrfk thanks for making this happen! I agree that it's much easier to comment on a PR than in the abstract.

Yeah. Originally, I wanted to keep this PR small, just a way to stop the data from growing, but then I figured it wouldn't be hard to take it further and see all the ramifications of the switch.

I'm good with reverting to a slower rollout too. We should have a mechanism to experiment with different technologies.

Why is it useless? You think we shouldn't write to that directory ever?

In relation to data_dir: by using pooch, the user is no longer interacting with the directory itself and operating on specific files; they have to use the pooch object. I'm not sure it behaves like a full-fledged directory; if it did, it would be really cool, something like Google Drive Stream. By using pooch, we are really making the data available while taking away the directory structure.

@jni
Member

jni commented Jun 3, 2019

In relation to data_dir: by using pooch, the user is no longer interacting with the directory itself and operating on specific files; they have to use the pooch object. I'm not sure it behaves like a full-fledged directory,

This is worth investigating in depth. My hope is that using download_all will mean that users can interact with some directory on their hard drive as normal, if they so desire. (This could also minimize the impact on our tests, as we could run download_all before any testing, and then be fine as long as we redirect data_dir.)

@stefanv
Member

stefanv commented Jun 3, 2019 via email

@hmaarrfk
Member Author

hmaarrfk commented Jun 3, 2019

My hope is that using download_all will mean that users can interact with some directory on their hard drive as normal, if they so desire.

Pooch claims to do this on a per-version level. Should be nice.

This also could minimize the impact on our tests, as we could run download-all before any testing, and then be fine as long as we redirect data_dir.

Our tests should be able to utilize the environment variable to avoid the need for any downloading.

If we do want to keep the ability to interact with datadir as before, it might be worth developing a hacky class ourselves, or working with pooch on this issue.

The one thing I will warn about is that pooch dropped 3.5 support. So we would have to pin their version for a bit, at least until we drop 3.5

@stefanv
Member

stefanv commented Jun 3, 2019

@leouieda How hard would it be to patch Pooch to be 3.5 compatible again?

@leouieda

leouieda commented Jun 5, 2019

@stefanv not too hard. There isn't really any patching needed to the code itself. I just checked on my machine and all tests still pass on 3.5. It's mainly that conda-forge stopped building 3.5 packages, so we figured we could drop it. Do you have a timeline for scikit-image dropping 3.5?

We could add it back temporarily and drop 3.5 support later on when scikit-image is ready to do so. The main challenge is keeping the CI running 3.5 but we could probably manage that without much hassle. Nowhere near the pain that is keeping 2.7 happy.

@leouieda

leouieda commented Jun 5, 2019

My hope is that using download_all will mean that users can interact with some directory on their hard drive as normal, if they so desire.

This should be possible without any changes. Pooch maintains a directory with the downloaded files (see the Pooch.abspath attribute). This can be mapped to data_dir, and all the data should be there after running a download_all function.

OR, we can put the version string in the default directory to avoid such conflicts.

By providing version to pooch.create the data directory already has the version in it. So different versions of scikit-image can coexist on the OS without clashing data files.
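In other words, with `version=` set, the cache path includes a version segment, so caches for two releases never collide. A rough illustration of the idea (the exact path layout is pooch's own; this just shows the property):

```python
from pathlib import Path

def versioned_cache(base, project, version):
    """Per-version cache directory, e.g. .../scikit-image/0.17.2."""
    return Path(base) / project / version

a = versioned_cache("/home/user/.cache", "scikit-image", "0.17.2")
b = versioned_cache("/home/user/.cache", "scikit-image", "0.18.0")
assert a != b                # releases never clash
assert a.parent == b.parent  # but share the same project root
```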

@hmaarrfk
Member Author

hmaarrfk commented Jun 5, 2019

see the Pooch.abspath attribute

Cool thanks!

By providing version to pooch.create the data directory already has the version in it. So different versions of scikit-image can coexist on the OS without clashing data files.

Unfortunately, we don't include a + sign, but rather use the .dev[0-9]{0, 5} convention to denote development version numbers. Is there any way we can modify Pooch to add a custom regex to detect development versions?

I think the current plan is to ship the datasets in the repo. It would be nice to also provide a secondary local cache in case users cloned the whole git repo.

@leouieda

leouieda commented Jun 5, 2019

Is there any way we can modify Pooch to add a custom regex to detect development versions?

For sure. Our current code is very simplistic and could use some upgrading: https://github.com/fatiando/pooch/blob/master/pooch/utils.py#L78

It would be nice to also provide a secondary local cache in case users cloned the whole git repo.

Do you mean to be able to have released versions use one local cache folder (e.g. ~/.cache/scikit-image/1.4.5) and the master branch use another (e.g. ~/src/scikit-image/data)?

If so, this would probably be a bit more complicated refactor since the current implementation doesn't really know much about versions beyond appending them to the directory. I see two options that would work right now:

  1. Link or copy the cloned repository data to the local cache so it won't be downloaded
  2. In the library, check if the version relates to the master branch and set a different data directory in that case.

@hmaarrfk
Member Author

hmaarrfk commented Jun 5, 2019

For sure. Our current code is very simplistic and could use some upgrading: https://github.com/fatiando/pooch/blob/master/pooch/utils.py#L78

Cool, we can discuss it in Pooch's issue tracker. I think the current logic will work for us, for now, but it would be nice to be able to move it into Pooch.

Do you mean to be able to have released versions use one local cache folder (e.g. ~/.cache/scikit-image/1.4.5) and the master branch use another (e.g. ~/src/scikit-image/data)?

It wouldn't even need to be that complicated:

  1. Check in the cache
  2. Check in some hard-coded directory. For scikit-image, this would be something like: Path(__file__).resolve().parent.parent / 'data'
  3. Fetch from the URL online like the regular logic does.
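The three-step order above can be sketched as a resolver. Names here are hypothetical (the actual implementation landed differently inside pooch's hooks):

```python
import tempfile
from pathlib import Path

def resolve(name, cache_dir, repo_data_dir, download):
    """Resolve a data file: cache first, then a git checkout, then the network."""
    cached = Path(cache_dir) / name
    if cached.exists():                     # 1. look in the cache
        return cached
    in_repo = Path(repo_data_dir) / name
    if in_repo.exists():                    # 2. look in the cloned repo
        return in_repo
    cached.parent.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(download(name))      # 3. fall back to downloading
    return cached

with tempfile.TemporaryDirectory() as cache, tempfile.TemporaryDirectory() as repo:
    (Path(repo) / "coins.png").write_bytes(b"local")
    hits = []
    def fake_download(name):
        hits.append(name)
        return b"remote"
    # Present in the repo checkout: no download happens.
    p = resolve("coins.png", cache, repo, fake_download)
    assert p.read_bytes() == b"local" and hits == []
    # Unknown locally: fetched once, then served from the cache.
    q = resolve("cell.png", cache, repo, fake_download)
    r = resolve("cell.png", cache, repo, fake_download)
    assert q == r and hits == ["cell.png"]
```

Developers working from a git clone would hit step 2 and never touch the network.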

@hmaarrfk
Member Author

hmaarrfk commented Jun 5, 2019

I strongly support this point: files, real files with known locations

@stefanv @jni I don't see why this is important

We want to teach people to analyze their data, not load our sample images with fancy reg-exps and file manipulations.

None of the examples used the directory functionality you both allude to.

Data storage and organization seems beyond the goals of scikit-image.

I think we should encourage people to use the functions for our sample data, enabling the addition of docstrings, online storage, and caching of data in a straightforward manner.

I will very likely submit another PR cleaning up the data used in the examples, adding functions with docstrings for all the used data. Once that is in, the examples will no longer even mention data_dir. This won't be because we want to create obscure objects, but more because we want to make our data start as numpy arrays, as opposed to "png" or "jpg" files.

@stefanv
Member

stefanv commented Jun 5, 2019

@hmaarrfk You wrote: "they have to use the pooch object". I don't want users to even have to know about Pooch. They should just know that they can either download the images to disk, or get them through our data.* methods. Why complicate things by using intermediate objects?

@@ -191,7 +190,7 @@ def plot_comparison(original, filtered, filter_name):
#*single-pixel wide skeleton*. It is important to note that this is
#performed on binary images only.

horse = io.imread(os.path.join(data_dir, "horse.png"), as_gray=True)
horse = horse()
Member

@sciunto sciunto Jun 6, 2019

Wouldn't it be preferable to have variable names != functions?

Member Author

Most likely, but we should establish the bigger picture problems first.

@jni
Member

jni commented Mar 10, 2020

@alexdesiqueira I briefly suggested the Open Science Framework (osf.io), but @hmaarrfk wasn't too happy about their URLs. Another option is to make a separate repo + gh-pages service with cloudflare CDN. I have this set up for my blog (don't ask me how 😂), and it seems to work pretty well. That is at least as reliable as using our main repo github, but it requires a bit of extra coordination.

@jni
Member

jni commented Mar 10, 2020

Anyway, the failure seems to be rare enough, so I'm happy to merge and cross that bridge when we need to.

@hmaarrfk
Member Author

Why were the last builds cancelled?

@jni
Member

jni commented Mar 12, 2020

@hmaarrfk I've noticed entire builds disappearing in other PRs. I think there must have been a glitch with Travis. At any rate, this passes locally. Given approvals by @stefanv and @emmanuelle, I'm inclined to squash and merge. Thoughts?

@sciunto
Member

sciunto commented Mar 12, 2020

At some point, we must jump and trust the paragliding wing :)

@leouieda

At some point, we must jump and trust the paragliding wing :)

As someone who's built parts of the paragliding wing, this makes me slightly nervous 🙂

@hmaarrfk
Member Author

the rebase messed up some travis build stuff. Hopefully this one will work.

@sciunto
Member

sciunto commented Mar 12, 2020

Lights are all green, let's pull the trigger.

@sciunto sciunto merged commit bad916d into scikit-image:master Mar 12, 2020
@jni
Member

jni commented Mar 12, 2020

LOL that merge hash 👆 is ominous! 😅

At any rate, this is amazing! We can start using real scientific datasets in our docs! 🎉

@hmaarrfk
Member Author

That's crazy!

Super happy that this is in!

@hmaarrfk hmaarrfk deleted the pooch branch March 13, 2020 05:10
leouieda pushed a commit to fatiando/pooch that referenced this pull request May 7, 2020
The project switched to pooch as the backend for data management
in scikit-image/scikit-image#3945
lagru added a commit to lagru/scikit-image that referenced this pull request Apr 11, 2023
Even though skimage.data uses lazy loading, its submodule _fetchers.py
is executed when skimage is imported, because its attribute `data_dir`
is imported in multiple places [1]. Previously, this led to
_init_pooch() being executed always, which in turn tried to create the
data directory preemptively. This led to problems when data_dir isn't
writeable, e.g. when scikit-image is used in read-only containers.

This refactoring alleviates this by postponing the directory creation
until it is actually needed and data is being downloaded. Calling
`download_all()` also ensures that legacy files are copied to
`data_dir`; this use case was requested in [2] and should be preserved
this way.

With the previous and current state, the behavior is somewhat difficult
to test with regard to self-contained tests and multi-processing /
multi-threading. _fetchers.py should probably be refactored into a
class whose API is less intertwined with global state; that should be
easier to properly test.

[1] https://github.com/scikit-image/scikit-image/blob/51225d16ddebffacd0ccd9a54d06a673e3caff98/skimage/__init__.py#L141
[2] scikit-image#3945 (comment)
lagru added a commit to lagru/scikit-image that referenced this pull request Apr 11, 2023
Even though skimage.data uses lazy loading, its submodule `_fetchers.py`
is executed when `skimage` is imported, because its attribute `data_dir`
is imported in multiple places [1]. Previously, this led to
`_init_pooch()` being always executed, which in turn tried to create the
data directory preemptively. This led to problems when `data_dir`
wasn't writeable, e.g. when scikit-image is used in read-only
containers.

This refactoring alleviates this by postponing the directory creation
until it is actually needed and data is being downloaded. Calling
`download_all()` also ensures that legacy files are copied to
`data_dir`; this use case was requested in [2] and should be preserved
this way.

With the previous and current state, the behavior is somewhat difficult
to test with regard to self-contained tests and multi-processing /
multi-threading. `_fetchers.py` should probably be refactored into a
class whose API is less intertwined with global state; that might be
easier to test.

[1] https://github.com/scikit-image/scikit-image/blob/51225d16ddebffacd0ccd9a54d06a673e3caff98/skimage/__init__.py#L141
[2] scikit-image#3945 (comment)
stefanv added a commit that referenced this pull request Apr 14, 2023
* Replace os.path with pathlib

* Create data_dir only on actual download

Even though skimage.data uses lazy loading, its submodule `_fetchers.py`
is executed when `skimage` is imported, because its attribute `data_dir`
is imported in multiple places [1]. Previously, this led to
`_init_pooch()` always being executed, which in turn tried to create the
data directory preemptively. This led to problems when `data_dir`
wasn't writeable, e.g. when scikit-image is used in read-only
containers.

This refactoring alleviates this by postponing the directory creation
until it is actually needed and data is being downloaded. Calling
`download_all()` also ensures that legacy files are copied to
`data_dir`; this use case was requested in [2] and should be preserved
this way.

With the previous and current state, the behavior is somewhat difficult
to test with regard to self-contained tests and multi-processing /
multi-threading. `_fetchers.py` should probably be refactored into a
class whose API is less intertwined with global state; that might be
easier to test.

[1] https://github.com/scikit-image/scikit-image/blob/51225d16ddebffacd0ccd9a54d06a673e3caff98/skimage/__init__.py#L141
[2] #3945 (comment)

* Fix typo in docstring

* Fix test errors due to pathlib refactoring

Also, examples in `ImageCollection`'s docstring [1] indicate that this
is part of our API, so revert `data_dir` back to being a string.

* Fix return types

* Revert refactoring to pathlib

Long-term, I would think of an update from `os.path` to `pathlib` as
removing technical debt but it was a bad call in the context of this
fix. Especially, because it had unintended side-effects to our API.

* Ensure cache subdir exists

* Debug failure on macos-cp3.11

* Debug: Use absolute paths

* Debug: ignore legacy path

multipage_rgb.tif is not in our distribution archives.

* Debug: remove codecov dependency

Super strange, but suddenly it seems that codecov has disappeared from
PyPI [1]...

[1] https://pypi.org/project/codecov/

* Debug: use --showlocal for pytest

* Debug: test hashes and pure imread

* Remove missed codecov in pyproject.toml

* Debug check sorting

* Always try cache first when fetching datasets

* Remove debug test

* Remove copy_legacy_to_cache flag in _fetch

Instead, it is now the task of `download_all(directory=...)` to place a
copy of every data file in `directory` or, if not given, the default
cache directory.

This also addresses another previously undiscovered bug. Running

    import skimage as ski
    ski.data.download_all()
    ski.data.download_all(directory="example_dir")

would not create anything in "example_dir" because `_fetch` would always
return the cached entry before ever invoking pooch's cache mechanism to
place it in "example_dir".

* Use proper cache_dir in _fetch without pooch too

* Expand user in download_all

Previously, running

    import skimage as ski
    ski.data.download_all("~/skimage-data")

would place files at two locations: files in the distribution were
placed in "[working_dir]/~/skimage-data" while files downloaded with
pooch were placed in "/home/[user]/skimage-data". I think this was
because our old os.path machinery doesn't resolve `~` while pooch uses
pathlib, which does.

To address this we make download_all explicitly expand the user if
directory is given.
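The mismatch is reproducible with the standard library alone: `os.path.join` treats `~` as a literal path component unless `os.path.expanduser` is applied. A minimal sketch of the fix (the helper name `resolve_directory` is illustrative, not the actual function in the PR):

```python
import os


def resolve_directory(directory):
    # os.path.join leaves "~" as a literal directory name, so expand it
    # explicitly before any os.path-based code derives paths from it.
    # This makes the os.path and pooch code paths agree on one location.
    return os.path.abspath(os.path.expanduser(directory))
```

Calling this once at the top of `download_all` ensures both the copied legacy files and the pooch downloads end up under the same expanded directory.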

---------

Co-authored-by: Stefan van der Walt <[email protected]>