[MRG+1] ENH: dataset-fetching with use figshare and checksum #9240
Merged: ogrisel merged 69 commits into scikit-learn:master from massich:use_figshare_in_datasets on Aug 3, 2017.
Changes from all commits (69 commits):
773f0c5 add 20newsgroups dataset to figshare (nelson-liu)
a61c20f made link less verbose (nelson-liu)
9e64651 add olivetti to figshare (nelson-liu)
b4866e6 add lfw to figshare (nelson-liu)
7068152 add california housing dataset to figshare (nelson-liu)
2082655 add covtype dataset to figshare (nelson-liu)
ff83bd1 add kddcup99 dataset to figshare (nelson-liu)
59eae87 add species distribution dataset to figshare (nelson-liu)
f33a52c add rcv1 dataset (nelson-liu)
dfe24f9 remove extraneous parens from url strings (nelson-liu)
7186af8 check md5 of datasets and add resume functionality to downloads (nelson-liu)
4dc8946 remove extraneous print statements (nelson-liu)
7260f73 fix flake8 violations (nelson-liu)
f2c44ee add docstrings to new dataset fetching functions (nelson-liu)
f6e6ce7 consolidate imports in base and use md5 check function in dl (nelson-liu)
983544e remove accidentally removed import (nelson-liu)
03f7f82 attempt to fix docstring conventions / handle case where range header… (nelson-liu)
9d39dd0 change functions to used renamed, privatized utilities (nelson-liu)
5eadb3a fix flake8 indentation error (nelson-liu)
79a0325 remove checks for joblib dumped files (nelson-liu)
29deaa5 fix error in lfw (nelson-liu)
269d028 Merge branch 'master' into use_figshare_in_datasets (nelson-liu)
773aa48 Add missing Bunch import in california housing (nelson-liu)
11c15db Remove hash validation of 20news output pkl (nelson-liu)
f367815 Remove unused import (nelson-liu)
1637adb Rebase 'master' into use_figshare_in_datasets
d11bc7a address missing comments in #7429 to start the PR fresh
ef89676 update _fetch_and_verify_dataset function
7cf9422 update URL10
d604d49 Use strerr compatible with python2
7309779 Use warnings instead of StdErr (suggested by @lesteve)
0f7e66c Fix pep8
0a9ca7d Replace MD5 by SHA256
083acda Fix cal_housing fetcher for the case of having the data locally
f48a919 Merge branch 'master' into use_figshare_in_datasets
38a4c02 Revert removing file when checksum fails
c9db0f3 Keep covertype's original URL as a comment
f991b2b Rework the docstrings
fa1559f Remove partial download
b8d8d5a Add download compatibility with python 2.x
949d998 Add comment to clarify the usage passing a zipfile to np.load
7efa606 Fix typo
fead360 simplify some docstrings and functions
e7db2d8 Removed wired dictionaries to store remote metadata for lfw dataset
6601cbd fixup! fix flake8 violations
2ffcfc1 Fix rcv1 and rename path to filename (lesteve)
02f5a7d Cosmit (lesteve)
f54eabd Add lfw missing checksum
3c210c2 Unify fetchers to use RemoteMetaData
a897f9f revert logger info in favor of warning
88d7f61 Add original urls as comments and tides up PY3_OR_LATER
22130a9 use urlretrieve from six
d4f9456 remove fetch_url
38ba738 Rename _fetch_remote path parameter into dirname (lesteve)
5dfdafb Use variable to remove repeated code (lesteve)
1286364 Return file_path from _fetch_remote (lesteve)
240bfe5 Remove blank lines after comments (lesteve)
60b1153 List all links (lesteve)
d1250a8 Fix lfw (lesteve)
580b131 Tweak comment (lesteve)
7295474 Use returned value for _fetch_remote (lesteve)
076efb1 Rename variable (lesteve)
7fc6627 Minor changes (lesteve)
de80947 checksum fix (lesteve)
ba862fb Remove unused imports (lesteve)
7a5b9b6 Comment minor tweak
29a0301 Convert list of remotes into tuple of remotes to ensure immutability
bf869a6 Move from print statements to logging
6daa256 Configure root logger in sklearn/__init__.py (lesteve)
sklearn/datasets/base.py

@@ -6,39 +6,40 @@
 # 2010 Fabian Pedregosa <[email protected]>
 # 2010 Olivier Grisel <[email protected]>
 # License: BSD 3 clause
 from __future__ import print_function

 import os
 import csv
 import sys
 import shutil
-from os import environ
-from os.path import dirname
-from os.path import join
-from os.path import exists
-from os.path import expanduser
-from os.path import isdir
-from os.path import splitext
-from os import listdir
-from os import makedirs
+from collections import namedtuple
+from os import environ, listdir, makedirs
+from os.path import dirname, exists, expanduser, isdir, join, splitext
+import hashlib

 from ..utils import Bunch
+from ..utils import check_random_state

 import numpy as np

-from ..utils import check_random_state
+from sklearn.externals.six.moves.urllib.request import urlretrieve
+
+RemoteFileMetadata = namedtuple('RemoteFileMetadata',
+                                ['filename', 'url', 'checksum'])


 def get_data_home(data_home=None):
     """Return the path of the scikit-learn data dir.

-    This folder is used by some large dataset loaders to avoid
-    downloading the data several times.
+    This folder is used by some large dataset loaders to avoid downloading the
+    data several times.

-    By default the data dir is set to a folder named 'scikit_learn_data'
-    in the user home folder.
+    By default the data dir is set to a folder named 'scikit_learn_data' in the
+    user home folder.

     Alternatively, it can be set by the 'SCIKIT_LEARN_DATA' environment
-    variable or programmatically by giving an explicit folder path. The
-    '~' symbol is expanded to the user home folder.
+    variable or programmatically by giving an explicit folder path. The '~'
+    symbol is expanded to the user home folder.

     If the folder does not already exist, it is automatically created.
     """

Review discussion on the `from __future__ import print_function` line: "Is there any print statement?" Reply: "it is used in doctests, lines 509 and 674."
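To make the new plumbing concrete, here is a sketch of how a fetcher can declare its remote file with `RemoteFileMetadata`. The filename, figshare file id, and checksum below are placeholders, not values from this PR:

```python
from collections import namedtuple

RemoteFileMetadata = namedtuple('RemoteFileMetadata',
                                ['filename', 'url', 'checksum'])

# Placeholder metadata: the real fetchers record the figshare download
# URL and the SHA256 digest computed when the file was uploaded.
ARCHIVE = RemoteFileMetadata(
    filename='some_dataset.tgz',
    url='https://ndownloader.figshare.com/files/<file-id>',
    checksum='<sha256 hex digest of the uploaded archive>')
```

Using an immutable namedtuple (later commits convert lists of remotes into tuples as well) means a fetcher cannot accidentally mutate the pinned URL or checksum at runtime.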
@@ -76,23 +77,22 @@ def load_files(container_path, description=None, categories=None,
     file_44.txt
     ...

-    The folder names are used as supervised signal label names. The
-    individual file names are not important.
+    The folder names are used as supervised signal label names. The individual
+    file names are not important.

-    This function does not try to extract features into a numpy array or
-    scipy sparse matrix. In addition, if load_content is false it
-    does not try to load the files in memory.
+    This function does not try to extract features into a numpy array or scipy
+    sparse matrix. In addition, if load_content is false it does not try to
+    load the files in memory.

-    To use text files in a scikit-learn classification or clustering
-    algorithm, you will need to use the `sklearn.feature_extraction.text`
-    module to build a feature extraction transformer that suits your
-    problem.
+    To use text files in a scikit-learn classification or clustering algorithm,
+    you will need to use the `sklearn.feature_extraction.text` module to build
+    a feature extraction transformer that suits your problem.

-    If you set load_content=True, you should also specify the encoding of
-    the text using the 'encoding' parameter. For many modern text files,
-    'utf-8' will be the correct encoding. If you leave encoding equal to None,
-    then the content will be made of bytes instead of Unicode, and you will
-    not be able to use most functions in `sklearn.feature_extraction.text`.
+    If you set load_content=True, you should also specify the encoding of the
+    text using the 'encoding' parameter. For many modern text files, 'utf-8'
+    will be the correct encoding. If you leave encoding equal to None, then the
+    content will be made of bytes instead of Unicode, and you will not be able
+    to use most functions in `sklearn.feature_extraction.text`.

     Similar feature extractors should be built for other kind of unstructured
     data input such as images, audio, video, ...

@@ -109,20 +109,19 @@ def load_files(container_path, description=None, categories=None,
         reference, etc.

     categories : A collection of strings or None, optional (default=None)
-        If None (default), load all the categories.
-        If not None, list of category names to load (other categories ignored).
+        If None (default), load all the categories. If not None, list of
+        category names to load (other categories ignored).

     load_content : boolean, optional (default=True)
-        Whether to load or not the content of the different files. If
-        true a 'data' attribute containing the text information is present
-        in the data structure returned. If not, a filenames attribute
-        gives the path to the files.
+        Whether to load or not the content of the different files. If true a
+        'data' attribute containing the text information is present in the data
+        structure returned. If not, a filenames attribute gives the path to the
+        files.

     encoding : string or None (default is None)
-        If None, do not try to decode the content of the files (e.g. for
-        images or other non-text content).
-        If not None, encoding to use to decode text files to Unicode if
-        load_content is True.
+        If None, do not try to decode the content of the files (e.g. for images
+        or other non-text content). If not None, encoding to use to decode text
+        files to Unicode if load_content is True.

     decode_error : {'strict', 'ignore', 'replace'}, optional
         Instruction on what to do if a byte sequence is given to analyze that
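Since the `load_files` contract documented above is easy to misread, here is a short usage sketch; the directory path and layout are assumed for illustration and are not part of the diff:

```python
from sklearn.datasets import load_files

# Assumes a container_path/<category>/<file> layout as described above;
# folder names become the label names.
dataset = load_files('/path/to/container', encoding='utf-8',
                     decode_error='replace')
print(dataset.target_names)   # folder names used as labels
print(dataset.data[0][:100])  # decoded text of the first file
```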
@@ -262,16 +261,15 @@ def load_wine(return_X_y=False):
     Returns
     -------
     data : Bunch
-        Dictionary-like object, the interesting attributes are:
-        'data', the data to learn, 'target', the classification labels,
-        'target_names', the meaning of the labels, 'feature_names', the
-        meaning of the features, and 'DESCR', the
-        full description of the dataset.
+        Dictionary-like object, the interesting attributes are: 'data', the
+        data to learn, 'target', the classification labels, 'target_names', the
+        meaning of the labels, 'feature_names', the meaning of the features,
+        and 'DESCR', the full description of the dataset.

     (data, target) : tuple if ``return_X_y`` is True

-    The copy of UCI ML Wine Data Set dataset is
-    downloaded and modified to fit standard format from:
+    The copy of UCI ML Wine Data Set dataset is downloaded and modified to fit
+    standard format from:
     https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

     Examples

@@ -332,8 +330,8 @@ def load_iris(return_X_y=False):
     Parameters
     ----------
     return_X_y : boolean, default=False.
-        If True, returns ``(data, target)`` instead of a Bunch object.
-        See below for more information about the `data` and `target` object.
+        If True, returns ``(data, target)`` instead of a Bunch object. See
+        below for more information about the `data` and `target` object.

         .. versionadded:: 0.18
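For the `return_X_y` flag documented above, a quick illustration with `load_iris`:

```python
from sklearn.datasets import load_iris

# Default: a Bunch with data/target/target_names/DESCR attributes.
iris = load_iris()
print(iris.data.shape, iris.target_names)

# With return_X_y=True (added in 0.18): just the (data, target) tuple.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)
```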
@@ -709,15 +707,15 @@ def load_boston(return_X_y=False):

 def load_sample_images():
     """Load sample images for image manipulation.

     Loads both, ``china`` and ``flower``.

     Returns
     -------
     data : Bunch
-        Dictionary-like object with the following attributes :
-        'images', the two sample images, 'filenames', the file
-        names for the images, and 'DESCR'
-        the full description of the dataset.
+        Dictionary-like object with the following attributes : 'images', the
+        two sample images, 'filenames', the file names for the images, and
+        'DESCR' the full description of the dataset.

     Examples
     --------

@@ -799,18 +797,18 @@ def load_sample_image(image_name):
 def _pkl_filepath(*args, **kwargs):
     """Ensure different filenames for Python 2 and Python 3 pickles

-    An object pickled under Python 3 cannot be loaded under Python 2.
-    An object pickled under Python 2 can sometimes not be loaded
-    correctly under Python 3 because some Python 2 strings are decoded as
-    Python 3 strings which can be problematic for objects that use Python 2
-    strings as byte buffers for numerical data instead of "real" strings.
+    An object pickled under Python 3 cannot be loaded under Python 2. An object
+    pickled under Python 2 can sometimes not be loaded correctly under Python 3
+    because some Python 2 strings are decoded as Python 3 strings which can be
+    problematic for objects that use Python 2 strings as byte buffers for
+    numerical data instead of "real" strings.

     Therefore, dataset loaders in scikit-learn use different files for pickles
-    manages by Python 2 and Python 3 in the same SCIKIT_LEARN_DATA folder so
-    as to avoid conflicts.
+    manages by Python 2 and Python 3 in the same SCIKIT_LEARN_DATA folder so as
+    to avoid conflicts.

-    args[-1] is expected to be the ".pkl" filename. Under Python 3, a
-    suffix is inserted before the extension to s
+    args[-1] is expected to be the ".pkl" filename. Under Python 3, a suffix is
+    inserted before the extension to s

     _pkl_filepath('/path/to/folder', 'filename.pkl') returns:
       - /path/to/folder/filename.pkl under Python 2
@@ -823,3 +821,50 @@ def _pkl_filepath(*args, **kwargs):
     basename += py3_suffix
     new_args = args[:-1] + (basename + ext,)
     return join(*new_args)
+
+
+def _sha256(path):
+    """Calculate the sha256 hash of the file at path."""
+    sha256hash = hashlib.sha256()
+    chunk_size = 8192
+    with open(path, "rb") as f:
+        while True:
+            buffer = f.read(chunk_size)
+            if not buffer:
+                break
+            sha256hash.update(buffer)
+    return sha256hash.hexdigest()
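`_sha256` reads in 8192-byte chunks so that multi-hundred-megabyte archives are hashed without being loaded into memory at once. Chunked hashing yields the same digest as one-shot hashing, as this small self-contained check illustrates:

```python
import hashlib
import tempfile

# Write a throwaway file to hash.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'x' * 100000)
    path = f.name

# Whole-file digest...
with open(path, 'rb') as f:
    whole = hashlib.sha256(f.read()).hexdigest()

# ...matches the chunked digest computed the way _sha256 does it.
h = hashlib.sha256()
with open(path, 'rb') as f:
    for chunk in iter(lambda: f.read(8192), b''):
        h.update(chunk)
assert whole == h.hexdigest()
```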
+
+
+def _fetch_remote(remote, dirname=None):
+    """Helper function to download a remote dataset into path
+
+    Fetch a dataset pointed by remote's url, save into path using remote's
+    filename and ensure its integrity based on the SHA256 Checksum of the
+    downloaded file.
+
+    Parameters
+    -----------
+    remote : RemoteFileMetadata
+        Named tuple containing remote dataset meta information: url, filename
+        and checksum
+
+    dirname : string
+        Directory to save the file to.
+
+    Returns
+    -------
+    file_path: string
+        Full path of the created file.
+    """
+
+    file_path = (remote.filename if dirname is None
+                 else join(dirname, remote.filename))
+    urlretrieve(remote.url, file_path)
+    checksum = _sha256(file_path)
+    if remote.checksum != checksum:
+        raise IOError("{} has an SHA256 checksum ({}) "
+                      "differing from expected ({}), "
+                      "file may be corrupted.".format(file_path, checksum,
+                                                      remote.checksum))
+    return file_path
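Putting the helpers together, a dataset fetcher built on this machinery would look roughly like the sketch below. This is an illustration only: `fetch_my_dataset`, the archive name, URL, and checksum are hypothetical, and the actual fetchers in this PR add their own parsing and caching on top.

```python
from os.path import exists, join

# Placeholder metadata; real fetchers pin a figshare URL and the SHA256
# digest recorded when the file was uploaded.
ARCHIVE = RemoteFileMetadata(
    filename='data.tar.gz',
    url='https://ndownloader.figshare.com/files/<file-id>',
    checksum='<expected sha256 hex digest>')


def fetch_my_dataset(data_home=None, download_if_missing=True):
    data_home = get_data_home(data_home=data_home)
    archive_path = join(data_home, ARCHIVE.filename)
    if not exists(archive_path):
        if not download_if_missing:
            raise IOError("Data not found and download_if_missing is False")
        # Downloads into data_home, then raises IOError if the SHA256
        # digest of the downloaded file mismatches the pinned checksum.
        archive_path = _fetch_remote(ARCHIVE, dirname=data_home)
    # ... decompress and load archive_path ...
```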
Review discussion on the root-logger configuration (sklearn/__init__.py):

"Why not `"sklearn"` instead of `__name__`?"

Reply: "I guess this is just the general convention, right? I found it recommended in the Python docs and in the Hitchhiker's Guide to Python."
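For reference, the convention in question is a module-level logger named after the module's import path; inside sklearn/__init__.py the two spellings coincide, since `__name__` evaluates to `'sklearn'` there. A minimal sketch:

```python
import logging

# Module-level logger named after the module's import path, the
# convention recommended by the stdlib logging docs. When this line
# runs inside sklearn/__init__.py, __name__ == 'sklearn', so it refers
# to the very same logger object as logging.getLogger('sklearn').
logger = logging.getLogger(__name__)
```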