MNT: replace __file__ with importlib.resources (compat with pyOxidizer and zipapp) #20297
|
|
||
| DATA_MODULE = "sklearn.datasets.data" | ||
| DESCR_MODULE = "sklearn.datasets.descr" | ||
| IMAGES_MODULE = "sklearn.datasets.images" |
Both string and module work. Using string here so that it can be more conveniently serialized (e.g. pickle) as part of a Bunch.
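A minimal sketch of the serialization point, assuming a plain dict as a stand-in for the `Bunch` (the key names below are illustrative, not necessarily the ones used in the PR): a package name kept as a string round-trips through `pickle`, whereas a module object does not.

```python
import pickle
import types

# Hypothetical stand-in for the Bunch returned by a dataset loader.
DATA_MODULE = "sklearn.datasets.data"
bunch_like = {"filename": "iris.csv", "data_module": DATA_MODULE}

# A package name stored as a string survives a pickle round trip unchanged.
restored = pickle.loads(pickle.dumps(bunch_like))

# A module object, by contrast, is rejected by pickle (any module would do;
# the stdlib `types` module is used here only as an example).
try:
    pickle.dumps({"data_module": types})
    module_is_picklable = True
except TypeError:
    module_is_picklable = False
```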
sklearn/datasets/_base.py
Outdated
| feature_names=feature_names, | ||
| filename=csv_filename, | ||
| filename=data_file_name, | ||
| file_module=DATA_MODULE |
To be compatible with pyOxidizer and zipapp, we can no longer pass around an absolute file path. Instead, we separately pass around filename (just the name of the file) and file_module that the file lives in.
sklearn/datasets/tests/test_base.py
Outdated
| assert all([os.path.exists(bunch.get(f, False)) for f in filenames]) | ||
| assert all([ | ||
| f in bunch and 'file_module' in bunch and | ||
| importlib_resources.is_resource(bunch['file_module'], bunch[f]) |
Modifying the test here to remove checks on the absolute paths.
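For illustration, the check can be sketched against a throwaway package instead of scikit-learn. The package name `demo_pkg_data`, the file `toy.csv`, the dict used in place of a `Bunch`, and the helper `resource_exists` are all made up for this example; only `importlib.resources.is_resource` is the call used in the test above.

```python
import importlib
import importlib.resources
import sys
import tempfile
import warnings
from pathlib import Path

# Build a throwaway package on disk and make it importable.
package_root = Path(tempfile.mkdtemp())
package_dir = package_root / "demo_pkg_data"
package_dir.mkdir()
(package_dir / "__init__.py").write_text("")
(package_dir / "toy.csv").write_text("a,b\n1,2\n")
sys.path.insert(0, str(package_root))
importlib.invalidate_caches()

bunch_like = {"filename": "toy.csv", "data_module": "demo_pkg_data"}


def resource_exists(bunch, key):
    """Return True if bunch[key] names a resource inside bunch['data_module']."""
    if key not in bunch or "data_module" not in bunch:
        return False
    with warnings.catch_warnings():
        # is_resource is flagged as deprecated on some Python versions.
        warnings.simplefilter("ignore", DeprecationWarning)
        return importlib.resources.is_resource(bunch["data_module"], bunch[key])


found = resource_exists(bunch_like, "filename")
missing = resource_exists(
    {"filename": "nope.csv", "data_module": "demo_pkg_data"}, "filename"
)
```

No absolute path is built or checked at any point, which is what lets the same assertion hold when the package is not stored as plain files.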
sklearn/datasets/_base.py
Outdated
| descr = load_descr(IMAGES_MODULE, 'README.txt') | ||
|
|
||
| images_module_traversable = importlib_resources.files(IMAGES_MODULE) | ||
| with importlib_resources.as_file(images_module_traversable) as images_dir: |
I think this is only available in Python >= 3.9, so it could be problematic. We probably need to find another (hopefully clean) way to do this.
We can use the backport library importlib_resources for python<=3.8 as noted here.
@jackzyliu Do you know if this is the reason for the failures in the PR checks? If so shall we try to make sure everything else is working and tackle the images after?
Yea, it is.
I will make the images portion backward-compatible.
The point I expressed in #20150 (comment) is to avoid using the open() Python builtin anywhere and instead use the functions importlib.resources.open_binary/open_text or importlib.resources.read_binary/read_text documented here:
https://docs.python.org/3/library/importlib.html#module-importlib.resources
The goal is to avoid assuming, at any point, that those resource names are actual filenames of resources stored on a filesystem.
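To make the motivation concrete, here is a small sketch that puts a package inside a zip archive, so that no file named `toy.csv` exists on the filesystem, and still reads it through `importlib.resources.open_text`. The archive, the package name `demo_zipped_data`, and the helper `build_zipped_package` are invented for the example.

```python
import importlib
import importlib.resources
import sys
import tempfile
import warnings
import zipfile
from pathlib import Path


def build_zipped_package(zip_path, package_name, resource_name, text):
    """Write a one-resource package into a zip archive (illustrative helper)."""
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr(f"{package_name}/__init__.py", "")
        archive.writestr(f"{package_name}/{resource_name}", text)


tmp_dir = Path(tempfile.mkdtemp())
zip_path = tmp_dir / "bundle.zip"
build_zipped_package(zip_path, "demo_zipped_data", "toy.csv", "a,b\n1,2\n")

# Importing from the archive is enough; nothing is extracted to disk.
sys.path.insert(0, str(zip_path))
importlib.invalidate_caches()

with warnings.catch_warnings():
    # open_text is flagged as deprecated on some Python versions.
    warnings.simplefilter("ignore", DeprecationWarning)
    with importlib.resources.open_text("demo_zipped_data", "toy.csv") as csv_file:
        rows = [line.strip().split(",") for line in csv_file]

# The resource was never a standalone file next to the archive.
extracted_copy_exists = (tmp_dir / "demo_zipped_data" / "toy.csv").exists()
```

Code that joined a directory path with `toy.csv` and called `open()` would fail in this setup, since the only file on disk is `bundle.zip`.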
Gotcha. I will make these changes accordingly and make the |
sklearn/datasets/_base.py
Outdated
| target_names[0] is the name of the target[0] class. | ||
| """ | ||
| with open(join(module_path, "data", data_file_name)) as csv_file: | ||
| with importlib_resources.open_text(data_module, data_file_name) as csv_file: |
.open_text has encoding="utf-8" by default (reference)
Happy to make it explicit though.
I think it is fine to rely on the default utf-8 encoding. I am pretty sure that all our CSV files only need the ASCII encoding, and ASCII is a valid subset of UTF-8 as far as I know.
We just needed to ensure that the loader would not implicitly try to use a platform dependent encoding.
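A quick way to convince oneself of the subset argument, using a made-up ASCII-only CSV row rather than one of the bundled files:

```python
import io

csv_bytes = b"5.1,3.5,1.4,0.2,0\n"  # made-up ASCII-only row

# Decoding ASCII-only bytes gives the same text under both codecs.
as_ascii = csv_bytes.decode("ascii")
as_utf8 = csv_bytes.decode("utf-8")

# Naming the encoding explicitly keeps the reader independent of the
# platform default encoding.
with io.TextIOWrapper(io.BytesIO(csv_bytes), encoding="utf-8") as text_stream:
    first_line = text_stream.readline()
```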
sklearn/datasets/_base.py
Outdated
| return data, target, target_names | ||
|
|
||
|
|
||
| def load_gzip_compressed_csv_data(data_module, data_file_name, **kwargs): |
I don't think we can ask np.loadtxt to decompress if we are passing in a file object (as opposed to a file name ending with '.gz'). So for the few .csv.gz files, we first manually decompress the file object and then use np.loadtxt to load the data as before.
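A rough sketch of the manual decompression step, assuming an in-memory stream in place of the packaged resource. To keep the example standard-library only, `csv.reader` stands in for `np.loadtxt`; the payload is made up.

```python
import csv
import gzip
import io

# Made-up CSV payload, compressed in memory to mimic a packaged .csv.gz resource.
raw_csv = "1.0,2.0\n3.0,4.0\n"
compressed_stream = io.BytesIO(gzip.compress(raw_csv.encode("utf-8")))

# The stream holds gzip bytes (magic number 0x1f 0x8b), not CSV text,
# and it carries no file name from which a parser could infer compression.
magic = compressed_stream.read(2)
compressed_stream.seek(0)

# gzip.open accepts an existing binary file object and can decode to text.
with gzip.open(compressed_stream, mode="rt", encoding="utf-8") as text_stream:
    # csv.reader stands in for np.loadtxt here.
    data = [[float(value) for value in row] for row in csv.reader(text_stream)]
```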
| ("breast_cancer.csv", 569, 30, ["malignant", "benign"]), | ||
| ], | ||
| ) | ||
| def test_load_csv_data( |
Adding standalone tests for load_csv_data and load_gzip_compressed_csv_data.
sklearn/datasets/_base.py
Outdated
|
|
||
| filenames = list() | ||
| images = list() | ||
| for filename in sorted(importlib_resources.contents(IMAGES_MODULE)): |
We can directly use .contents() (which returns a list of file names and is available since python 3.7) to avoid using .files(). It was my mistake not remembering this option.
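As an illustration, listing and filtering resource names with `.contents()` might look like the sketch below. The package `demo_images_pkg` and its files are placeholders created on the fly; on recent Python versions `importlib.resources.files(...).iterdir()` is the documented replacement for `.contents()`.

```python
import importlib
import importlib.resources
import sys
import tempfile
import warnings
from pathlib import Path

# Throwaway package with two fake images and a README.
package_root = Path(tempfile.mkdtemp())
package_dir = package_root / "demo_images_pkg"
package_dir.mkdir()
(package_dir / "__init__.py").write_text("")
for name in ("flower.jpg", "china.jpg", "README.txt"):
    (package_dir / name).write_bytes(b"placeholder")
sys.path.insert(0, str(package_root))
importlib.invalidate_caches()

with warnings.catch_warnings():
    # contents() is flagged as deprecated on some Python versions.
    warnings.simplefilter("ignore", DeprecationWarning)
    resource_names = list(importlib.resources.contents("demo_images_pkg"))

# contents() yields plain names (including __init__.py), so filter by suffix.
image_names = sorted(name for name in resource_names if name.endswith(".jpg"))
```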
Just updated according to your comments. Is it more along the lines of what you were thinking? I think the failing tests come from main. Will rebase after those are fixed.
At this point, all the conversions are done for For ease of review, I have one separate, mostly self-contained commit for each of the following components:
Let me know if you have any comments.
Others (should probably be separate PR(s) since
Updated what's new, and added a few
ogrisel
left a comment
Thanks @jackzyliu. This looks good overall. Here are a few suggestions for further improvement. Once done, I think this will be ready for merge on my end.
sklearn/datasets/_base.py
Outdated
| fdescr = importlib_resources.read_text(descr_module, descr_file_name) | ||
|
|
||
| return fdescr |
As far as I can see most of the descriptions are loaded from DESCR_MODULE so to avoid redundancy we could have:
load_descr(descr_file_name, descr_module=DESCR_MODULE):
that would make it possible to only import load_descr without having to import the DESCR_MODULE constant most of the time.
For convenience you could add optional descr_file_name=None / descr_module=DESCR_MODULE parameters to load_csv_data and load_gzip_compressed_csv_data and, when descr_file_name is not None, make those functions return the loaded description as an additional return value.
As far as I can see most of the descriptions are loaded from DESCR_MODULE so to avoid redundancy we could have:
load_descr(descr_file_name, descr_module=DESCR_MODULE):
that would make it possible to only import load_descr without having to import the DESCR_MODULE constant most of the time.
By extension, I am applying the same changes to load_csv_data and load_gzip_compressed_csv_data to avoid the same redundancy since these data are always loaded from DATA_MODULE.
load_csv_data(data_file_name, data_module=DATA_MODULE)
And since importlib.resources functions take the resource module before the resource name as positional arguments, I am enforcing keyword-only arguments to avoid usage errors.
load_csv_data(data_file_name, *, data_module=DATA_MODULE)
load_descr(descr_file_name, *, descr_module=DESCR_MODULE)
Lastly, combined with the suggestion to add optional descr_file_name=None, descr_module=DESCR_MODULE, we have the following function signature:
load_csv_data(data_file_name, *, data_module=DATA_MODULE, descr_file_name=None, descr_module=DESCR_MODULE)
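A simplified sketch of how these signatures behave, under several assumptions: the two throwaway packages stand in for `sklearn.datasets.data` and `sklearn.datasets.descr`, `csv.reader` replaces the real parsing, and the function returns bare rows rather than `data, target, target_names`. It is not the code from the PR, only an illustration of the keyword-only and optional-description behaviour. Some Python versions flag `open_text`/`read_text` as deprecated.

```python
import csv
import importlib
import importlib.resources
import sys
import tempfile
from pathlib import Path

# Throwaway packages standing in for the data and description modules.
package_root = Path(tempfile.mkdtemp())
for package_name, file_name, text in [
    ("demo_sig_data", "toy.csv", "1,2,0\n3,4,1\n"),
    ("demo_sig_descr", "toy.rst", "Toy dataset\n"),
]:
    package_dir = package_root / package_name
    package_dir.mkdir()
    (package_dir / "__init__.py").write_text("")
    (package_dir / file_name).write_text(text)
sys.path.insert(0, str(package_root))
importlib.invalidate_caches()

DATA_MODULE = "demo_sig_data"
DESCR_MODULE = "demo_sig_descr"


def load_descr(descr_file_name, *, descr_module=DESCR_MODULE):
    """Return the text of descr_file_name found in descr_module."""
    return importlib.resources.read_text(descr_module, descr_file_name)


def load_csv_data(data_file_name, *, data_module=DATA_MODULE,
                  descr_file_name=None, descr_module=DESCR_MODULE):
    """Return parsed rows, plus the description when descr_file_name is given."""
    with importlib.resources.open_text(data_module, data_file_name) as csv_file:
        rows = [[int(value) for value in row] for row in csv.reader(csv_file)]
    if descr_file_name is None:
        return rows
    return rows, load_descr(descr_file_name, descr_module=descr_module)


rows = load_csv_data("toy.csv")
rows_again, descr = load_csv_data("toy.csv", descr_file_name="toy.rst")

# Passing the module positionally is rejected, which guards against
# swapping the module and file name by accident.
try:
    load_csv_data("toy.csv", "demo_sig_data")
    positional_module_accepted = True
except TypeError:
    positional_module_accepted = False
```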
| gzip_encoding = kwargs["encoding"] | ||
|
|
||
| compressed_file = gzip.open(compressed_file, mode="rt", encoding=gzip_encoding) | ||
| data = np.loadtxt(compressed_file, **kwargs) |
I don't understand why we would pass the encoding kwarg to np.loadtxt if it has already been used by gzip.open to decode the text.
Either call gzip.open with mode="rb" and only pass encoding to np.loadtxt so that it handles the text decoding itself, or pass it only to gzip.open with mode="rt". In the latter case it would make sense to have load_gzip_compressed_csv_data(data_module, data_file_name, encoding="utf-8", **kwargs) so that the code only forwards kwargs to np.loadtxt.
Good point. Although encoding is passed around a bunch of times in np.loadtxt, it's not really used when the file is already in text.
Going with the latter option.
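The latter option could look roughly like this. The function name `load_gzip_csv_rows` is hypothetical, it takes an already opened binary stream rather than a module and file name, and `csv.reader` stands in for `np.loadtxt`; the point is only that `encoding` is consumed by `gzip.open` while the remaining keyword arguments go to the parser.

```python
import csv
import gzip
import io


def load_gzip_csv_rows(compressed_file, *, encoding="utf-8", **kwargs):
    """Decode with gzip.open only; forward the remaining kwargs to the parser."""
    with gzip.open(compressed_file, mode="rt", encoding=encoding) as text_stream:
        # csv.reader stands in for np.loadtxt; it never sees `encoding`.
        return [row for row in csv.reader(text_stream, **kwargs)]


# Made-up semicolon-separated payload, compressed in memory.
payload = io.BytesIO(gzip.compress("a;b\n1;2\n".encode("utf-8")))
rows = load_gzip_csv_rows(payload, delimiter=";")
```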
sklearn/datasets/_base.py
Outdated
| data_file_name = "iris.csv" | ||
| data, target, target_names = load_csv_data(DATA_MODULE, data_file_name) |
I think we can avoid the extra indirection in a local variable that is used only once.
| data_file_name = "iris.csv" | |
| data, target, target_names = load_csv_data(DATA_MODULE, data_file_name) | |
| data, target, target_names = load_csv_data(DATA_MODULE, "iris.csv") |
data_file_name is actually used again in constructing the Bunch below, which is why I used a variable. Do you still want me to make the change?
Edit: I think the same goes for breast_cancer.csv below.
sklearn/datasets/_base.py
Outdated
| data_file_name = "breast_cancer.csv" | ||
| data, target, target_names = load_csv_data(DATA_MODULE, data_file_name) |
| data_file_name = "breast_cancer.csv" | |
| data, target, target_names = load_csv_data(DATA_MODULE, data_file_name) | |
| data, target, target_names = load_csv_data(DATA_MODULE, "breast_cancer.csv") |
| assert len(bunch.target_names) == n_target | ||
| if has_descr: | ||
| assert bunch.DESCR | ||
| if filenames: |
If we expect a non-empty list of filenames, I think we should require the data module to be present in the bunch.
sklearn/datasets/tests/test_base.py
Outdated
| assert all( | ||
| [ | ||
| f in bunch | ||
| and "file_module" in bunch |
I would rather use the data_module field in all those Bunch instances:
| and "file_module" in bunch | |
| and "data_module" in bunch |
| ) as modified_gzip: | ||
| original_data_module = OPENML_TEST_DATA_MODULE + "." + f"id_{data_id}" | ||
| original_data_file_name = "data-v1-dl-1666876.arff.gz" | ||
| corrupt_copy_path = os.path.join(tmpdir, "test_invalid_checksum.arff") |
nitpick: I am pretty sure this can be simplified as:
| corrupt_copy_path = os.path.join(tmpdir, "test_invalid_checksum.arff") | |
| corrupt_copy_path = tmpdir / "test_invalid_checksum.arff" |
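For reference, the `/` form builds the same path as `os.path.join`. The sketch below uses `pathlib.Path` as a stand-in for the pytest fixture, which is an assumption made to keep the example self-contained; the path objects handed out by pytest's `tmpdir` and `tmp_path` fixtures both support the `/` operator.

```python
import os
import tempfile
from pathlib import Path

# pathlib.Path stands in for the pytest-provided temporary directory.
tmp_dir = Path(tempfile.mkdtemp())

joined_with_os = os.path.join(str(tmp_dir), "test_invalid_checksum.arff")
joined_with_slash = tmp_dir / "test_invalid_checksum.arff"

same_path = str(joined_with_slash) == joined_with_os
```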
| import scipy.sparse as sp | ||
| import os | ||
| import shutil | ||
| from importlib import resources as importlib_resources |
I think we could just use:
| from importlib import resources as importlib_resources | |
| from importlib import resources |
and then call resources.open_binary / resources.open_text instead of renaming this module everywhere.
Hi @ogrisel, just updated according to your comments:
Build failure: Please let me know if you have further comments. Thanks!
Regarding the failure, I don't think this is a big deal. We need to be sure to run the tests locally (they are passing on my side) but they will be skipped anyway on the CIs.
glemaitre
left a comment
LGTM. These are only suggestions for the documentation, to follow numpydoc style as much as possible.
sklearn/datasets/_base.py
Outdated
| data_file_name : string | ||
| Name of csv file to be loaded from | ||
| module_path/data/data_file_name. For example 'wine_data.csv'. | ||
| data_module/data_file_name. For example 'wine_data.csv'. |
| data_module/data_file_name. For example 'wine_data.csv'. | |
| `data_module/data_file_name`. For example `'wine_data.csv'`. |
sklearn/datasets/_base.py
Outdated
| data_module : string or module, optional | ||
| module where data lives; | ||
| default "sklearn.datasets.data" (DATA_MODULE). |
| data_module : string or module, optional | |
| module where data lives; | |
| default "sklearn.datasets.data" (DATA_MODULE). | |
| data_module : str or module, default="sklearn.datasets.data" | |
| Module where data lives. The default is `"sklearn.datasets.data"`. |
sklearn/datasets/_base.py
Outdated
| module where data lives; | ||
| default "sklearn.datasets.data" (DATA_MODULE). | ||
| descr_file_name : string, optional |
| descr_file_name : string, optional | |
| descr_file_name : str, default=None |
sklearn/datasets/_base.py
Outdated
| (See `load_descr`) Name of rst file to be loaded from | ||
| descr_module/descr_file_name. For example 'wine_data.rst'; | ||
| If not None, also returns the corresponding description of | ||
| the dataset. |
| (See `load_descr`) Name of rst file to be loaded from | |
| descr_module/descr_file_name. For example 'wine_data.rst'; | |
| If not None, also returns the corresponding description of | |
| the dataset. | |
| Name of rst file to be loaded from `descr_module/descr_file_name`. | |
| For example 'wine_data.rst'. See also :func:`load_descr`. | |
| If not None, also returns the corresponding description of | |
| the dataset. |
sklearn/datasets/_base.py
Outdated
| If not None, also returns the corresponding description of | ||
| the dataset. | ||
| descr_module : string or module, optional |
| descr_module : string or module, optional | |
| descr_module : str or module, default="sklearn.datasets.descr" |
sklearn/datasets/_base.py
Outdated
| descr : string, optional | ||
| Description of the dataset (content of descr_file_name). Only returned | ||
| if descr_file_name is not None. |
Can you format it as in the previous function?
sklearn/datasets/_base.py
Outdated
| descr_file_name : string | ||
| Name of rst file to be loaded from | ||
| descr_module/descr_file_name. For example 'wine_data.rst'. | ||
| descr_module : string or module, optional | ||
| module where descr_file_name lives; | ||
| default "sklearn.datasets.descr" (DESCR_MODULE). |
Could you use the docstring of the previous function here as well?
sklearn/datasets/_base.py
Outdated
|
|
||
|
|
||
| def load_descr(descr_file_name, *, descr_module=DESCR_MODULE): | ||
| """Loads descr_file_name from descr_module with importlib.resources. |
| """Loads descr_file_name from descr_module with importlib.resources. | |
| """Load `descr_file_name` from `descr_module` with `importlib.resources`. |
sklearn/datasets/_base.py
Outdated
| Returns | ||
| ------- | ||
| string; content of descr_file_name |
| string; content of descr_file_name | |
| fdescr : str | |
| Content of `descr_file_name`. |
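Putting the docstring suggestions from this review together, `load_descr` might read as follows. This is an assembled sketch, not the merged code: the body is assumed from the `read_text` call shown earlier in the thread, and the wording of the parameter descriptions is pieced together from the suggestions above.

```python
from importlib import resources

DESCR_MODULE = "sklearn.datasets.descr"


def load_descr(descr_file_name, *, descr_module=DESCR_MODULE):
    """Load `descr_file_name` from `descr_module` with `importlib.resources`.

    Parameters
    ----------
    descr_file_name : str
        Name of rst file to be loaded from `descr_module/descr_file_name`.
        For example `'wine_data.rst'`.

    descr_module : str or module, default="sklearn.datasets.descr"
        Module where `descr_file_name` lives.

    Returns
    -------
    fdescr : str
        Content of `descr_file_name`.
    """
    # Body assumed from the read_text call quoted earlier in the review.
    return resources.read_text(descr_module, descr_file_name)


# The one-line summary uses the imperative mood, as numpydoc recommends.
summary_line = load_descr.__doc__.splitlines()[0]
```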
sklearn/datasets/_base.py
Outdated
| filenames = list() | ||
| images = list() |
| filenames = list() | |
| images = list() | |
| filenames, images = [], [] |
@glemaitre Thanks a lot for the review and suggestions! I have modified the documentation accordingly. Let me know if anything else needs changing. Weirdly, the doc checks both seem to have failed at I will wait for further comments before I push to force-run checks again.
I think that you only need to merge
…r and zipapp) sklearn/datasets/_base.py
- change signatures of underlying load functions to reduce various redundancies - allow `load_descr` within `load_csv_data` and `load_gzip_compressed_csv_data` - undo renaming of `importlib.resources` in import - rename module field in returning `Bunch` - other style improvements
Thanks! All green now.
rth
left a comment
Thanks @jackzyliu ! LGTM as well. Merging with two approvals above.
nithish08
left a comment
LGTM
Reference Issues/PRs
Closes #20081
helping #20150
What does this implement/fix? Explain your changes.
Replace __file__ with importlib.resources in sklearn/datasets/*
pyOxidizer and zipapp: importlib.resources.{open, read}_{binary, text}
importlib.resources.path to avoid making the assumption that resources already live in a filesystem
Any other comments?
sklearn/datasets:
Other (should probably be separate PR(s) since __file__ is used slightly differently in these occurrences):