Conversation

@nithish08
Contributor

Fixes #20081

This PR removes the usage of __file__.

@glemaitre changed the title from "Remove __file__" to "MNT remove __file__ and use importlib.resource instead" on May 28, 2021
@glemaitre self-requested a review on May 28, 2021 12:17
@cmarmo
Contributor

cmarmo commented Jun 4, 2021

Hi @nithish08, thanks for your pull request. A number of tests are failing with the error:

E   IsADirectoryError: [Errno 21] Is a directory: '/home/vsts/work/1/s/sklearn/datasets'

It seems the code tries to open a directory, not a file... Could you please have a look? Thanks!

@nithish08
Contributor Author

Hi @cmarmo, I am unsure of how to solve this error. Can you please point me in the right direction?

@glemaitre
Member

glemaitre commented Jun 7, 2021 via email

@cmarmo
Contributor

cmarmo commented Jun 8, 2021

> Hi @cmarmo, I am unsure of how to solve this error. Can you please point me in the right direction?

Hi @nithish08, apparently importlib.resources.path needs to be given a file, not a directory. So you need to point it at a file and then keep only the directory part of the resulting path.
Something like this (for example, in the load_iris() function in _base.py):

from importlib import resources
from os.path import dirname, join

# resources.path() needs a file, so point it at __init__.py and keep its directory
with resources.path('sklearn.datasets', '__init__.py') as f:
    module_path = dirname(f)
data, target, target_names = load_data(module_path, 'iris.csv')
iris_csv_filename = join(module_path, 'data', 'iris.csv')

This works, although I'm not sure it is entirely correct...

@ogrisel
Member

ogrisel commented Jun 8, 2021

Maybe the code snippet proposed by @cmarmo could be factored out into a small private helper function if it needs to be repeated in many places in the scikit-learn code base.

@jackzyliu
Contributor

jackzyliu commented Jun 8, 2021

Hi, hope you don't mind me chiming in here. I am a little curious about what the purpose of replacing __file__ with importlib.resources is if we plan to 1) use .path to point to another random file, 2) get the dir name of that path, and then 3) use the path outside of the context manager?

I have done a similar refactor for another project before, but solely for the purpose of making the codebase runnable as an executable zip archive (.pyz). However, since we plan to use the path outside of the context manager here, the code wouldn't actually work in that scenario. So I think there is definitely some other consideration I am missing here. I hope to learn what it is.
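
The concern about using the path outside of the context manager can be illustrated with a small self-contained sketch. Everything in it is made up for the demonstration: the package name `demo_zip_pkg`, the archive `bundle.zip`, and the function name. It uses `resources.files` and `resources.as_file` (Python 3.9+); to my understanding `resources.path` hands out the same kind of temporary file for zipped packages.

```python
import importlib
import os
import sys
import tempfile
import zipfile
from importlib import resources


def demo_zip_resource_path():
    """Check whether a resource path survives the context manager (demo only)."""
    # Build a throwaway zip archive containing a tiny package with a data file.
    archive = os.path.join(tempfile.mkdtemp(), "bundle.zip")
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("demo_zip_pkg/__init__.py", "")
        zf.writestr("demo_zip_pkg/data/iris.csv", "5.1,3.5,1.4,0.2,0\n")

    sys.path.insert(0, archive)  # make the zipped package importable
    try:
        importlib.invalidate_caches()
        resource = resources.files("demo_zip_pkg") / "data" / "iris.csv"
        with resources.as_file(resource) as path:
            # Inside the block the resource is extracted to a temporary file.
            exists_inside = path.exists()
        # After the block the temporary file has been cleaned up.
        exists_outside = path.exists()
    finally:
        sys.path.remove(archive)
    return exists_inside, exists_outside
```

Calling `demo_zip_resource_path()` should report that the path exists inside the `with` block but not afterwards, which is why keeping a directory name around and joining file names onto it later cannot work for a zipped install.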

@quendee

quendee commented Jun 17, 2021

@nithish08 I see that this is still open. Are you planning on continuing working on this? Or can I give it a go?

@ogrisel
Member

ogrisel commented Jun 17, 2021

@jackzyliu that's a valid concern. It would be interesting to see if we could use importlib.resources to actually get a file object on the resource we actually want to open when loading datasets instead of assuming that this is always backed by a filesystem folder with data files.

This would make it possible to use the scikit-learn dataset when scikit-learn is embedded in a PyOxidizer-generated binary for instance as @rth commented in #20081 (comment). This would mean changing the load_data function instead.
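
A rough sketch of that idea, assuming Python 3.9+ for `resources.files`. The helper name `load_csv_rows` is hypothetical and this is not scikit-learn's actual `load_data`; it only shows reading a CSV resource through a file object instead of building a filesystem path.

```python
import csv
from importlib import resources


def load_csv_rows(package, resource_name):
    """Read a CSV resource shipped inside `package` (hypothetical helper).

    The resource is opened through the Traversable API, so no filesystem
    path is ever constructed by the caller.
    """
    resource = resources.files(package).joinpath(resource_name)
    with resource.open("r", encoding="utf-8") as csv_file:
        return list(csv.reader(csv_file))
```

Because the data is consumed entirely inside the `with` block, nothing depends on the resource being a real file once the function returns.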

@jackzyliu
Contributor

jackzyliu commented Jun 17, 2021

> @jackzyliu that's a valid concern. It would be interesting to see if we could use importlib.resources to actually get a file object on the file we actually want to open when loading datasets instead of assuming that this is always backed by a filesystem folder with data files.
>
> This would make it possible to use the scikit-learn dataset when scikit-learn is embedded in a PyOxidizer-generated binary for instance as @rth commented in #20081 (comment). This would mean changing the load_data function instead.

@ogrisel @rth I completely agree. A few weeks back I started on a version meant to be compatible with PyOxidizer and zipapp, but I only finished _base.py. Let me open a separate PR to show you what I have so far, so that we can discuss 1) whether the code change makes sense to you and 2) how to actually test compatibility with PyOxidizer and zipapp (probably something like this).

A slightly thorny issue is that functions like load_sample_images may require importlib.resources.files, which I think is only available from Python 3.9 onward. So we will need to use its backport library importlib_resources instead. Is that something you are comfortable with?

Edit: Just put up PR #20297 for discussion. @nithish08 @quendee feel free to chime in and/or apply similar changes to other files.

@ogrisel
Member

ogrisel commented Jun 18, 2021

> A slightly thorny issue is that functions like load_sample_images may require importlib.resources.files, which I think is only available from Python 3.9 onward. So we will need to use its backport library importlib_resources instead. Is that something you are comfortable with?

I would rather not add a dependency. Instead, write code that uses Traversable-only constructs whenever possible if the Python version is recent enough, and fall back to filesystem calls otherwise. Add a note explaining that the filesystem branch of the code can be removed once we no longer support Python 3.8.
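
One way this version-gated approach could look, as a sketch only. The function name `open_package_data` is hypothetical, and the fallback branch assumes the package is a plain directory on disk.

```python
import importlib
import sys
from os.path import dirname, join


def open_package_data(package, resource_name):
    """Open a data file shipped inside `package` in binary mode (hypothetical)."""
    if sys.version_info >= (3, 9):
        from importlib import resources

        # Traversable-only branch: also works when the package is not
        # backed by a plain folder (e.g. a zip archive).
        return resources.files(package).joinpath(resource_name).open("rb")

    # Filesystem fallback for Python 3.8; this branch can be removed once
    # Python 3.8 is no longer supported.
    module = importlib.import_module(package)
    return open(join(dirname(module.__file__), resource_name), "rb")
```

Callers would use it as a context manager, for example `with open_package_data("json", "__init__.py") as f: payload = f.read()`, so that the same call site works on both branches.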

@nithish08
Contributor Author

@quendee Please work on this issue if you would like to.

@quendee

quendee commented Jun 20, 2021

@nithish08 I think @jackzyliu is working on this, following @ogrisel's suggestion.

@glemaitre
Member

closing in favor of #20297

@glemaitre closed this on Jun 22, 2021
Development

Successfully merging this pull request may close these issues.

__file__ should be avoided in library code

6 participants