S3 artifact store: fix path resolution error when artifact root is bucket root #928
mlflow/store/s3_artifact_repo.py
Outdated
    infos.append(FileInfo(name, False, size))
    file_path = obj.get("Key")
    if not file_path.startswith(artifact_path):
        raise ValueError(
It is somewhat challenging to test that this is raised because the S3 client paginator is difficult to mock. If we think that this test case is particularly important, I can try to figure something out.
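One possible way to exercise the error path without a real S3 client is a small hand-written stand-in for the paginator. The sketch below is an assumption-laden illustration, not the test from this PR: `FakePaginator`, `FakeS3Client`, and `list_file_keys` are hypothetical names, and `list_file_keys` is a simplified stand-in for the listing loop under review. It relies only on the fact that `list_objects_v2` pages carry a `Contents` list whose entries have a `Key`.

```python
class FakePaginator:
    """Hypothetical stand-in for a boto3 paginator: replays canned pages."""

    def __init__(self, pages):
        self._pages = pages

    def paginate(self, **kwargs):
        # Ignores Bucket/Prefix/Delimiter and yields the canned pages in order.
        return iter(self._pages)


class FakeS3Client:
    """Hypothetical stand-in for the S3 client: only get_paginator is provided."""

    def __init__(self, pages):
        self._pages = pages

    def get_paginator(self, operation_name):
        assert operation_name == "list_objects_v2"
        return FakePaginator(self._pages)


def list_file_keys(s3_client, bucket, prefix, artifact_path):
    """Simplified, hypothetical version of the listing loop (files only)."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for obj in page.get("Contents", []):
            file_path = obj.get("Key")
            if not file_path.startswith(artifact_path):
                # Illustrative message; the PR's exact wording is not reproduced here.
                raise ValueError(
                    "Listed key {key!r} does not start with artifact path {root!r}".format(
                        key=file_path, root=artifact_path
                    )
                )
            keys.append(file_path)
    return keys


good_pages = [{"Contents": [{"Key": "root/a.txt"}, {"Key": "root/b.txt"}]}]
bad_pages = [{"Contents": [{"Key": "elsewhere/a.txt"}]}]
```

With this stand-in, a test can assert that `list_file_keys(FakeS3Client(bad_pages), "bucket", "root/", "root")` raises, while the `good_pages` case returns both keys.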
mlflow/store/s3_artifact_repo.py
Outdated
    infos.append(FileInfo(subdir, True, None))
    subdir_path = obj.get("Prefix")
    if not subdir_path.startswith(artifact_path):
        raise ValueError(
Nit: why raise a ValueError? Can we raise MlflowException instead?
Changed to MlflowException!
mlflow/store/s3_artifact_repo.py
Outdated
    if not file_path.startswith(artifact_path):
        raise ValueError(
            "The path of the listed S3 file does not begin with the specified"
            " artifact path. Artifact path: {artifact_path}. File path:"
Is there any way this duplicated code can be unified? I know the path is extracted differently in each case (one is a file, the other a directory). In fact, if there is a way to share some of this between the different blob stores, that would be awesome, so that changes to internal methods don't require changes all over the place.
For now, I've deduped the code inside S3ArtifactRepository by defining a static _verify_listed_object_contains_artifact_path_prefix() method. It may be reasonable to create an abstract class for blob stores that follow this pattern, but it seems we agree that this may be a larger undertaking that requires a followup PR.
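As a rough sketch of what such a shared check could look like: the method name below is the one mentioned above, but the signature, the error message, and the enclosing class name are assumptions made for illustration. `MlflowException` is redefined locally as a stand-in so the snippet runs without MLflow installed.

```python
class MlflowException(Exception):
    """Local stand-in for MLflow's exception type (assumption for this sketch)."""


class S3ArtifactRepositorySketch:
    """Hypothetical container class; not the real S3ArtifactRepository."""

    @staticmethod
    def _verify_listed_object_contains_artifact_path_prefix(listed_object_path, artifact_path):
        # Shared by the file ("Key") and directory ("Prefix") branches of the listing loop.
        if not listed_object_path.startswith(artifact_path):
            # Illustrative message; the PR's exact wording is not reproduced here.
            raise MlflowException(
                "Listed S3 object path {object_path!r} does not begin with the"
                " artifact path {artifact_path!r}.".format(
                    object_path=listed_object_path, artifact_path=artifact_path
                )
            )


verify = S3ArtifactRepositorySketch._verify_listed_object_contains_artifact_path_prefix
verify("some/root/model/MLmodel", "some/root")  # passes silently
verify("model/MLmodel", "")                     # bucket-root case: every key matches ""
```

Both the file and the directory branch would then call the same helper with `obj.get("Key")` or `obj.get("Prefix")` respectively.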
    s3_client = self._get_s3_client()
    paginator = s3_client.get_paginator("list_objects_v2")
    results = paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/')
    for result in results:
Let's write some sort of checker to make sure that the nesting is legitimate. For instance, it is possible to have this sort of directory structure in S3:
    dir_name # is a true directory
    dir_name/sub_dir # this one is a file
    dir_name/sub_dir/file # this is also a file
After offline discussion, it appears that we don't currently have a mechanism for checking / enforcing that S3 object keys containing slashes are directories. We should definitely agree on a strategy for dealing with files whose keys contain slashes, but this is a bit outside the scope of the current PR.
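For reference in a follow-up, here is a minimal sketch of one way such a check might work. It is not part of this PR: `find_file_directory_conflicts` is a hypothetical helper that takes a plain list of object keys and reports every key that is stored as an object while also acting as a "directory" prefix of another key.

```python
def find_file_directory_conflicts(keys):
    """Return the sorted keys that are both an object and a prefix of another object.

    Hypothetical helper: operates on an in-memory list of S3 object keys.
    """
    key_set = set(keys)
    conflicts = set()
    for key in key_set:
        parts = key.split("/")
        # Walk every proper ancestor of the key: "a/b/c" -> "a", "a/b".
        for depth in range(1, len(parts)):
            ancestor = "/".join(parts[:depth])
            if ancestor in key_set:
                conflicts.add(ancestor)
    return sorted(conflicts)


# The structure from the comment above: sub_dir is an object, yet it also
# appears as a directory component of another object's key.
example_keys = ["dir_name/sub_dir", "dir_name/sub_dir/file"]
conflicts = find_file_directory_conflicts(example_keys)
```

For the example keys, the helper flags `dir_name/sub_dir`. How to treat such keys (reject, skip, or list them as files) is the strategy question left open above.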
You'll have to resolve the conflict, but the changes LGTM. Thanks for deduping the code and migrating the exception raised.
When attempting to download a directory artifact from an S3-based artifact repository, paths are truncated if the repository's artifact URI is the URI of the S3 bucket root. This PR fixes the issue by using relpath() rather than filename slicing. The cause of failure is identical to the one that affected the Azure store and was addressed by #769.
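To illustrate the failure mode, here is a minimal sketch. It assumes the slicing took the form `key[len(artifact_path) + 1:]` (skipping the artifact path plus one separator) and uses `posixpath.relpath` for the replacement. Both helper names are hypothetical.

```python
import posixpath


def relative_by_slicing(key, artifact_path):
    # Assumed shape of the slicing approach: drop the artifact path and one "/".
    return key[len(artifact_path) + 1:]


def relative_by_relpath(key, artifact_path):
    # relpath computes the relative path, so an empty root needs no special case.
    return posixpath.relpath(path=key, start=artifact_path)


# Artifact root is a sub-path of the bucket: both approaches agree.
nested_slice = relative_by_slicing("some/root/model/MLmodel", "some/root")
nested_rel = relative_by_relpath("some/root/model/MLmodel", "some/root")

# Artifact root is the bucket root (empty path): slicing drops the first character.
root_slice = relative_by_slicing("model/MLmodel", "")  # "odel/MLmodel"
root_rel = relative_by_relpath("model/MLmodel", "")    # "model/MLmodel"
```

With an empty artifact path there is no leading separator to skip, so the `+ 1` in the slice removes a real character from the key.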