[Spark] Normalize dataset names with configurable trimmers #3996
Conversation
Force-pushed 7e6cc89 to aed8d72
Force-pushed 4317bb2 to 77abf21
Force-pushed dff7a69 to de15546
Force-pushed d3c5629 to 41a3ac1
Force-pushed 17dc460 to 4fcdaec
Reviewed code:

    ...
    "inputs": {
      "facets": {
        "dataSource": {
This probably should be datasetSubset
@dolfinus This file shall be removed. I'll create a separate PR to document subset definition facets. Thanks for spotting this.
Reviewed code:

    /**
     * Returns the last path of the dataset name.
What is the last path of a dataset name?
If a dataset name is a path, then the last path should return the trailing directory.
Maybe path -> part? That's what Python (https://docs.python.org/3/library/pathlib.html#accessing-individual-parts) and Java (https://docs.oracle.com/javase/8/docs/api/java/nio/file/Paths.html) use.
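For illustration, a minimal sketch of the "last part" semantics discussed in this thread (a hypothetical helper, not the actual OpenLineage code):

```java
// Hypothetical sketch: for a path-like dataset name, the "last part" is the
// trailing segment after the final '/'.
public class LastPartExample {
    static String getLastPart(String name) {
        // ignore a single trailing slash, then take everything after the last '/'
        String n = name.endsWith("/") ? name.substring(0, name.length() - 1) : name;
        int idx = n.lastIndexOf('/');
        return idx >= 0 ? n.substring(idx + 1) : n;
    }

    public static void main(String[] args) {
        System.out.println(getLastPart("/tmp/some/tested-path/20250721")); // 20250721
        System.out.println(getLastPart("table_name")); // table_name
    }
}
```

A name without any '/' (e.g. a table name) is returned unchanged, which matters for the non-path cases discussed later in this review.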
Reviewed code:

    /**
     * Given another dataset, it reduces and merges the two datasets. If the paths of the two datasets
What do you mean by reducing? What is trimming vs reducing here? How are those getting merged? I think those terms need to be clearly defined.
Trimming is a string operation that trims the dataset name. Reducing works on a collection of datasets: if the collection has datasets with the same trimmed name and the same facets, those datasets should be merged into a single dataset with the trimmed name. The merging process is a reducer. I'll add more comments in the code.
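The distinction can be illustrated with a small sketch (assumed names, not the real DatasetNameTrimmer/DatasetReducer classes): trimming is a per-name string operation, while reducing groups a collection of datasets by trimmed name and merges each group.

```java
import java.util.*;
import java.util.stream.*;

public class TrimReduceSketch {
    // assumed trimmer: strip a trailing /yyyyMMdd partition directory
    static String trim(String name) {
        return name.replaceAll("/\\d{8}$", "");
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
            "/tmp/data/20250721", "/tmp/data/20250722", "/tmp/other/20250721");
        // "reduce": datasets whose trimmed names match collapse into one group;
        // the real reducer additionally requires matching facets before merging
        Map<String, List<String>> merged = names.stream()
            .collect(Collectors.groupingBy(
                TrimReduceSketch::trim, TreeMap::new, Collectors.toList()));
        System.out.println(merged.keySet()); // [/tmp/data, /tmp/other]
    }
}
```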
Reviewed code:

    /**
     * Trims the directory if the last path part is a string representing a date in the format /yyyy/MM or /yyyy/MM/dd
     */
    public class MultiDirTrimmer implements DatasetNameTrimmer {
Suggested change:

    - public class MultiDirTrimmer implements DatasetNameTrimmer {
    + public class MultiDirDateTrimmer implements DatasetNameTrimmer {

The current name can be misleading.
Reviewed code:

    .map(((Class<InputDataset>) InputDataset.class)::cast))
    .orElse(Stream.empty()))
    .collect(Collectors.toList());
    datasets =
Not for this PR, but it would be amazing if someday we could refactor the integration into something that has well-defined places where extraction and later processing of individual parts of the eventual run event take place. Right now we're doing "random" modifications and sticking them in "random" places in the codebase.
This is confusing to someone who wants to add or modify handling of a particular logical plan node, as it's not well defined what will happen to the results of that change, and to someone who wants to apply some more global processing. It's even confusing to someone who has seen this code a decent amount of time (like me). To go down a few levels, it would be best if the OpenLineageRunEventBuilder class did not do any processing, but just assembled the event from given parts.
Great suggestion!
…S3 locations) Signed-off-by: Pawel Leszczynski <[email protected]>
Force-pushed 0be4ac1 to 54656f1
@dolfinus @mobuchowski I applied changes from your comments within the second commit.
Reviewed code:

    @Override
    public boolean canTrim(String name) {
      String lastPart = getLastPart(name);
IMHO this trimmer could detect the first part containing the = symbol and trim it together with the rest of the string, instead of iterating over the last path part until everything is trimmed. But the current approach works as well.
The mechanism needs to support mixed trimming like /path/key=value/20250930/key2=value2, so for the sake of simplicity a trimmer only has to trim the last part. At least, I didn't want to implement the same thing for all the trimmers.
MultiDirDateTrimmer already works with several parts, not just the last one.
My concern is that for a path like /path/key=value/randomvalue/key2=value2 the dataset name will be /path/key=value/randomvalue, not /path. But this is not a real case; IMHO the whole mixed naming convention for paths (both partitioning by key=value and by yyyy/mm/dd) is something that can't appear in real paths, it's either one or the other. But I may be wrong :)
Yes, /path/key=value/not-date-value/key2=value2 will result in /path/key=value/not-date-value. This is intentional, as I did not want to trim parts the mechanism does not understand. Let's leave it for later ;)
But /path/key=value/20250930/key2=value2 will be trimmed to /path - this is implemented within trimDatasetName of io.openlineage.client.dataset.partition.ReducedDataset.
Ok, let's keep current implementation for now
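The last-part-at-a-time loop described in this thread can be sketched roughly as follows (the pattern checks are simplified assumptions, not the actual trimmer classes):

```java
// Illustrative loop: repeatedly strip the last path part while any
// "trimmer" recognizes it; stop at the first unrecognized part.
public class IterativeTrimSketch {
    // assumed recognizers: key=value partition dirs and yyyyMMdd date dirs
    static boolean canTrimLastPart(String part) {
        return part.contains("=") || part.matches("\\d{8}");
    }

    static String trimDatasetName(String name) {
        while (true) {
            int idx = name.lastIndexOf('/');
            if (idx <= 0) return name; // nothing left to split off
            String last = name.substring(idx + 1);
            if (!canTrimLastPart(last)) return name; // unrecognized part stops trimming
            name = name.substring(0, idx);
        }
    }

    public static void main(String[] args) {
        System.out.println(trimDatasetName("/path/key=value/20250930/key2=value2")); // /path
    }
}
```

This also reproduces the behavior discussed above: /path/key=value/not-date-value/key2=value2 stops at /path/key=value/not-date-value, because the unrecognized part halts the loop.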
.../java/src/test/java/io/openlineage/client/dataset/partition/trimmer/KeyValueTrimmerTest.java
Force-pushed 54656f1 to 99eb0bf
Signed-off-by: Pawel Leszczynski <[email protected]>
Force-pushed 99eb0bf to 30013b7
client/java/src/main/java/io/openlineage/client/dataset/partition/DatasetReducer.java
Reviewed code:

    // one with each other
    List<ReducedDataset> reducedDatasets = new ArrayList<>();
    for (List<ReducedDataset> sameNameList : toReduce.values()) {
      AtomicBoolean reducedSomething = new AtomicBoolean(true);
Is an atomic variable required? It's not an object- or class-level attribute, so how can other threads access it?
no, it's not required
Reviewed code:

    // get three last parts and verify if they're a date
    String lastTwoParts = dirs[dirs.length - 3] + dirs[dirs.length - 2] + dirs[dirs.length - 1];
Suggested change:

    - String lastTwoParts = dirs[dirs.length - 3] + dirs[dirs.length - 2] + dirs[dirs.length - 1];
    + String lastThreeParts = dirs[dirs.length - 3] + dirs[dirs.length - 2] + dirs[dirs.length - 1];
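The three-segment date check being renamed here could look roughly like this (assumed behavior: trailing segments such as .../2025/07/21 concatenate to a yyyyMMdd date):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class DatePartsCheck {
    // Sketch, not the actual MultiDirTrimmer code: join the last three path
    // segments and test whether the result parses as a yyyyMMdd date.
    static boolean lastThreePartsAreDate(String[] dirs) {
        if (dirs.length < 3) return false;
        String lastThreeParts =
            dirs[dirs.length - 3] + dirs[dirs.length - 2] + dirs[dirs.length - 1];
        try {
            LocalDate.parse(lastThreeParts, DateTimeFormatter.BASIC_ISO_DATE); // yyyyMMdd
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(lastThreePartsAreDate(new String[]{"tmp", "2025", "07", "21"})); // true
    }
}
```

Parsing with LocalDate also rejects nonsense like 2025/13/45, which a pure regex check would accept.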
Reviewed code:

    }
    return openLineage
        .newOutputDatasetBuilder()
        .name(r.getTrimmedDatasetName())
Trimming should probably be applied to the columnLineage facet as well, to keep valid references to the input dataset. Maybe in another PR.
That's a great catch. Thank you!
I'll prepare another PR for that.
Reviewed code:

    @Test
    void testTrim() {
      assertThat(trimmer.canTrim("/tmp/20250721")).isTrue();
Please add test cases where the dataset name doesn't start with / (S3 object names) and doesn't contain / at all (table names; some may include a date as part of the name).
Reviewed code:

    new DatasetReducer(
            openLineageContext.getOpenLineage(),
            openLineageContext.getOpenLineageConfig().getDatasetConfig())
        .reduceInputs(datasets);
The PR title and description should be adjusted to mention that not only object storage is handled here: S3 is object storage, HDFS is not, and both are trimmed.
Trimmers are applied to all dataset names, even those that don't contain any paths, like table names, but in that case the dataset name is left intact.
Force-pushed 51b8ac6 to 5196219
Reviewed code:

    @Test
    void testTrimNonPath() {
      assertThat(trimmer.canTrim("tmp/20250721")).isFalse();
So trimmers are not applied to object storage paths which do not start with /, e.g. S3?
Oh, I forgot an S3 name does not start with /.
I changed the trimmers so that canTrim returns false if trimming would result in an empty string.
This also covers table names which may match trimming conditions.
Ok, but please add tests for S3 paths which are not trimmed to an empty string.
ok, makes sense. I added that.
Thanks!
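The empty-result guard agreed on in this thread might look roughly like this (a hypothetical sketch, not the actual trimmer code; the input strings here are illustrative):

```java
// Sketch: refuse to trim when trimming would leave an empty dataset name.
// This protects S3-style names without a leading '/' and plain table names.
public class EmptyResultGuard {
    // assumed trim step: drop everything after the last '/'
    static String trimLast(String name) {
        int idx = name.lastIndexOf('/');
        return idx >= 0 ? name.substring(0, idx) : "";
    }

    static boolean canTrim(String name) {
        // guard: trimming that empties the name is rejected
        return !trimLast(name).isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(canTrim("bucket/data/20250721")); // true
        System.out.println(canTrim("20250721"));             // false
    }
}
```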
Signed-off-by: Pawel Leszczynski <[email protected]>
Force-pushed 5196219 to 36a7eb3
Signed-off-by: Pawel Leszczynski <[email protected]>
dolfinus left a comment:
@pawel-big-lebowski Thanks, nice feature!

The PR introduces the subset definition facets proposed in the partitioning proposal. It implements the logic of merging input datasets for Spark. For example, the input datasets

    /tmp/some/tested-path/20250721/20250722T901Z/key=value
    /tmp/some/tested-path/20250720/20250721T901Z/key=value

will be merged into a /tmp/some/tested-path dataset with a subset definition facet containing a list of locations. The merge will happen only if the datasets have all the same facets. This can solve the problem of excessively large OpenLineage events.

Missing things in the PR are noted with a TODO comment.