Collaborator

@pawel-big-lebowski pawel-big-lebowski commented Aug 26, 2025

This PR introduces the subset definition facets proposed in the partitioning proposal.

It implements the logic for merging input datasets in Spark. For example, the input datasets:
/tmp/some/tested-path/20250721/20250722T901Z/key=value
/tmp/some/tested-path/20250720/20250721T901Z/key=value
will be merged into a single /tmp/some/tested-path dataset with a subset definition facet containing the list of locations.

Datasets are merged only if all of their facets are identical. This can address the problem of oversized OpenLineage events.

Missing pieces in this PR are marked with TODO comments.

@boring-cyborg boring-cyborg bot added area:client/java openlineage-java area:client/python openlineage-python area:spec Specifications and standards for the project area:tests Testing code language:java Uses Java programming language language:python Uses Python programming language labels Aug 26, 2025
@pawel-big-lebowski pawel-big-lebowski force-pushed the spec/subset-facet-in-one-schema-file branch 5 times, most recently from 7e6cc89 to aed8d72 Compare August 26, 2025 14:29
@JDarDagran JDarDagran force-pushed the spec/subset-facet-in-one-schema-file branch 3 times, most recently from 4317bb2 to 77abf21 Compare August 27, 2025 10:34
@pawel-big-lebowski pawel-big-lebowski force-pushed the spec/subset-facet-in-one-schema-file branch 2 times, most recently from dff7a69 to de15546 Compare August 27, 2025 12:53
@JDarDagran JDarDagran mentioned this pull request Aug 27, 2025
8 tasks
@pawel-big-lebowski pawel-big-lebowski force-pushed the spec/subset-facet-in-one-schema-file branch 7 times, most recently from d3c5629 to 41a3ac1 Compare August 28, 2025 09:22
@pawel-big-lebowski pawel-big-lebowski force-pushed the spec/subset-facet-in-one-schema-file branch 3 times, most recently from 17dc460 to 4fcdaec Compare August 28, 2025 12:14
@pawel-big-lebowski pawel-big-lebowski marked this pull request as ready for review September 26, 2025 14:20
@pawel-big-lebowski pawel-big-lebowski requested a review from a team as a code owner September 26, 2025 14:20
...
"inputs": {
"facets": {
"dataSource": {
Contributor

This probably should be datasetSubset

Collaborator Author

@dolfinus This file will be removed. I'll create a separate PR to document the subset definition facets. Thanks for spotting this.

}

/**
* Returns the last path of the dataset name.
Member

What is the last path of a dataset name?

Collaborator Author

If a dataset name is a path, then the last path is its trailing directory.

Member

}

/**
* Given another dataset, it reduces and merges the two datasets. If the paths of the two datasets
Member

What do you mean by reducing? What is trimming vs reducing here? How are those getting merged? I think those terms need to be clearly defined.

Collaborator Author

Trimming is a string operation that trims a dataset name. Reducing works on a collection of datasets: if the collection contains datasets with the same trimmed name and the same facets, those datasets should be merged into a single dataset with the trimmed name. The merging process is a reducer. I'll add more comments in the code.
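A minimal Java sketch of the trim-then-reduce idea described in this comment, not the code from this PR. `ReduceSketch`, `Dataset`, `Merged`, and the `trimmer` parameter are hypothetical names, and facets are modeled as a plain map for brevity:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Illustrative only: hypothetical types, not the classes added in this PR. */
final class ReduceSketch {

  /** A dataset reduced to what matters here: its name and its facets. */
  static final class Dataset {
    final String name;
    final Map<String, String> facets;

    Dataset(String name, Map<String, String> facets) {
      this.name = name;
      this.facets = facets;
    }
  }

  /** Result of a merge: the trimmed name, the shared facets, and the original locations. */
  static final class Merged {
    final String trimmedName;
    final Map<String, String> facets;
    final List<String> locations = new ArrayList<>();

    Merged(String trimmedName, Map<String, String> facets) {
      this.trimmedName = trimmedName;
      this.facets = facets;
    }
  }

  /** Groups datasets by (trimmed name, facets); each group becomes one merged dataset. */
  static List<Merged> reduce(List<Dataset> datasets, UnaryOperator<String> trimmer) {
    Map<List<Object>, Merged> groups = new LinkedHashMap<>();
    for (Dataset d : datasets) {
      String trimmed = trimmer.apply(d.name); // trimming: a pure string operation
      List<Object> key = Arrays.<Object>asList(trimmed, d.facets);
      // reducing: datasets sharing the same key collapse into a single entry
      groups.computeIfAbsent(key, k -> new Merged(trimmed, d.facets)).locations.add(d.name);
    }
    return new ArrayList<>(groups.values());
  }
}
```

In this sketch, datasets whose facets differ stay separate even when their trimmed names match, which mirrors the "same trimmed name and same facets" condition above.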

/**
* Trims directory if last path is a string represents a date in a format /yyyy/MM or /yyyy/MM/dd
*/
public class MultiDirTrimmer implements DatasetNameTrimmer {
Contributor

@dolfinus dolfinus Sep 29, 2025

Suggested change:
- public class MultiDirTrimmer implements DatasetNameTrimmer {
+ public class MultiDirDateTrimmer implements DatasetNameTrimmer {

The current name can be misleading.

.map(((Class<InputDataset>) InputDataset.class)::cast))
.orElse(Stream.empty()))
.collect(Collectors.toList());
datasets =
Member

Not for this PR, but it would be amazing if someday we could refactor the integration to something that has well-defined places where extraction and later processing of individual parts of the eventual run event takes place. Right now we're doing "random" modifications and sticking them in "random" places in the codebase.

This is confusing to someone who wants to add or modify the handling of a particular logical plan node, as it's not well defined what will happen to the results of that change, and to someone who wants to apply some more global processing. It's even confusing to someone who has spent a decent amount of time with this code (like me). To go down a few levels, it would be best if the OpenLineageRunEventBuilder class did no processing at all and just assembled the event from the given parts.

Collaborator Author

Great suggestion!

@pawel-big-lebowski pawel-big-lebowski force-pushed the spec/subset-facet-in-one-schema-file branch 3 times, most recently from 0be4ac1 to 54656f1 Compare September 30, 2025 11:15
@pawel-big-lebowski
Collaborator Author

@dolfinus @mobuchowski I applied changes from your comments within the second commit.


@Override
public boolean canTrim(String name) {
String lastPart = getLastPart(name);
Contributor

@dolfinus dolfinus Sep 30, 2025

IMHO this trimmer could detect the first part containing the = symbol and trim it together with the rest of the string, instead of iterating over the last path part until everything is trimmed. But the current approach works as well.

Collaborator Author

The mechanism needs to support mixed trimming, like /path/key=value/20250930/key2=value2, so for the sake of simplicity a trimmer only has to trim the last part. At least, I didn't want to implement the same thing for all the trimmers.

Contributor

@dolfinus dolfinus Sep 30, 2025

MultiDirDateTrimmer already works with several parts, not just the last one.

My concern is that for a path like /path/key=value/randomvalue/key2=value2 the dataset name will be /path/key=value/randomvalue, not /path. But this is not a real case. IMHO mixed naming conventions for paths (partitioning by both key=value and yyyy/mm/dd) can't appear in real paths: it's either one or the other. But I may be wrong :)

Collaborator Author

@pawel-big-lebowski pawel-big-lebowski Sep 30, 2025

Yes, /path/key=value/not-date-value/key2=value2 will result in /path/key=value/not-date-value. This is intentional, as I didn't want to trim parts the mechanism does not understand. Let's leave it for later ;)

But /path/key=value/20250930/key2=value2 will be trimmed to /path. This is implemented in trimDatasetName of io.openlineage.client.dataset.partition.ReducedDataset
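The behavior discussed in this thread can be sketched as a loop that keeps applying last-part trimmers until none of them matches. This is a simplified illustration under assumptions, not the PR's `trimDatasetName`: `TrimChainSketch`, `LastPartTrimmer`, and the two matching rules (a part containing `=`, or an eight-digit part) are hypothetical, and guarding against empty results is left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Illustrative only: a simplified stand-in for the trimmers discussed in this PR. */
final class TrimChainSketch {

  /** Hypothetical trimmer contract: each trimmer looks only at the last path part. */
  interface LastPartTrimmer {
    boolean canTrim(String name);

    String trim(String name);
  }

  static String lastPart(String name) {
    return name.substring(name.lastIndexOf('/') + 1);
  }

  static String dropLastPart(String name) {
    int slash = name.lastIndexOf('/');
    return slash < 0 ? "" : name.substring(0, slash);
  }

  /** Builds a trimmer that drops the last part whenever the predicate matches it. */
  static LastPartTrimmer whenLastPart(Predicate<String> matches) {
    return new LastPartTrimmer() {
      @Override
      public boolean canTrim(String name) {
        return matches.test(lastPart(name));
      }

      @Override
      public String trim(String name) {
        return dropLastPart(name);
      }
    };
  }

  // Assumed rules for this sketch: key=value partitions and compact yyyyMMdd-like parts.
  static final LastPartTrimmer KEY_VALUE = whenLastPart(p -> p.contains("="));
  static final LastPartTrimmer COMPACT_DATE = whenLastPart(p -> p.matches("\\d{8}"));

  /** Applies the trimmers repeatedly until a full pass trims nothing. */
  static String trimAll(String name, List<LastPartTrimmer> trimmers) {
    String current = name;
    boolean trimmed = true;
    while (trimmed) {
      trimmed = false;
      for (LastPartTrimmer t : trimmers) {
        if (t.canTrim(current)) {
          current = t.trim(current);
          trimmed = true;
        }
      }
    }
    return current;
  }

  static String trimAll(String name) {
    return trimAll(name, Arrays.asList(KEY_VALUE, COMPACT_DATE));
  }
}
```

With these assumed rules, trimming stops at the first part that no trimmer recognizes, so an unknown directory in the middle of the path protects everything to its left.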

Contributor

Ok, let's keep current implementation for now

@pawel-big-lebowski pawel-big-lebowski force-pushed the spec/subset-facet-in-one-schema-file branch from 54656f1 to 99eb0bf Compare September 30, 2025 12:23
Signed-off-by: Pawel Leszczynski <[email protected]>
@pawel-big-lebowski pawel-big-lebowski force-pushed the spec/subset-facet-in-one-schema-file branch from 99eb0bf to 30013b7 Compare September 30, 2025 13:11
// one with each other
List<ReducedDataset> reducedDatasets = new ArrayList<>();
for (List<ReducedDataset> sameNameList : toReduce.values()) {
AtomicBoolean reducedSomething = new AtomicBoolean(true);
Contributor

Is an atomic variable required? It's not an object- or class-level attribute, so how could other threads access it?

Collaborator Author

No, it's not required.
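For illustration, a "repeat until nothing was reduced" loop can be written with a plain local boolean when the flag is only touched by the current thread and is not mutated from inside a lambda. This is a generic sketch, not the loop from this PR; `FixedPointReduce`, `canMerge`, and `merge` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;
import java.util.function.BinaryOperator;

/** Illustrative only: pairwise merging until a fixed point, using a plain boolean flag. */
final class FixedPointReduce {

  static <T> List<T> reduce(List<T> items, BiPredicate<T, T> canMerge, BinaryOperator<T> merge) {
    List<T> result = new ArrayList<>(items);
    boolean reducedSomething = true; // local variable: no other thread can observe it
    while (reducedSomething) {
      reducedSomething = false;
      search:
      for (int i = 0; i < result.size(); i++) {
        for (int j = i + 1; j < result.size(); j++) {
          if (canMerge.test(result.get(i), result.get(j))) {
            T merged = merge.apply(result.get(i), result.get(j));
            result.remove(j); // j > i, so index i is unaffected by the removal
            result.set(i, merged);
            reducedSomething = true;
            break search; // restart the scan with the shortened list
          }
        }
      }
    }
    return result;
  }
}
```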

}

// get three last parts and verify if they're a date
String lastTwoParts = dirs[dirs.length - 3] + dirs[dirs.length - 2] + dirs[dirs.length - 1];
Contributor

Suggested change:
- String lastTwoParts = dirs[dirs.length - 3] + dirs[dirs.length - 2] + dirs[dirs.length - 1];
+ String lastThreeParts = dirs[dirs.length - 3] + dirs[dirs.length - 2] + dirs[dirs.length - 1];
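As a side illustration, the /yyyy/MM/dd and /yyyy/MM checks can also be done part by part instead of on a concatenated string. The sketch below is an alternative under assumptions, not the PR's implementation; `MultiDirDateSketch` and `trailingDateParts` are hypothetical names.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.time.YearMonth;

/** Illustrative only: detects a trailing yyyy/MM/dd or yyyy/MM directory suffix. */
final class MultiDirDateSketch {

  /** Returns 3 for a trailing yyyy/MM/dd, 2 for a trailing yyyy/MM, otherwise 0. */
  static int trailingDateParts(String name) {
    String[] dirs = name.split("/");
    int n = dirs.length;
    if (n >= 3 && isDate(dirs[n - 3], dirs[n - 2], dirs[n - 1])) {
      return 3;
    }
    if (n >= 2 && isYearMonth(dirs[n - 2], dirs[n - 1])) {
      return 2;
    }
    return 0;
  }

  private static boolean isDate(String year, String month, String day) {
    // Check each part's width separately so that digits cannot shift between parts.
    if (!year.matches("\\d{4}") || !month.matches("\\d{2}") || !day.matches("\\d{2}")) {
      return false;
    }
    try {
      LocalDate.of(Integer.parseInt(year), Integer.parseInt(month), Integer.parseInt(day));
      return true;
    } catch (DateTimeException e) {
      return false; // e.g. month 13 or February 30
    }
  }

  private static boolean isYearMonth(String year, String month) {
    if (!year.matches("\\d{4}") || !month.matches("\\d{2}")) {
      return false;
    }
    try {
      YearMonth.of(Integer.parseInt(year), Integer.parseInt(month));
      return true;
    } catch (DateTimeException e) {
      return false;
    }
  }
}
```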

}
return openLineage
.newOutputDatasetBuilder()
.name(r.getTrimmedDatasetName())
Contributor

@dolfinus dolfinus Sep 30, 2025

Trimming should probably be applied to columnLineage facet as well, to keep valid references to input dataset. Maybe in another PR

Collaborator Author

That's a great catch. Thank you!

Collaborator Author

I'll prepare another PR for that.


@Test
void testTrim() {
assertThat(trimmer.canTrim("/tmp/20250721")).isTrue();
Contributor

@dolfinus dolfinus Sep 30, 2025

Please add test cases where the dataset name doesn't start with / (S3 object names) and where it doesn't contain / at all (table names, some of which may include a date as part of the name).

new DatasetReducer(
openLineageContext.getOpenLineage(),
openLineageContext.getOpenLineageConfig().getDatasetConfig())
.reduceInputs(datasets);
Contributor

@dolfinus dolfinus Sep 30, 2025

The PR title and description should be adjusted to mention that not only object storage is handled here: S3 is an object storage, HDFS is not, and both are trimmed.
Trimmers are applied to all dataset names, even those that don't contain any paths, like table names, but in that case the dataset name is left intact.

@pawel-big-lebowski pawel-big-lebowski changed the title [Spark] Support input dataset partitions for object storage locations [Spark] Normalize dataset names via configurable trimmers Oct 1, 2025
@pawel-big-lebowski pawel-big-lebowski changed the title [Spark] Normalize dataset names via configurable trimmers [Spark] Normalize dataset names with configurable trimmers Oct 1, 2025
@pawel-big-lebowski pawel-big-lebowski force-pushed the spec/subset-facet-in-one-schema-file branch from 51b8ac6 to 5196219 Compare October 1, 2025 10:29

@Test
void testTrimNonPath() {
assertThat(trimmer.canTrim("tmp/20250721")).isFalse();
Contributor

@dolfinus dolfinus Oct 1, 2025

So trimmers are not applied to object storage paths that do not start with /, e.g. S3?

Collaborator Author

Oh, I forgot that S3 names do not start with /.
I changed the trimmers so that canTrim returns false if trimming would result in an empty string.
This covers table names which may match trimming conditions.
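One way to read the rule described here, sketched in Java under assumptions: a name is trimmable only if its last part looks like a date and something non-empty remains after removing it. `DateTrimGuardSketch` and `isCompactDate` are hypothetical names, the yyyyMMdd format is assumed from the test inputs in this thread, and treating a relative name such as `bucket-dir/20250721` as trimmable is an interpretation rather than confirmed behavior.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

/** Illustrative only: refuses to trim when the result would be an empty name. */
final class DateTrimGuardSketch {

  static boolean canTrim(String name) {
    int slash = name.lastIndexOf('/');
    String lastPart = name.substring(slash + 1);
    // What would be left after trimming; empty for names like "20250721" or "/20250721".
    String remainder = slash < 0 ? "" : name.substring(0, slash);
    return isCompactDate(lastPart) && !remainder.isEmpty();
  }

  /** True for an eight-character yyyyMMdd string that is a valid calendar date. */
  static boolean isCompactDate(String part) {
    if (part.length() != 8) {
      return false;
    }
    try {
      LocalDate.parse(part, DateTimeFormatter.BASIC_ISO_DATE);
      return true;
    } catch (DateTimeParseException e) {
      return false;
    }
  }
}
```

Under this reading, a table name that happens to be a date is left intact, while a path without a leading slash can still be trimmed as long as a prefix remains.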

Contributor

@dolfinus dolfinus Oct 1, 2025

Ok, but please add tests for S3 paths which are not trimmed to an empty string.

Collaborator Author

ok, makes sense. I added that.

Contributor

Thanks!

Signed-off-by: Pawel Leszczynski <[email protected]>
@pawel-big-lebowski pawel-big-lebowski force-pushed the spec/subset-facet-in-one-schema-file branch from 5196219 to 36a7eb3 Compare October 1, 2025 11:48
Contributor

@dolfinus dolfinus left a comment

@pawel-big-lebowski Thanks, nice feature!

@pawel-big-lebowski pawel-big-lebowski merged commit 2919f51 into main Oct 1, 2025
44 checks passed
@pawel-big-lebowski pawel-big-lebowski deleted the spec/subset-facet-in-one-schema-file branch October 1, 2025 15:15
Labels

area:client/java openlineage-java area:client/python openlineage-python area:documentation Improvements or additions to documentation area:integration/flink area:integration/spark area:spec Specifications and standards for the project area:tests Testing code language:java Uses Java programming language language:python Uses Python programming language

5 participants