[Spark] Normalize dataset names with configurable trimmers #3996
Conversation
Force-pushed 7e6cc89 to aed8d72
Force-pushed 4317bb2 to 77abf21
Force-pushed dff7a69 to de15546
Force-pushed d3c5629 to 41a3ac1
Force-pushed 17dc460 to 4fcdaec
Reviewed code:

    ...
    "inputs": {
      "facets": {
        "dataSource": {
This probably should be datasetSubset
@dolfinus This file shall be removed. I'll create a separate PR to document subset definition facets. Thanks for spotting this.
Reviewed code:

    /**
     * Returns the last path of the dataset name.
What is the last path of a dataset name?
If a dataset name is a path, then the last path should return the trailing directory.
Maybe path -> part? That's what Python (https://docs.python.org/3/library/pathlib.html#accessing-individual-parts) and Java (https://docs.oracle.com/javase/8/docs/api/java/nio/file/Paths.html) use.
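For illustration, a minimal sketch of the "last part" semantics discussed in this thread (a hypothetical helper, not the actual OpenLineage code):

```java
// Hypothetical sketch: for a path-like dataset name, the "last part" is the
// trailing segment after the final '/'.
public class LastPartExample {
    static String getLastPart(String name) {
        // ignore a single trailing slash, then take everything after the last '/'
        String n = name.endsWith("/") ? name.substring(0, name.length() - 1) : name;
        int idx = n.lastIndexOf('/');
        return idx >= 0 ? n.substring(idx + 1) : n;
    }

    public static void main(String[] args) {
        System.out.println(getLastPart("/tmp/some/tested-path/20250721")); // 20250721
        System.out.println(getLastPart("table_name")); // table_name
    }
}
```

A name without any '/' (e.g. a table name) is returned unchanged, which matters for the non-path cases discussed later in this review.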
Reviewed code:

    /**
     * Given another dataset, it reduces and merges the two datasets. If the paths of the two datasets
What do you mean by reducing? What is trimming vs reducing here? How are those getting merged? I think those terms need to be clearly defined.
Trimming is a string operation that trims the dataset name. Reducing works on a collection of datasets: if the collection has datasets with the same trimmed name and the same facets, those datasets should be merged into a single dataset with the trimmed name. The merging process is a reducer. I'll add more comments in the code.
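The distinction can be illustrated with a small sketch (assumed names, not the real DatasetNameTrimmer/DatasetReducer classes): trimming is a per-name string operation, while reducing groups a collection of datasets by trimmed name and merges each group.

```java
import java.util.*;
import java.util.stream.*;

public class TrimReduceSketch {
    // assumed trimmer: strip a trailing /yyyyMMdd partition directory
    static String trim(String name) {
        return name.replaceAll("/\\d{8}$", "");
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
            "/tmp/data/20250721", "/tmp/data/20250722", "/tmp/other/20250721");
        // "reduce": datasets whose trimmed names match collapse into one group;
        // the real reducer additionally requires matching facets before merging
        Map<String, List<String>> merged = names.stream()
            .collect(Collectors.groupingBy(
                TrimReduceSketch::trim, TreeMap::new, Collectors.toList()));
        System.out.println(merged.keySet()); // [/tmp/data, /tmp/other]
    }
}
```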
Reviewed code:

    /**
     * Trims the directory if the last path part is a string representing a date in the format /yyyy/MM or /yyyy/MM/dd
     */
    public class MultiDirTrimmer implements DatasetNameTrimmer {
Suggested change:

    - public class MultiDirTrimmer implements DatasetNameTrimmer {
    + public class MultiDirDateTrimmer implements DatasetNameTrimmer {

The current name can be misleading.
Reviewed code:

    .map(((Class<InputDataset>) InputDataset.class)::cast))
    .orElse(Stream.empty()))
    .collect(Collectors.toList());
    datasets =
Not for this PR, but it would be amazing if someday we could refactor the integration into something that has well-defined places where extraction and later processing of individual parts of the eventual run event take place. Right now we're doing "random" modifications and sticking them in "random" places in the codebase.
This is confusing to someone who wants to add or modify handling of a particular logical plan node, as it's not well defined what will happen to the results of that change, and to someone who wants to apply some more global processing. It's even confusing to someone who has seen this code a decent amount of time (like me). To go down a few levels, it would be best if the OpenLineageRunEventBuilder class did not do any processing, but just assembled the event from given parts.
Great suggestion!
…S3 locations) Signed-off-by: Pawel Leszczynski <[email protected]>
Force-pushed 0be4ac1 to 54656f1
@dolfinus @mobuchowski I applied changes from your comments within the second commit.
Reviewed code:

    @Override
    public boolean canTrim(String name) {
      String lastPart = getLastPart(name);
IMHO this trimmer could detect the first part containing the = symbol and trim it together with the rest of the string, instead of iterating over the last path part until everything is trimmed. But the current approach works as well.
The mechanism needs to support mixed trimming like /path/key=value/20250930/key2=value2, so for the sake of simplicity a trimmer only has to trim the last part. At least, I didn't want to implement the same thing for all the trimmers.
MultiDirDateTrimmer already works with several parts, not just the last one.
My concern is that for a path like /path/key=value/randomvalue/key2=value2 the dataset name will be /path/key=value/randomvalue, not /path. But this is not a real case; IMHO the whole mixed naming convention for paths (both partitioning by key=value and by yyyy/mm/dd) is something that can't appear in real paths, it's either one or the other. But I may be wrong :)
Yes, /path/key=value/not-date-value/key2=value2 will result in /path/key=value/not-date-value. This is intentional, as I did not want to trim parts the mechanism does not understand. Let's leave it for later ;)
But /path/key=value/20250930/key2=value2 will be trimmed to /path - this is implemented within trimDatasetName of io.openlineage.client.dataset.partition.ReducedDataset.
Ok, let's keep current implementation for now
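The last-part-at-a-time loop described in this thread can be sketched roughly as follows (the pattern checks are simplified assumptions, not the actual trimmer classes):

```java
// Illustrative loop: repeatedly strip the last path part while any
// "trimmer" recognizes it; stop at the first unrecognized part.
public class IterativeTrimSketch {
    // assumed recognizers: key=value partition dirs and yyyyMMdd date dirs
    static boolean canTrimLastPart(String part) {
        return part.contains("=") || part.matches("\\d{8}");
    }

    static String trimDatasetName(String name) {
        while (true) {
            int idx = name.lastIndexOf('/');
            if (idx <= 0) return name; // nothing left to split off
            String last = name.substring(idx + 1);
            if (!canTrimLastPart(last)) return name; // unrecognized part stops trimming
            name = name.substring(0, idx);
        }
    }

    public static void main(String[] args) {
        System.out.println(trimDatasetName("/path/key=value/20250930/key2=value2")); // /path
    }
}
```

This also reproduces the behavior discussed above: /path/key=value/not-date-value/key2=value2 stops at /path/key=value/not-date-value, because the unrecognized part halts the loop.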
.../java/src/test/java/io/openlineage/client/dataset/partition/trimmer/KeyValueTrimmerTest.java
Force-pushed 54656f1 to 99eb0bf
Signed-off-by: Pawel Leszczynski <[email protected]>
Force-pushed 99eb0bf to 30013b7
client/java/src/main/java/io/openlineage/client/dataset/partition/DatasetReducer.java
Reviewed code:

    // one with each other
    List<ReducedDataset> reducedDatasets = new ArrayList<>();
    for (List<ReducedDataset> sameNameList : toReduce.values()) {
      AtomicBoolean reducedSomething = new AtomicBoolean(true);
Is an atomic variable required? It's not an object- or class-level attribute, so how can other threads access it?
no, it's not required
Reviewed code:

    // get three last parts and verify if they're a date
    String lastTwoParts = dirs[dirs.length - 3] + dirs[dirs.length - 2] + dirs[dirs.length - 1];
Suggested change:

    - String lastTwoParts = dirs[dirs.length - 3] + dirs[dirs.length - 2] + dirs[dirs.length - 1];
    + String lastThreeParts = dirs[dirs.length - 3] + dirs[dirs.length - 2] + dirs[dirs.length - 1];
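The three-segment date check being renamed here could look roughly like this (assumed behavior: trailing segments such as .../2025/07/21 concatenate to a yyyyMMdd date):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class DatePartsCheck {
    // Sketch, not the actual MultiDirTrimmer code: join the last three path
    // segments and test whether the result parses as a yyyyMMdd date.
    static boolean lastThreePartsAreDate(String[] dirs) {
        if (dirs.length < 3) return false;
        String lastThreeParts =
            dirs[dirs.length - 3] + dirs[dirs.length - 2] + dirs[dirs.length - 1];
        try {
            LocalDate.parse(lastThreeParts, DateTimeFormatter.BASIC_ISO_DATE); // yyyyMMdd
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(lastThreePartsAreDate(new String[]{"tmp", "2025", "07", "21"})); // true
    }
}
```

Parsing with LocalDate also rejects nonsense like 2025/13/45, which a pure regex check would accept.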
Reviewed code:

    }
    return openLineage
        .newOutputDatasetBuilder()
        .name(r.getTrimmedDatasetName())
Trimming should probably be applied to the columnLineage facet as well, to keep valid references to the input dataset. Maybe in another PR.
That's a great catch. Thank you!
I'll prepare another PR for that.
Reviewed code:

    @Test
    void testTrim() {
      assertThat(trimmer.canTrim("/tmp/20250721")).isTrue();
Please add test cases where the dataset name doesn't start with / (S3 object names) and doesn't contain / at all (table names; some may include a date as part of the name).
Reviewed code:

    new DatasetReducer(
            openLineageContext.getOpenLineage(),
            openLineageContext.getOpenLineageConfig().getDatasetConfig())
        .reduceInputs(datasets);
The PR title and description should be adjusted to mention that not only object storage is handled here: S3 is object storage, HDFS is not, and both are trimmed.
Trimmers are applied to all dataset names, even those that don't contain any paths, like table names, but in that case the dataset name is left intact.
Force-pushed 51b8ac6 to 5196219
Reviewed code:

    @Test
    void testTrimNonPath() {
      assertThat(trimmer.canTrim("tmp/20250721")).isFalse();
So trimmers are not applied to object storage paths which do not start with /, e.g. S3?
Oh, I forgot an S3 name does not start with /.
I changed the trimmers so that canTrim returns false if trimming would result in an empty string.
This also covers table names which may match trimming conditions.
Ok, but please add tests for S3 paths which are not trimmed to an empty string.
ok, makes sense. I added that.
Thanks!
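The empty-result guard agreed on in this thread might look roughly like this (a hypothetical sketch, not the actual trimmer code; the input strings here are illustrative):

```java
// Sketch: refuse to trim when trimming would leave an empty dataset name.
// This protects S3-style names without a leading '/' and plain table names.
public class EmptyResultGuard {
    // assumed trim step: drop everything after the last '/'
    static String trimLast(String name) {
        int idx = name.lastIndexOf('/');
        return idx >= 0 ? name.substring(0, idx) : "";
    }

    static boolean canTrim(String name) {
        // guard: trimming that empties the name is rejected
        return !trimLast(name).isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(canTrim("bucket/data/20250721")); // true
        System.out.println(canTrim("20250721"));             // false
    }
}
```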
Signed-off-by: Pawel Leszczynski <[email protected]>
Force-pushed 5196219 to 36a7eb3
Signed-off-by: Pawel Leszczynski <[email protected]>
dolfinus left a comment:
@pawel-big-lebowski Thanks, nice feature!

The PR introduces the subset definition facets proposed in the partitioning proposal. It implements the logic of merging input datasets for Spark. For example, the input datasets

    /tmp/some/tested-path/20250721/20250722T901Z/key=value
    /tmp/some/tested-path/20250720/20250721T901Z/key=value

will be merged into a /tmp/some/tested-path dataset with a subset definition facet containing a list of locations. The merge will happen only if the datasets have all the same facets. This can solve the problem of excessively large OpenLineage events.

Missing things in the PR are noted with a TODO comment.