[IcebergIO] Support hash distribution mode when writing rows #38061
ahmedabu98 wants to merge 8 commits into apache:master
Conversation
Summary of Changes (Gemini Code Assist): This pull request enhances the IcebergIO sink by adding an optional feature to group rows by partition before writing them to the destination. The change is designed to improve performance and reduce the creation of small files in partitioned Iceberg tables. The implementation introduces new transforms and utility classes to handle the grouping and writing logic, and updates the existing API and test suites to support the new configuration.
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, use the corresponding bot command.
Assigning reviewers: R: @claudevdm for label python. Note: if you would like to opt out of this review, use one of the available bot commands.
The PR bot will only process comments in the main thread (not review comments).
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##            master   #38061      +/-  ##
=============================================
+ Coverage     54.61%   58.51%     +3.89%
- Complexity     1689    15428    +13739
=============================================
  Files          1067     2851     +1784
  Lines        168152   280076   +111924
  Branches       1226    12332    +11106
=============================================
+ Hits          91835   163873    +72038
- Misses        74118   109777    +35659
- Partials       2199     6426     +4227

Flags with carried forward coverage won't be shown.
Reminder, please take a look at this pr: @claudevdm @chamikaramj
/gemini review
Code Review
This pull request introduces a new distribution mode for Iceberg writes, allowing rows to be shuffled by partition key before writing to reduce the number of small files. It includes the implementation of AssignDestinationsAndPartitions and WriteToPartitions transforms, along with a BeamRowWrapper for Iceberg's StructLike interface. Feedback focuses on critical serialization issues in AssignDoFn, where non-serializable maps must be marked transient and initialized in @Setup. Additionally, improvements are suggested for resource management in WritePartitionedRowsToFiles using try-finally blocks and optimizing the table cache with double-checked locking.
    private final Map<String, PartitionKey> partitionKeys = new HashMap<>();
    private transient @MonotonicNonNull Map<String, BeamRowWrapper> wrappers;
The partitionKeys map contains PartitionKey objects, which are not Serializable. Since DoFn instances are serialized to be distributed to workers, this will cause a NotSerializableException at pipeline submission or execution time. This map should be marked transient and initialized in the @Setup method, as the wrappers map already is.
Suggested change:

-   private final Map<String, PartitionKey> partitionKeys = new HashMap<>();
+   private transient @MonotonicNonNull Map<String, PartitionKey> partitionKeys;
    private transient @MonotonicNonNull Map<String, BeamRowWrapper> wrappers;

Related context in the DoFn's @Setup:

    @Setup
    public void setup() {
      this.wrappers = new HashMap<>();
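To make the transient-plus-@Setup pattern concrete, here is a minimal, self-contained sketch (plain Java, not the actual Beam DoFn; the class and field names are hypothetical). It shows why per-worker state holding non-serializable values should be a transient field that is rebuilt in a setup hook after deserialization, rather than a final field populated at construction time.

```java
import java.io.*;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the transient + @Setup pattern the review suggests.
// If `cache` were a non-transient field holding non-Serializable values,
// serializing this object would throw NotSerializableException.
public class TransientStateDemo implements Serializable {
  // transient: skipped by Java serialization, so non-serializable values
  // stored here never reach the serializer.
  private transient Map<String, Object> cache;

  // Plays the role of DoFn's @Setup: runs on the worker after deserialization.
  public void setup() {
    cache = new HashMap<>();
  }

  public int cacheSize() {
    return cache.size();
  }

  // Serialize and deserialize, mimicking pipeline submission to a worker.
  public static TransientStateDemo roundTrip(TransientStateDemo fn) throws Exception {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bos);
    out.writeObject(fn);
    out.flush();
    ObjectInputStream in =
        new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
    return (TransientStateDemo) in.readObject();
  }

  public static void main(String[] args) throws Exception {
    TransientStateDemo fn = new TransientStateDemo();
    fn.setup();
    fn.cache.put("tbl", new Object()); // Object is not Serializable
    TransientStateDemo onWorker = roundTrip(fn); // succeeds because cache is transient
    onWorker.setup(); // state is rebuilt per worker, starting empty
    System.out.println(onWorker.cacheSize()); // prints 0
  }
}
```

The key point is that transient state is simply dropped during serialization, so each worker must re-create it in @Setup before use.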
Quoted context from AssignDoFn.processElement:

          ValueInSingleWindow.of(element, timestamp, window, paneInfo));
      Row data = dynamicDestinations.getData(element);

      @Nullable PartitionKey partitionKey = partitionKeys.get(tableIdentifier);
      ...
      }
      partitionKey = new PartitionKey(spec, schema);
      wrapper = new BeamRowWrapper(data.getSchema(), schema.asStruct());
      partitionKeys.put(tableIdentifier, partitionKey);
      RecordWriter writer =
          new RecordWriter(table, destination.getFileFormat(), fileName, partitionData);
      for (Row row : element.getValue()) {
        Record record = IcebergUtils.beamRowToIcebergRecord(table.schema(), row);
        writer.write(record);
      }
      writer.close();
The RecordWriter should be closed within a try-finally block. If an exception occurs during the write loop (e.g., due to data corruption or IO issues), the writer will not be closed, potentially leading to resource leaks such as open file handles. Note that writer is still needed after the block for writer.getDataFile(), so a try-finally is more appropriate than try-with-resources here.
RecordWriter writer =
    new RecordWriter(table, destination.getFileFormat(), fileName, partitionData);
try {
  for (Row row : element.getValue()) {
    Record record = IcebergUtils.beamRowToIcebergRecord(table.schema(), row);
    writer.write(record);
  }
} finally {
  writer.close();
}
Quoted context:

      @Nullable Table table = null;
      synchronized (LAST_REFRESHED_TABLE_CACHE) {
This synchronized block is missing a second check of the cache (double-checked locking pattern). Multiple threads that miss the cache for the same identifier will wait at the synchronized block and then all proceed to load or create the table sequentially. Adding a second getIfPresent check inside the block avoids redundant catalog operations.
synchronized (LAST_REFRESHED_TABLE_CACHE) {
  lastRefreshedTable = LAST_REFRESHED_TABLE_CACHE.getIfPresent(identifier);
  if (lastRefreshedTable != null && lastRefreshedTable.table != null) {
    lastRefreshedTable.refreshIfStale();
    return lastRefreshedTable.table;
  }
  // ... fall through to load the table from the catalog ...
This PR adds a new sink code path that groups rows by partition before writing, making partitioned writes significantly more efficient and scalable.
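Conceptually, the new code path can be sketched in plain Java (this is an illustrative stand-in, not the actual Beam transform; the row shape and method names are hypothetical): rows are bucketed by partition key before writing, so each partition produces one writer and one file per bundle instead of a small file per interleaved element.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch of grouping rows by partition key before writing.
public class GroupByPartitionDemo {
  // Each row is modeled as {partitionKey, payload} for illustration.
  public static Map<String, List<String>> groupByPartition(List<String[]> rows) {
    Map<String, List<String>> buckets = new HashMap<>();
    for (String[] row : rows) {
      // All rows sharing a partition key land in the same bucket,
      // so downstream they can be written by a single writer.
      buckets.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
    }
    return buckets;
  }

  public static void main(String[] args) {
    List<String[]> rows = Arrays.asList(
        new String[] {"2024-01-01", "a"},
        new String[] {"2024-01-02", "b"},
        new String[] {"2024-01-01", "c"});
    Map<String, List<String>> buckets = groupByPartition(rows);
    // One bucket (hence one file) per partition key instead of per element.
    System.out.println(buckets.get("2024-01-01")); // prints [a, c]
  }
}
```

In the actual pipeline this grouping is a shuffle keyed on destination and partition, but the effect is the same: co-locating rows of one partition so the writer emits fewer, larger files.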