[IcebergIO] Filter out data files that have already been committed #34264
// TODO(ahmedabu98): This does not cover concurrent writes from other pipelines, where the
// "last successful snapshot" might reflect commits from other sources. Ideally, we would make
// this stateful, but that is update incompatible.
// TODO(ahmedabu98): add load test pipelines with intentional periodic crashing
Not feasible to have a meaningful test for this without Dataflow
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment "assign set of reviewers".
Assigning reviewers. If you would like to opt out of this review, comment R: @robertwb for label java. Available commands:
The PR bot will only process comments in the main thread (not review comments).
sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/AppendFilesToTables.java
Thanks, LGTM
sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/AppendFilesToTables.java
Thanks
…pache#34264)
* remove already committed files
* changes
* simplify validation
* validate without loading the whole collection into memory
Makes the sink more resilient to bundle retries.
Fixes #34074
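For readers unfamiliar with the failure mode, here is a minimal sketch of the general idea, not the code from this PR. It assumes data files are identified by path strings and that the set of already-committed paths has been gathered beforehand (in the real sink this would come from table metadata; here it is a plain in-memory set). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper illustrating the filtering idea; not part of IcebergIO.
final class CommittedFileFilter {

    // Streams over the pending file paths once and keeps only those that are
    // not already recorded as committed. The pending collection is never
    // copied in full; only the surviving paths are collected.
    static List<String> filterUncommitted(Iterable<String> pending, Set<String> alreadyCommitted) {
        List<String> toCommit = new ArrayList<>();
        for (String path : pending) {
            if (!alreadyCommitted.contains(path)) {
                toCommit.add(path);
            }
        }
        return toCommit;
    }

    public static void main(String[] args) {
        // Assumption: a previous attempt of this bundle committed a and b,
        // then failed before the runner acknowledged the bundle.
        Set<String> committed = new HashSet<>(Arrays.asList("a.parquet", "b.parquet"));
        // The retried bundle replays all three files.
        List<String> retried = Arrays.asList("a.parquet", "b.parquet", "c.parquet");
        System.out.println(filterUncommitted(retried, committed)); // prints [c.parquet]
    }
}
```

On a retry, only the files missing from the committed set are appended, so a replayed bundle does not add the same data file twice. As the TODO in the diff notes, a check of this kind does not account for concurrent writes from other pipelines.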