Core: Avoid table corruption from 409 on self conflicts after 5xx retries by throwing CommitStateUnknown #12818
core/src/main/java/org/apache/iceberg/rest/RESTTableOperations.java
Looks good. I made an offline comment about "was-retried" or "is-retry" vs "is-retried", but that's a nit. I also think just "retry" - true/false is OK.
One minor request: please add a test where we check that CSUnknown is actually thrown, if possible.
Thank you for your feedback, Russell. I addressed it in the latest commit.
@amogh-jahagirdar Do you have any other feedback?
Overall the fix looks great @singhpk234, I just had a minor comment on a public builder API that'd be good to address before we release anything.
core/src/main/java/org/apache/iceberg/rest/responses/ErrorResponse.java
Just had a nit on the comment, but the change looks good. Thank you @singhpk234, this is an important fix!
// If the request was retried, and if its final error was 409, It could probably also
// mean that HTTP retries when happened the IRC service could have actually applied
// the commit in their persistence, while giving back the client still a 5xx.
// If so, since the base has changed, it could conflict with itself.
// In cases like this its best not to mark this as failed instead
// make this is a commit state unknown, so some reconciliation can happen and
// the metadata clean-up is not triggered.
Nit on comment, maybe something like this?
/**
* If a retried request finally fails with 409,
* the IRC service may have persisted the commit
* despite initial 5xx errors, resulting in a self-conflict on retry
* due to the base changing.
* Mark this failure as commit state unknown rather than failed to prevent file cleanup.
*/
…ries by throwing CommitStateUnknown (apache#12818)
… 5xx retries by throwing CommitStateUnknown (apache#12818)" This reverts commit 4927610.
…ries by throwing CommitStateUnknown (#12818)
About the change
Solves: #12792 (comment)
In the IRC client we retry on 5xx. This is an HTTP-client-level retry: no reconciliation, just resending the same request without rebasing (please ref). It is a well-known issue that a service can apply the commit in its persistence despite returning a 5xx (compare the Glue, non-IRC, issue here), while the client still sees a 5xx and retries. When the request is retried by the HTTP client and the persistence has already applied the commit, the service now returns 409. The client then gets back a CommitFailed exception, which makes the table delete the metadata of a commit that actually succeeded.
The fix proposed here throws CommitStateUnknown instead, so that the client can reconcile the state and, at the bare minimum, doesn't treat the commit as failed and clean up the metadata files.
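To make the failure mode concrete, here is a minimal, self-contained sketch of the scenario and of the proposed classification. It is not the code in `RESTTableOperations`; the `FlakyServer`, `commitWithRetries`, `classify`, and `Outcome` names are hypothetical and exist only for illustration, and treating every other non-2xx status as unknown is a simplifying assumption of the sketch.

```java
public class SelfConflictSketch {
  /** Hypothetical outcomes; the real client throws exceptions instead. */
  enum Outcome { COMMITTED, COMMIT_FAILED, COMMIT_STATE_UNKNOWN }

  /**
   * Hypothetical server that persists a commit but can still answer 503 once.
   * After that commit the client's base version is stale, so a resend gets 409.
   */
  static final class FlakyServer {
    private int version = 0;
    private boolean failNextResponse;

    FlakyServer(boolean failNextResponse) {
      this.failNextResponse = failNextResponse;
    }

    int commit(int baseVersion) {
      if (baseVersion != version) {
        return 409; // base has changed: conflict
      }
      version++; // the commit is applied in persistence
      if (failNextResponse) {
        failNextResponse = false;
        return 503; // ...but the client is told it failed
      }
      return 200;
    }
  }

  /** HTTP-level retry: resend the same request on 5xx, with no rebase. */
  static Outcome commitWithRetries(FlakyServer server, int baseVersion, int maxAttempts) {
    boolean wasRetried = false;
    int status = 0;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      wasRetried = attempt > 1;
      status = server.commit(baseVersion);
      if (status < 500) {
        break; // only 5xx responses are retried
      }
    }
    return classify(status, wasRetried);
  }

  /** A 409 after a retry may be a self-conflict, so it is not a definite failure. */
  static Outcome classify(int status, boolean wasRetried) {
    if (status >= 200 && status < 300) {
      return Outcome.COMMITTED;
    }
    if (status == 409) {
      return wasRetried ? Outcome.COMMIT_STATE_UNKNOWN : Outcome.COMMIT_FAILED;
    }
    // Assumption of this sketch: anything else is treated as unknown.
    return Outcome.COMMIT_STATE_UNKNOWN;
  }
}
```

With `new FlakyServer(true)`, the first attempt is persisted but answered with 503, the resend hits 409, and the sketch reports `COMMIT_STATE_UNKNOWN` rather than `COMMIT_FAILED`, so no metadata cleanup would be triggered. A 409 on the first attempt is still an ordinary conflict.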
Testing
Added new UT