Spec: Update v3 summary, add row lineage #12982
base: main
format/spec.md
Outdated
@@ -1680,6 +1680,23 @@ Row-level delete changes:
* These position delete files must be merged into the DV for a data file when one is created
* Position delete files that contain deletes for more than one data file need to be kept in table metadata until all deletes are replaced by DVs

Row lineage changes:

* Writers must set the table's `next-row-id` and use the existing `next-row-id` as the `first-row-id` to create a new snapshot
- * Writers must set the table's `next-row-id` and use the existing `next-row-id` as the `first-row-id` to create a new snapshot
+ * Writers must set the table's `next-row-id` and use the existing `next-row-id` as the `first-row-id` when creating new snapshots
I considered this language, but I think it's more clear to use a singular snapshot. I don't want to imply that you would use `next-row-id` for multiple snapshots.
I'm 60:40 on my version versus yours, so it's not a huge deal if you don't want to change it.
You're right. I committed this change. I think I'm just being pedantic.
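For illustration, the writer rule discussed in this thread can be sketched roughly as follows. This is a hypothetical sketch, not the Iceberg implementation: `TableMetadata`, `Snapshot`, and `commit_snapshot` are made-up names, and advancing `next-row-id` by the number of rows that receive IDs is an assumption that the quoted spec line does not spell out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Snapshot:
    # Hypothetical stand-in for a snapshot's lineage fields.
    first_row_id: int
    assigned_rows: int


@dataclass
class TableMetadata:
    # Hypothetical stand-in for table metadata; only lineage state is modeled.
    next_row_id: int = 0
    snapshots: List[Snapshot] = field(default_factory=list)


def commit_snapshot(table: TableMetadata, assigned_rows: int) -> Snapshot:
    """Create one snapshot, using the existing next-row-id as its first-row-id."""
    snapshot = Snapshot(first_row_id=table.next_row_id, assigned_rows=assigned_rows)
    # Assumption: next-row-id advances past every row ID reserved by this snapshot.
    table.next_row_id += assigned_rows
    table.snapshots.append(snapshot)
    return snapshot
```

Each call consumes the current `next-row-id` exactly once, which matches the intent stated above that a given `next-row-id` value is not reused across multiple snapshots.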
* Replace any null or missing `_row_id` with the data file's `first_row_id` plus the row's `_pos`
* Replace any null or missing `_last_updated_sequence_number` to the data file's `data_sequence_number`
* Read any non-null `_row_id` or `_last_updated_sequence_number` without modification
* When a data file has a null `first_row_id`, readers must produce null for `_row_id` and `_last_updated_sequence_number`
I think we are missing some of the inheritance path here. I think the difference between "null" and "assigned and null" is getting a little bit mixed up.
I think maybe for
For backwards compatibility, readers of any snapshot with a missing `first_row_id`
must produce null for `_row_id` and `_last_updated_sequence_number`
for all rows in that snapshot.
I took another pass and updated for this concern. I added:
- New data files are written with a null `first_row_id`
- Inheritance happens when the manifest has a non-null `first_row_id`

I think the requirements under this point specifically are okay. When the data file has a `first_row_id`, the value of `_row_id` must never be read as null or missing. The missing case is for when the data file is entirely new and has no row IDs. And a data file may have only some rows with null `_row_id` when they are written by a MERGE statement; we inject a null for inserted rows and preserve `_row_id` for the updated rows. Similar logic applies to `_last_updated_sequence_number`, and the final sub-point is that rows with non-null `_row_id` or `_last_updated_sequence_number` should not have those values replaced.
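The reader rules described in this comment and in the diff hunk above could be sketched roughly as follows. This is a minimal illustration, not the Iceberg reader: `resolve_lineage` and its parameter names are hypothetical, and the per-row inputs are assumed to already be available to the reader.

```python
from typing import Optional, Tuple


def resolve_lineage(
    first_row_id: Optional[int],
    data_sequence_number: int,
    pos: int,
    row_id: Optional[int] = None,
    last_updated_seq: Optional[int] = None,
) -> Tuple[Optional[int], Optional[int]]:
    """Return (_row_id, _last_updated_sequence_number) for one row (hypothetical helper)."""
    # A data file with a null first_row_id yields null for both lineage fields.
    if first_row_id is None:
        return None, None
    # A null or missing _row_id is replaced by first_row_id plus the row's _pos.
    if row_id is None:
        row_id = first_row_id + pos
    # A null or missing _last_updated_sequence_number is replaced by the
    # data file's data_sequence_number.
    if last_updated_seq is None:
        last_updated_seq = data_sequence_number
    # Non-null stored values pass through without modification.
    return row_id, last_updated_seq
```

For a file written by a MERGE, inserted rows would arrive with `row_id=None` and take the first branch of the replacement logic, while updated rows keep their stored `_row_id`.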
> For backwards compatibility, readers of any snapshot with a missing `first_row_id` must produce null for `_row_id` and `_last_updated_sequence_number` for all rows in that snapshot.

I did this slightly differently by stating how to handle a null `first_row_id` when assigning or inheriting. I think it's a bit more clear this way because it documents each link in the chain: manifest list, manifest, and data file.
Co-authored-by: Russell Spitzer <[email protected]>
This adds a summary of row lineage changes to the section on v3 changes.