Spec: Update v3 summary, add row lineage #12982

Open: 4 commits to merge into base `main`

Conversation

Contributor

@rdblue commented May 6, 2025

This adds a summary of row lineage changes to the section on v3 changes.

@github-actions bot added the Specification label (Issues that may introduce spec changes) on May 6, 2025
format/spec.md Outdated
@@ -1680,6 +1680,23 @@ Row-level delete changes:
* These position delete files must be merged into the DV for a data file when one is created
* Position delete files that contain deletes for more than one data file need to be kept in table metadata until all deletes are replaced by DVs

Row lineage changes:

* Writers must set the table's `next-row-id` and use the existing `next-row-id` as the `first-row-id` to create a new snapshot
Member

Suggested change
- * Writers must set the table's `next-row-id` and use the existing `next-row-id` as the `first-row-id` to create a new snapshot
+ * Writers must set the table's `next-row-id` and use the existing `next-row-id` as the `first-row-id` when creating new snapshots

Contributor Author

I considered this language, but I think it's more clear to use a singular snapshot. I don't want to imply that you would use next-row-id for multiple snapshots.

Member

I'm 60:40 on my version versus yours so not a huge deal if you don't want to change it

Contributor Author

You're right. I committed this change. I think I'm just being pedantic.
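The point about a single snapshot can be illustrated with a minimal sketch. The names `TableMetadata`, `Snapshot`, and `commit_snapshot` are hypothetical and not part of any Iceberg API, and advancing `next-row-id` by the number of rows added is an assumption that this thread does not spell out. The sketch only shows that each new snapshot takes the current `next-row-id` as its `first-row-id`, and that the table value is then moved forward so the same value is not handed to another snapshot.

```python
from dataclasses import dataclass


@dataclass
class TableMetadata:
    # Hypothetical stand-in for the table-level `next-row-id` field.
    next_row_id: int = 0


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical stand-in for a snapshot's `first-row-id`.
    first_row_id: int
    added_rows: int


def commit_snapshot(table: TableMetadata, added_rows: int) -> Snapshot:
    """Create one snapshot and advance the table's next-row-id.

    Assumption: next-row-id advances by the number of rows added.
    """
    if added_rows < 0:
        raise ValueError("added_rows must be non-negative")
    # The existing next-row-id becomes this snapshot's first-row-id.
    snapshot = Snapshot(first_row_id=table.next_row_id, added_rows=added_rows)
    # The writer must also set the table's next-row-id for later commits.
    table.next_row_id += added_rows
    return snapshot
```

Committing twice against the same `TableMetadata` yields two snapshots with different `first_row_id` values, which is the behavior the singular wording was meant to protect.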

* Replace any null or missing `_row_id` with the data file's `first_row_id` plus the row's `_pos`
* Replace any null or missing `_last_updated_sequence_number` with the data file's `data_sequence_number`
* Read any non-null `_row_id` or `_last_updated_sequence_number` without modification
* When a data file has a null `first_row_id`, readers must produce null for `_row_id` and `_last_updated_sequence_number`
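The reader rules in the hunk above can be sketched as a single function. This is an illustration under stated assumptions, not the reference implementation: `read_lineage` and its parameter names are hypothetical, and a stored value that is null or missing is modeled as Python `None`.

```python
from typing import Optional, Tuple


def read_lineage(
    row_id: Optional[int],
    last_updated_seq: Optional[int],
    pos: int,
    first_row_id: Optional[int],
    data_sequence_number: int,
) -> Tuple[Optional[int], Optional[int]]:
    """Return (_row_id, _last_updated_sequence_number) as a reader would.

    Hypothetical helper: None stands for a null or missing stored value.
    """
    # A data file with a null first_row_id produces null for both fields.
    if first_row_id is None:
        return None, None
    # A null or missing _row_id becomes first_row_id plus the row's _pos.
    if row_id is None:
        row_id = first_row_id + pos
    # A null or missing _last_updated_sequence_number becomes the data
    # file's data_sequence_number.
    if last_updated_seq is None:
        last_updated_seq = data_sequence_number
    # Non-null stored values pass through without modification.
    return row_id, last_updated_seq
```

For example, a row at position 3 with no stored lineage in a file whose `first_row_id` is 1000 and `data_sequence_number` is 7 reads as `(1003, 7)`, while a row that already stores `_row_id` 42 keeps that value.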
Member

I think we are missing some of the inheritance path here. The difference between "null" and "assigned and null" I think is getting a little bit mixed up

Member

@RussellSpitzer May 6, 2025

I think maybe for

For backwards compatibility, readers of any snapshot with a missing `first_row_id` 
must produce null for `_row_id` and `_last_updated_sequence_number` 
for all rows in that snapshot.

Contributor Author

I took another pass and updated for this concern. I added:

  • New data files are written with a null `first_row_id`
  • Inheritance happens when the manifest has a non-null `first_row_id`

I think the requirements under this point specifically are okay. When the data file has a `first_row_id`, the value of `_row_id` must never be read as null or missing. The missing case is for when the data file is entirely new and has no row IDs. And a data file may have only some rows with null `_row_id` when they are written by a MERGE statement; we inject a null for inserted rows and preserve `_row_id` for the updated rows. Similar logic applies to `_last_updated_sequence_number`, and the final sub-point is that rows with non-null `_row_id` or `_last_updated_sequence_number` should not have those values replaced.

Contributor Author

> For backwards compatibility, readers of any snapshot with a missing `first_row_id`
> must produce null for `_row_id` and `_last_updated_sequence_number`
> for all rows in that snapshot.

I did this slightly differently by stating how to handle a null `first_row_id` when assigning or inheriting. I think it's a bit clearer this way because it documents each link in the chain: manifest list, manifest, and data file.
