Spec: Update v3 summary, add row lineage #12982

rdblue · 2025-05-06T16:44:06Z

This adds a summary of row lineage changes to the section on v3 changes.

format/spec.md

RussellSpitzer · 2025-05-06T17:37:57Z

@@ -1680,6 +1680,23 @@ Row-level delete changes:
    * These position delete files must be merged into the DV for a data file when one is created
    * Position delete files that contain deletes for more than one data file need to be kept in table metadata until all deletes are replaced by DVs

+Row lineage changes:
+
+* Writers must set the table's `next-row-id` and use the existing `next-row-id` as the `first-row-id` to create a new snapshot


Suggested change

-* Writers must set the table's `next-row-id` and use the existing `next-row-id` as the `first-row-id` to create a new snapshot
+* Writers must set the table's `next-row-id` and use the existing `next-row-id` as the `first-row-id` when creating new snapshots

I considered this language, but I think it's clearer to use a singular snapshot. I don't want to imply that you would use `next-row-id` for multiple snapshots.

I'm 60:40 on my version versus yours, so it's not a huge deal if you don't want to change it.

You're right. I committed this change. I think I'm just being pedantic.
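
To illustrate why the singular wording matters, here is a minimal sketch of the writer rule: each commit reads the existing `next-row-id` once, uses it as that one snapshot's `first-row-id`, and then sets a new `next-row-id`. The `TableMetadata`, `Snapshot`, and `commit_snapshot` names are hypothetical, and advancing by the number of added rows is an assumption that this thread does not spell out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Snapshot:
    snapshot_id: int
    first_row_id: int  # the table's next-row-id at the time of this commit


@dataclass
class TableMetadata:
    next_row_id: int = 0
    snapshots: List[Snapshot] = field(default_factory=list)


def commit_snapshot(table: TableMetadata, snapshot_id: int, added_rows: int) -> Snapshot:
    """Create exactly one snapshot and set the table's next-row-id."""
    # The existing next-row-id becomes this snapshot's first-row-id.
    snapshot = Snapshot(snapshot_id=snapshot_id, first_row_id=table.next_row_id)
    # Assumption (not stated in this thread): advance by the number of rows
    # that need new IDs, so the next snapshot starts past this one's range.
    table.next_row_id += added_rows
    table.snapshots.append(snapshot)
    return snapshot
```

Because `next_row_id` is updated on every commit, a given value is only ever used as the `first-row-id` of a single snapshot.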

format/spec.md

RussellSpitzer · 2025-05-06T18:55:05Z

+    * Replace any null or missing `_row_id` with the data file's `first_row_id` plus the row's `_pos`
+    * Replace any null or missing `_last_updated_sequence_number` with the data file's `data_sequence_number`
+    * Read any non-null `_row_id` or `_last_updated_sequence_number` without modification
+* When a data file has a null `first_row_id`, readers must produce null for `_row_id` and `_last_updated_sequence_number`


I think we are missing some of the inheritance path here. I think the difference between "null" and "assigned and null" is getting a little bit mixed up.

Maybe something like:

For backwards compatibility, readers of any snapshot with a missing `first_row_id` must produce null for `_row_id` and `_last_updated_sequence_number` for all rows in that snapshot.

I took another pass and updated for this concern. I added:

* New data files are written with a null `first_row_id`
* Inheritance happens when the manifest has a non-null `first_row_id`

I think the requirements under this point specifically are okay. When the data file has a `first_row_id`, the value of `_row_id` must never be read as null or missing. The missing case is for a data file that is entirely new and has no row IDs. A data file may also have only some rows with a null `_row_id` when it is written by a MERGE statement: we inject a null for inserted rows and preserve `_row_id` for the updated rows. Similar logic applies to `_last_updated_sequence_number`. The final sub-point is that rows with a non-null `_row_id` or `_last_updated_sequence_number` should not have those values replaced.
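
The read rules discussed in this comment can be sketched as a small function. This is only an illustration under stated assumptions: `resolve_lineage` and its parameter names are hypothetical, and a stored value of `None` stands for both a null value and a missing column.

```python
from typing import Optional, Tuple


def resolve_lineage(
    file_first_row_id: Optional[int],
    file_data_sequence_number: int,
    pos: int,
    stored_row_id: Optional[int] = None,
    stored_last_updated_seq: Optional[int] = None,
) -> Tuple[Optional[int], Optional[int]]:
    """Return (_row_id, _last_updated_sequence_number) for one row."""
    # A data file with a null first_row_id has no lineage: produce nulls.
    if file_first_row_id is None:
        return None, None
    # Null or missing _row_id is replaced by first_row_id plus the row's _pos;
    # a non-null stored value is read without modification.
    row_id = stored_row_id if stored_row_id is not None else file_first_row_id + pos
    # Null or missing _last_updated_sequence_number is replaced by the
    # data file's data_sequence_number; non-null values are kept.
    seq = (
        stored_last_updated_seq
        if stored_last_updated_seq is not None
        else file_data_sequence_number
    )
    return row_id, seq
```

For a file written by a MERGE with `first_row_id` 1000 and `data_sequence_number` 7, an updated row that kept `_row_id` 42 resolves with `resolve_lineage(1000, 7, 0, stored_row_id=42)`, while an inserted row at position 1 resolves with `resolve_lineage(1000, 7, 1)` and receives its ID from the file.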

> For backwards compatibility, readers of any snapshot with a missing `first_row_id` must produce null for `_row_id` and `_last_updated_sequence_number` for all rows in that snapshot.

I did this slightly differently by stating how to handle a null `first_row_id` when assigning or inheriting. I think it's a bit clearer this way because it documents each link in the chain: manifest list, manifest, and data file.
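
As a rough sketch of the last link in that chain, the following shows data files inheriting `first_row_id` from a manifest, and nothing being inherited when the manifest's `first_row_id` is null. The `DataFile` class and `inherit_first_row_ids` function are hypothetical names, and advancing the running ID by each file's record count is an assumption that is not stated in this thread.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataFile:
    record_count: int
    first_row_id: Optional[int] = None  # new data files are written with null


def inherit_first_row_ids(
    manifest_first_row_id: Optional[int], files: List[DataFile]
) -> None:
    """Assign first_row_id to files that do not have one yet, in place."""
    # Inheritance only happens when the manifest has a non-null first_row_id;
    # otherwise every file keeps its null and reads produce null lineage.
    if manifest_first_row_id is None:
        return
    next_id = manifest_first_row_id
    for data_file in files:
        if data_file.first_row_id is None:
            data_file.first_row_id = next_id
            # Assumption: reserve one ID per record in the file.
            next_id += data_file.record_count
```

Files that already carry a `first_row_id` are left untouched in this sketch, which keeps previously assigned row IDs stable.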

Co-authored-by: Russell Spitzer <[email protected]>

Spec: Update v3 summary to include row lineage.

c824cab

github-actions bot added the Specification label (issues that may introduce spec changes) on May 6, 2025

RussellSpitzer reviewed May 6, 2025


Update from Russell's comments.

9e487ab

RussellSpitzer reviewed May 6, 2025


rdblue and others added 2 commits May 6, 2025 13:53

Update format/spec.md

14c62c8

Co-authored-by: Russell Spitzer <[email protected]>

Clarify assignment / null behavior.

4ebbb78
