Spec: clarify the partition-spec metadata for Avro manifest file #13895
format/spec.md
Outdated
|------------|------------|---------------------|----------------------------------------------------------------------------------------------------|
| _required_ | _required_ | `schema` | JSON representation of the table schema at the time the manifest was written |
| _optional_ | _required_ | `schema-id` | ID of the schema used to write the manifest as a string |
| _required_ | _required_ | `partition-spec` | JSON representation of the partition fields array of the partition spec used to write the manifest |
[doubt-1] How is the spec ID inferred when `partition-spec-id` is absent (since it's optional in v1)?
[discuss] Is it worth calling out that the PartitionSpec is first converted to an unbound partition spec, so what is stored here is the unbound version of each partition field (where the transform is just a string) rather than a Transform object?
The manifest reader in iceberg-core only uses the Avro metadata as a backup, since the manifest file entry (from the manifest list) contains the `partition_spec_id`.
Good call-out on the unbound transform part. Maybe I can add a link to the spec: https://iceberg.apache.org/spec/#partition-specs
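The lookup order described in this reply could be sketched roughly as below. This is not the iceberg-core code: `resolve_spec_id` and the plain-dict shapes for the manifest list entry and the Avro metadata are hypothetical, chosen only to illustrate "manifest list first, Avro metadata as backup".

```python
from typing import Mapping, Optional


def resolve_spec_id(
    manifest_list_entry: Mapping[str, object],
    avro_metadata: Mapping[str, str],
) -> Optional[int]:
    """Hypothetical helper: pick the partition spec ID for a manifest."""
    # Preferred source: the manifest file entry read from the manifest list.
    spec_id = manifest_list_entry.get("partition_spec_id")
    if spec_id is not None:
        return int(spec_id)

    # Backup source: the Avro file metadata, where the ID is stored as a string.
    raw = avro_metadata.get("partition-spec-id")
    if raw is not None:
        return int(raw)

    # Neither source has it; what to do here is left to the caller (assumption).
    return None
```

For example, `resolve_spec_id({"partition_spec_id": 3}, {"partition-spec-id": "1"})` takes the value from the manifest list entry and ignores the Avro metadata.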
I've also seen this in the past. I don't see any harm in making this more explicit 👍
Co-authored-by: Fokko Driesprong <[email protected]>
Could you share who was generating the files? We had a user report this but never figured out the origin.
LGTM too, thanks @stevenzwu !
LGTM
Merged. Thanks @stevenzwu for the PR! Thanks @singhpk234 @Fokko @RussellSpitzer for the review!
In the field, we have seen some writers producing non-conforming metadata in manifest Avro files.
The spec wording probably led to the misinterpretation that this value is the JSON serialization of the whole partition spec. But the spec and the Java reference implementation mean only the partition fields array of the partition spec, since the spec ID is encoded as a separate metadata field in the Avro file.
This PR clarifies this metadata field for manifest Avro files.
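To make the difference concrete, here is a minimal sketch of the two shapes a reader might encounter in the `partition-spec` metadata value. The helper name `partition_fields_from_metadata` and the sample field values are illustrative assumptions, and accepting the whole-spec form is a leniency choice of this sketch, not something the spec requires.

```python
import json


def partition_fields_from_metadata(value: str) -> list:
    """Hypothetical helper: return the partition fields array from the
    `partition-spec` metadata value of a manifest Avro file."""
    parsed = json.loads(value)

    # Conforming form: the value is the JSON array of partition fields.
    if isinstance(parsed, list):
        return parsed

    # Non-conforming form seen in the field: the whole partition spec object,
    # with the fields array nested under "fields".
    if isinstance(parsed, dict) and isinstance(parsed.get("fields"), list):
        return parsed["fields"]

    raise ValueError("unrecognized partition-spec metadata value")


# Sample values (made up for illustration).
fields = [{"source-id": 4, "field-id": 1000, "name": "ts_day", "transform": "day"}]
conforming = json.dumps(fields)
non_conforming = json.dumps({"spec-id": 0, "fields": fields})
```

Both `partition_fields_from_metadata(conforming)` and `partition_fields_from_metadata(non_conforming)` return the same fields array. In the conforming case the spec ID is carried separately, in the `partition-spec-id` metadata key.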