fix(parquet): converting parquet schema with backward compatible repeated struct/primitive with provided arrow schema #8496
…ated struct/primitive with provided arrow schema closes: - apache#8495
…primitive-with-inferred-schema # Conflicts: # parquet/src/arrow/schema/complex.rs
alamb
left a comment
Thank you @rluvaton -- I did a quick review of this PR and the code looks reasonable to me, but I don't understand the legacy inference logic or the problem it addresses, so I can't review this PR fully yet.
Can someone help me out with a link / document that describes what the legacy inferring is? Is it https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/LogicalTypes.md?plain=1#L718-L723
Poking around that file, it looks like @etseidl may know something about this as he authored several commits, for example
parquet/src/arrow/schema/complex.rs
Outdated
/// Converts `self` into an arrow list, with its current type as the field type
/// accept an optional `list_data_type` to specify the type of list to create
///
/// This is used to convert deprecated repeated columns (not in a list), into their arrow representation
Is there any reference we can add a link to? (I am not familiar with the "deprecated repeated columns" you are referencing here.)
parquet/src/arrow/schema/complex.rs
Outdated
    | Some(DataType::LargeList(field_hint))
    | Some(DataType::FixedSizeList(field_hint, _)) => Some(field_hint.as_ref()),
    Some(_) => unreachable!(
        "should be validated earlier that list_data_type is only a type of list"
Even though this should be impossible, I worry about panicking here: if there is a bug, a panic is more severe than "just an error".
Also, while this may be true at the moment, I can imagine some future refactor breaking the invariant, in which case this branch becomes reachable and the compiler won't complain.
I would prefer returning a general_err! with some sort of "Internal error: should be validated..." type message.
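A minimal sketch of the suggestion above, under stated assumptions: `DataType` and `SchemaError` are simplified local stand-ins (not arrow's `DataType` or the parquet crate's error type), and `element_hint` is a hypothetical name. It only illustrates replacing the `unreachable!` arm with a returned error, so that a broken invariant surfaces as an `Err` rather than a panic.

```rust
/// Hypothetical stand-in for the crate's error type (illustration only).
#[derive(Debug, PartialEq)]
enum SchemaError {
    General(String),
}

/// Simplified stand-in for arrow's `DataType`; the real list variants hold a field, not a bare type.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    List(Box<DataType>),
    LargeList(Box<DataType>),
    FixedSizeList(Box<DataType>, i32),
}

/// Extract the element hint from an optional list type hint.
/// A non-list hint yields an error instead of a panic.
fn element_hint(list_data_type: Option<&DataType>) -> Result<Option<&DataType>, SchemaError> {
    match list_data_type {
        None => Ok(None),
        Some(DataType::List(field_hint))
        | Some(DataType::LargeList(field_hint))
        | Some(DataType::FixedSizeList(field_hint, _)) => Ok(Some(field_hint.as_ref())),
        // Previously `unreachable!`: callers are expected to validate earlier,
        // but a future refactor could break that without the compiler noticing.
        Some(other) => Err(SchemaError::General(format!(
            "Internal error: list_data_type should have been validated as a list type, got {:?}",
            other
        ))),
    }
}
```

With this shape the caller can propagate the failure with `?`, and a violated invariant shows up as an ordinary error in tests.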
I will try and find time tomorrow to review this
@alamb Ping :)
I see this one, thank you. Unfortunately I have many other PRs ahead of it in the review queue. As I am not super familiar with this part of the spec, it will likely take me longer to review as well. Maybe someone else who is more familiar can help out.
rep_level,
def_level,
data_type,
treat_repeated_as_list_arrow_hint: true,
- treat_repeated_as_list_arrow_hint: true,
+ treat_repeated_as_list_arrow_hint: context.treat_repeated_as_list_arrow_hint,
This would actually introduce a bug. I added a comment and wrote a test that fails with your suggestion and passes with the current code.
    _ => Field::new(name, data_type, nullable),
};

Ok(field.with_metadata(hint.metadata().clone()))
Should the extension type be added here too, as is done for `None` below?
I.e. something like:
let merged = field.with_metadata(hint.metadata().clone());
try_add_extension_type(merged, parquet_type)
I think we should, but not as part of this PR.
@alamb can you please assign the relevant people so we can merge this 4-month-old PR?
I can't assign anyone as I have no actual authority over anyone's time -- all I can do is beg and/or cajole others into helping.
One thing that would help me review this PR (and perhaps others as well) is a more complete description / documentation of what this code does, with examples. I am not all that familiar with the current schema representation of lists (and not at all familiar with the older one), so it would help to provide context (ideally in comments) for what this PR is doing.
For example, to understand comments like the one below, I need to go to the referenced URL and then try to match the terminology used there to what is used in this repo and PR. While I can do that, it will take time, and time is the thing I seem to have the least of.
/// This is used to convert [deprecated repeated columns] (not in a list), into their arrow representation
Maybe the comment could give an example so that the code is more self-contained, such as the current List representation next to the old (deprecated) representation?
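A rough illustration of the two representations asked about above, based on the Parquet format's LogicalTypes document rather than on this PR's code. The schema strings show the current three-level LIST encoding and a legacy bare repeated field. The types and `convert_bare_int32` are hypothetical stand-ins (not arrow-rs APIs) that model only the spec rule: a repeated field that is neither annotated with nor contained in a LIST or MAP group is read as a required list of required elements.

```rust
/// Current three-level LIST encoding from the Parquet format spec.
const STANDARD_LIST: &str = "
message schema {
  optional group values (LIST) {
    repeated group list {
      optional int32 element;
    }
  }
}";

/// Legacy encoding: a repeated field that is not inside a LIST or MAP group.
const LEGACY_REPEATED: &str = "
message schema {
  repeated int32 values;
}";

/// Simplified stand-ins for illustration; not the parquet or arrow crate types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Repetition {
    Required,
    Optional,
    Repeated,
}

#[derive(Debug, Clone, PartialEq)]
enum ArrowType {
    Int32,
    List { element: Box<ArrowType>, element_nullable: bool },
}

#[derive(Debug, Clone, PartialEq)]
struct ArrowField {
    name: String,
    data_type: ArrowType,
    nullable: bool,
}

/// Convert an int32 field that sits outside any LIST/MAP group.
fn convert_bare_int32(name: &str, repetition: Repetition) -> ArrowField {
    match repetition {
        // Legacy rule: required list of required elements.
        Repetition::Repeated => ArrowField {
            name: name.to_string(),
            data_type: ArrowType::List {
                element: Box::new(ArrowType::Int32),
                element_nullable: false,
            },
            nullable: false,
        },
        // Non-repeated fields stay scalar; only nullability differs.
        Repetition::Optional | Repetition::Required => ArrowField {
            name: name.to_string(),
            data_type: ArrowType::Int32,
            nullable: repetition == Repetition::Optional,
        },
    }
}
```

For instance, `convert_bare_int32("values", Repetition::Repeated)` models how the `LEGACY_REPEATED` column would surface as a non-nullable list of non-nullable `Int32`. In `STANDARD_LIST`, by contrast, both the list and its elements can be null.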
/// An optional [`DataType`] sourced from the embedded arrow schema
data_type: Option<DataType>,

/// Whether to treat repeated types as list from arrow types
Should this also mention that it is for supporting "deprecated list representations" in parquet?
I'm trying to come up to speed on this, but an early observation is that one test (
From the test in question:
// This is a backward-compatible LIST (rule 4) where the struct element contains
// a repeated primitive. The arrow hint specifies that the inner repeated primitive
// should be LargeList<Int32>.
let message_type = "
message schema {
  optional group my_list (LIST) {
    repeated group my_list_tuple {
      required binary str (STRING);
      repeated int32 values;
    }
  }
}
";
That said, plugging the tests from this PR into Other than the test mentioned above, the other failing tests seem like they should succeed.
Which issue does this PR close?
Rationale for this change
Fix reading old parquet files
What changes are included in this PR?
Tests and the fix, but mostly tests.
Are these changes tested?
Yes
Are there any user-facing changes?
No