fix(parquet): converting parquet schema with backward compatible repeated struct/primitive with provided arrow schema#8496

Open
rluvaton wants to merge 21 commits intoapache:mainfrom
rluvaton:fix-reading-backward-compat-repeated-struct-primitive-with-inferred-schema

@rluvaton
Member

@rluvaton rluvaton commented Sep 29, 2025

Which issue does this PR close?

Rationale for this change

Fix reading old parquet files

What changes are included in this PR?

tests and the fix, but mostly tests.

Are these changes tested?

yes

Are there any user-facing changes?

No

…ated struct/primitive with provided arrow schema

closes:
- apache#8495
@github-actions github-actions bot added the parquet Changes to the parquet crate label Sep 29, 2025
@rluvaton rluvaton marked this pull request as ready for review September 29, 2025 18:33
Contributor

@alamb alamb left a comment

Thank you @rluvaton -- I did a quick review of this PR and the code looks reasonable to me, but I don't understand the legacy inferring logic / problem, so I can't fully review this PR yet.

Can someone help me out with a link / document that describes what the legacy inferring is? Is it https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/LogicalTypes.md?plain=1#L718-L723

Poking around that file, it looks like @etseidl may know something about this as he authored several commits, for example

/// Converts `self` into an arrow list, with its current type as the field type
/// accept an optional `list_data_type` to specify the type of list to create
///
/// This is used to convert deprecated repeated columns (not in a list), into their arrow representation
Contributor

Is there any reference we can add a link to? (I am not familiar with the "deprecated repeated columns" you are referencing here.)

Member Author

@rluvaton rluvaton Oct 8, 2025

exactly, updated

| Some(DataType::LargeList(field_hint))
| Some(DataType::FixedSizeList(field_hint, _)) => Some(field_hint.as_ref()),
Some(_) => unreachable!(
"should be validated earlier that list_data_type is only a type of list"
Contributor

Even though this should be impossible, I worry about panicking here, because if there is a bug, a panic is more severe than "just an error".

Also, while this may be true at the moment, I can imagine some future refactor breaking that assumption, in which case this branch may become reachable and the compiler won't complain.

I would prefer returning a general_err! with some sort of "Internal error: should be validated..." type message.
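
A minimal sketch of the shape this suggestion could take. It uses stand-in types rather than the real arrow `DataType` and parquet error types: `SketchType`, `SketchError`, and `list_field_hint` are hypothetical names introduced only for illustration, and the real code would use `general_err!` instead.

```rust
use std::fmt;

/// Hypothetical stand-in for the subset of arrow's `DataType` relevant here.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum SketchType {
    Int32,
    List(Box<SketchType>),
    LargeList(Box<SketchType>),
    FixedSizeList(Box<SketchType>, i32),
}

/// Hypothetical stand-in for a general parquet error.
#[derive(Debug, PartialEq)]
struct SketchError(String);

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Parquet error: {}", self.0)
    }
}

/// Returns the element hint of a list-like type. A non-list type that slips
/// past earlier validation produces an error instead of a panic.
fn list_field_hint(
    list_data_type: Option<&SketchType>,
) -> Result<Option<&SketchType>, SketchError> {
    match list_data_type {
        None => Ok(None),
        Some(SketchType::List(inner))
        | Some(SketchType::LargeList(inner))
        | Some(SketchType::FixedSizeList(inner, _)) => Ok(Some(&**inner)),
        Some(other) => Err(SketchError(format!(
            "Internal error: list_data_type should have been validated as a list type, got {:?}",
            other
        ))),
    }
}

fn main() {
    let large = SketchType::LargeList(Box::new(SketchType::Int32));
    println!("{:?}", list_field_hint(Some(&large)));
    println!("{:?}", list_field_hint(Some(&SketchType::Int32)));
}
```

With this shape, a future refactor that breaks the "validated earlier" assumption surfaces as an `Err` the caller can report, rather than aborting the process.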

Member Author

updated

@alamb
Contributor

alamb commented Oct 8, 2025

I will try and find time tomorrow to review this

@rluvaton
Member Author

@alamb Ping :)

@alamb
Contributor

alamb commented Oct 28, 2025

I see this one, thank you - unfortunately I have many other PRs ahead of it in the review queue.

As I am not super familiar with this part of the spec, it will likely take me longer to review as well. Maybe someone else who is more familiar can help out.

rep_level,
def_level,
data_type,
treat_repeated_as_list_arrow_hint: true,
Member

Suggested change
- treat_repeated_as_list_arrow_hint: true,
+ treat_repeated_as_list_arrow_hint: context.treat_repeated_as_list_arrow_hint,

Member Author

This would actually introduce a bug. I added a comment and wrote a test that fails with your suggestion and passes with the current code.

_ => Field::new(name, data_type, nullable),
};

Ok(field.with_metadata(hint.metadata().clone()))
Member

Should the extension type be added here too, as is done for the None case below?
I.e. something like:

Suggested change
let merged = field.with_metadata(hint.metadata().clone());
try_add_extension_type(merged, parquet_type)

Member Author

I think we should, but not as part of this PR.

@rluvaton
Member Author

rluvaton commented Feb 4, 2026

I see this one, thank you - unfortunately I have many other PRs ahead of it in the review queue.

As I am not super familiar with this part of the spec, it will likely take me longer to review as well. Maybe someone else who is more familiar can help out.

@alamb, can you please assign the relevant people so we can merge this 4-month-old PR?

@alamb
Contributor

alamb commented Feb 11, 2026

@alamb, can you please assign the relevant people so we can merge this 4-month-old PR?

I can't assign anyone as I have no actual authority over anyone's time -- All I can do is to beg and/or cajole others into helping do so.

One thing that would help me review this PR (and perhaps others as well) is a more complete description / documentation of what this code does, with examples, as I am not all that familiar with the current schema representation of lists (and not at all familiar with the older one). It would help to provide context (ideally in comments) on what this PR is doing.

For example, for me to understand comments like this, I need to go into the referenced URL and then try to match the terminology used there to what is used in this repo and PR. While I can do that, it will take time, and time is the thing I seem to have the least of.

    /// This is used to convert [deprecated repeated columns] (not in a list), into their arrow representation

Maybe it could give an example so that the code is more self-contained. For instance, an example of the current List representation and the old (deprecated) representation?
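
As one possible illustration of the two shapes being asked about, the sketch below holds both as Parquet schema text in the style of the tests in this PR. The field names and constant names are made up for the example. The annotated form follows the three-level LIST layout described in the Parquet `LogicalTypes.md`, and the deprecated form is a bare `repeated` field with no annotation.

```rust
/// Current (annotated) representation: a three-level LIST group.
/// Outer group carries the annotation, the middle `repeated group list`
/// holds the repetition, and `element` is the value.
const ANNOTATED_LIST: &str = "
message schema {
    optional group values (LIST) {
        repeated group list {
            optional int32 element;
        }
    }
}
";

/// Deprecated representation: an unannotated repeated field. Per the Parquet
/// spec, readers interpret it as a required list of required elements.
const UNANNOTATED_REPEATED: &str = "
message schema {
    repeated int32 values;
}
";

fn main() {
    println!("annotated:{}", ANNOTATED_LIST);
    println!("unannotated (deprecated):{}", UNANNOTATED_REPEATED);
}
```

Both shapes describe a list of `int32`. The difference is only in how the repetition is encoded in the Parquet schema, which is the case the hint-handling logic in this PR has to recognize.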

/// An optional [`DataType`] sourced from the embedded arrow schema
data_type: Option<DataType>,

/// Whether to treat repeated types as list from arrow types
Contributor

Should this also mention that it is for supporting "deprecated list representations" in parquet?

@etseidl
Contributor

etseidl commented Feb 12, 2026

I'm trying to come up to speed on this, but an early observation is that one test (backward_compat_list_struct_with_nested_repeated_primitive_respects_arrow_hint) appears to violate this line in the spec:

For all fields in the schema, implementations should use either LIST and MAP annotations or unannotated repeated fields, but not both. When using the annotations, no unannotated repeated types are allowed.

From the test in question:

        // This is a backward-compatible LIST (rule 4) where the struct element contains
        // a repeated primitive. The arrow hint specifies that the inner repeated primitive
        // should be LargeList<Int32>.
        let message_type = "
            message schema {
                optional group my_list (LIST) {
                    repeated group my_list_tuple {
                        required binary str (STRING);
                        repeated int32 values;
                    }
                }
            }
        ";

That said, plugging the tests from this PR into main without the fix yields

failures:
    arrow::schema::complex::tests::backward_compat_list_struct_with_nested_repeated_primitive_respects_arrow_hint
    arrow::schema::complex::tests::convert_schema_with_nested_repeated_struct_and_primitives
    arrow::schema::complex::tests::convert_schema_with_repeated_primitive_should_use_inferred_schema
    arrow::schema::complex::tests::convert_schema_with_repeated_primitive_should_use_inferred_schema_for_list_as_well
    arrow::schema::complex::tests::convert_schema_with_repeated_struct_and_inferred_schema
    arrow::schema::complex::tests::convert_schema_with_repeated_struct_and_inferred_schema_and_field_id

test result: FAILED. 24 passed; 6 failed; 0 ignored; 0 measured; 918 filtered out; finished in 0.64s

Other than the test mentioned above, the other failing tests seem like they should succeed.
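
To make the concern about that one test concrete, here is a rough sketch of one reading of the quoted spec sentence, applied to a hand-built model of the schema above. `Node`, `Usage`, `scan`, `mixes_both_styles`, and `schema_from_test` are hypothetical helpers written for this illustration and are not part of the parquet crate. The sketch assumes a repeated field counts as "unannotated" when neither it nor its direct parent carries a LIST or MAP annotation.

```rust
/// Hypothetical, simplified model of a Parquet schema node.
#[allow(dead_code)]
#[derive(Debug)]
enum Node {
    Primitive { name: &'static str, repeated: bool },
    Group {
        name: &'static str,
        repeated: bool,
        /// True when the group carries a LIST or MAP annotation.
        list_or_map: bool,
        children: Vec<Node>,
    },
}

#[derive(Debug, Default, PartialEq)]
struct Usage {
    annotated_groups: usize,
    unannotated_repeated: usize,
}

/// Walks the tree, counting annotated groups and repeated fields that sit
/// outside any directly enclosing LIST/MAP annotation.
fn scan(node: &Node, parent_annotated: bool, usage: &mut Usage) {
    match node {
        Node::Primitive { repeated, .. } => {
            if *repeated && !parent_annotated {
                usage.unannotated_repeated += 1;
            }
        }
        Node::Group { repeated, list_or_map, children, .. } => {
            if *list_or_map {
                usage.annotated_groups += 1;
            } else if *repeated && !parent_annotated {
                usage.unannotated_repeated += 1;
            }
            for child in children {
                scan(child, *list_or_map, usage);
            }
        }
    }
}

/// True when a schema uses both annotated and unannotated repeated styles.
fn mixes_both_styles(root: &Node) -> bool {
    let mut usage = Usage::default();
    scan(root, false, &mut usage);
    usage.annotated_groups > 0 && usage.unannotated_repeated > 0
}

/// Hand-built model of the schema from the test quoted above.
fn schema_from_test() -> Node {
    Node::Group {
        name: "schema",
        repeated: false,
        list_or_map: false,
        children: vec![Node::Group {
            name: "my_list",
            repeated: false,
            list_or_map: true,
            children: vec![Node::Group {
                name: "my_list_tuple",
                repeated: true,
                list_or_map: false,
                children: vec![
                    Node::Primitive { name: "str", repeated: false },
                    Node::Primitive { name: "values", repeated: true },
                ],
            }],
        }],
    }
}

fn main() {
    println!("mixes both styles: {}", mixes_both_styles(&schema_from_test()));
}
```

Under that reading, `my_list` is annotated while `values` is an unannotated repeated field inside it, so the schema mixes both styles.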

Labels

parquet Changes to the parquet crate

Development

Successfully merging this pull request may close these issues.

providing the same schema that read from backward compatible parquet fails: incompatible arrow schema, expected struct got List

4 participants