How should I handle arrays of structs? #17

@dandexcare

How do I handle a field whose values are arrays of structs, like the following? I will send a full example shortly:

df.with_columns(pl.struct('aggregate_ratings.sub_ratings').alias('aggregate_ratings.sub_ratings').map(to_json, return_dtype=pl.Utf8))

import pyarrow as pa

# Element type of the list column: one struct per sub-rating.
sub_rating = pa.struct([
    pa.field("average_rating", pa.float64()),
    pa.field("crawled_date", pa.large_string()),
])
names = ["id", "complicated"]
ids = pa.array([1, 2, 3], type=pa.int32())
# Each row holds a list of structs.
complicated = pa.array(
    [
        [{'average_rating': 4.9, 'crawled_date': '2023-06-06'}, {'average_rating': 4.7, 'crawled_date': '2023-06-04'}],
        [{'average_rating': 4.8, 'crawled_date': '2023-05-06'}, {'average_rating': 4.6, 'crawled_date': '2023-05-04'}],
        [{'average_rating': 4.7, 'crawled_date': '2023-04-06'}, {'average_rating': 4.5, 'crawled_date': '2023-04-04'}],
    ],
    type=pa.list_(sub_rating),
)
# from_arrays takes either names or a schema, not both; the schema already carries the names.
df = pa.RecordBatch.from_arrays(
    [ids, complicated],
    schema=pa.schema(
        [
            pa.field(names[0], pa.int32()),
            pa.field(names[1], pa.list_(sub_rating)),
        ]
    ),
).to_pandas()

print(df)

I made an attempt with the following encoder, but it fails on the copy because the output type is Jsonb() instead of jsonb[]:

encoder = ArrowToPostgresBinaryEncoder.new_with_encoders(
    schema,
    {
        'main_key': Int32EncoderBuilder(schema.field('main_key')),
        'aggregate_ratings.sub_ratings': LargeStringEncoderBuilder.new_with_output(
            schema[schema.get_field_index('aggregate_ratings.sub_ratings')],
            Jsonb()
        ),
    }
)

Error:

psycopg.errors.QueryCanceled: COPY from stdin failed: error from Python: PanicException - called `Result::unwrap()` on an `Err` value: ColumnTypeMismatch { field: "aggregate_ratings.sub_ratings", expected: "arrow_array::array::byte_array::GenericByteArray<arrow_array::types::GenericStringType<i64>>", actual: LargeList(Field { name: "item", data_type: Struct([Field { name: "average_rating", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "crawled_date", data_type: LargeUtf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "metric", data_type: LargeUtf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) }
