
Edge case with row filtering and null values #956

Open
yohplala wants to merge 8 commits into dask:main from yohplala:row-filter-and-nulls

Conversation

@yohplala commented Mar 26, 2025

Bug under investigation when combining row filtering and nulls in a DataFrame.
Description is in ticket #957

@martindurant (Member)

Thanks for posting. Are you hoping to also provide the fix?

@yohplala (Author)

> Thanks for posting. Are you hoping to also provide the fix?

I have already spent a couple of hours analyzing this, but I don't have a deep enough understanding of this part of the code. I will try to spend more time on it in the coming days, but I would gladly accept any help.
I fear that this specific case needs specific code.
Do you have any idea where this could come from?

If you already have the fix in mind, please, do not hesitate to post! :)

@yohplala (Author) commented Mar 27, 2025

I may be wrong, but I am starting to think that the bug may lie in read_data_page.

I have added a print statement in read_col, just after read_data_page is called, like this:

        defi, rep, val = read_data_page(infile, schema_helper, ph, cmd,
                                        skip_nulls, selfmade=selfmade)
        print(f"val: {val}")
        max_defi = schema_helper.max_definition_level(cmd.path_in_schema)
        if isinstance(row_filter, np.ndarray):
            print("Before filtering:")
            print("row_filter")
            print(row_filter)
            io = index_off + len(val)  # will be new index_off
            if row_filter[index_off:index_off+len(val)].sum() == 0:
                num += len(defi) if defi is not None else len(val)
                print("continue statement in row_filter management")
                continue

The outputs are (only showing for cat_col):

val: [1 3 4]
Before filtering:
row_filter
[False False False  True]
continue statement in row_filter management

My understanding:

  • Basically, we are hitting the continue before any value gets assigned.

  • We can see that the sum over the (sliced) row_filter is 0, which is why we hit the continue statement, and I suspect this is wrong.

  • The val we get from read_data_page has only 3 values, and I think it should have all 4 at this stage (why would the NaN not be in val?)
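To make the mismatch concrete, here is a small NumPy sketch using the arrays from the debug output above. The defi array is illustrative (0 marking the NULL row); the point is only that val is shorter than the logical row count, so slicing row_filter by len(val) drops the very row the filter keeps:

```python
import numpy as np

# Illustrative reconstruction: 4 logical rows, the last one NULL,
# so val carries only the 3 non-null values.
defi = np.array([1, 1, 1, 0])                      # 0 => NULL at that row
val = np.array([1, 3, 4])                          # non-null values only
row_filter = np.array([False, False, False, True])
index_off = 0

# Slicing row_filter by len(val) instead of the logical row count
# (len(defi)) misses the last row, so the premature `continue` fires:
print(row_filter[index_off:index_off + len(val)].sum())   # 0 -> page wrongly skipped
print(row_filter[index_off:index_off + len(defi)].sum())  # 1 -> a row should be kept
```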

@martindurant (Member)

Indeed, I came to the same conclusion :)

Fix is pushed: just remove the premature optimisation. The case of there being no good values, but the row-group as a whole still being included will be unusual, and it's not even worth the effort to test for it.

@martindurant (Member)

> The val we get from read_data_page has only 3 values, and I think it should have all 4 at this stage (why would the NaN not be in val?)

Actually this makes sense: defi holds the NULL/not-NULL mapping, but that was mistakenly not accounted for in this if.

@yohplala (Author)

Thanks a lot, Martin!
I will be able to work on this again at the beginning of next week. I see the branch with datapage v2 also needs a fix.
Best,

@yohplala (Author)

@martindurant, I have again spent some time on this test case, which does not pass with datapage v2, without making real progress. Any help would be great.

Error is (in core.py, function read_data_page_v2):

            if bit_width in [8, 16, 32] and selfmade:
                # special fastpath for cats
                outbytes = raw_bytes[pagefile.tell():]
                if len(outbytes) == assign[num:num+data_header2.num_values].nbytes:
                    assign[num:num+data_header2.num_values].view('uint8')[row_filter] = outbytes[row_filter]
                else:
                    if data_header2.num_nulls == 0:
                        assign[num:num+data_header2.num_values][row_filter] = outbytes[row_filter]
                    else:
                        if row_filter is Ellipsis:
                            assign[num:num+data_header2.num_values][~nulls] = outbytes
                        else:
>                           assign[num:num + data_header2.num_values][~nulls[row_filter]] = outbytes[~nulls * row_filter]
E                           IndexError: boolean index did not match indexed array along dimension 0; dimension is 3 but corresponding boolean dimension is 4

I think I had identified the problem and corrected this line to (change in this other testing branch):

                        assign[num:num + data_header2.num_values][~nulls[row_filter]] = outbytes[row_filter[~nulls]]
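Assuming outbytes holds only the non-null values (as the debug output above suggests), a minimal NumPy sketch with illustrative data shows why the original selector raises the IndexError and why the corrected one is shape-consistent:

```python
import numpy as np

# Illustrative data: 4 logical rows, one NULL, so outbytes has 3 entries.
nulls = np.array([False, False, True, False])
row_filter = np.array([False, True, False, True])
outbytes = np.array([10, 20, 30])

# Original RHS selector: ~nulls * row_filter is a 4-long boolean mask,
# but outbytes only has 3 elements -> IndexError.
try:
    outbytes[~nulls * row_filter]
except IndexError as exc:
    print(exc)

# Corrected RHS selector: restrict row_filter to the non-null slots
# first, giving a 3-long mask that matches outbytes:
print(outbytes[row_filter[~nulls]])  # [20 30]
```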

but now the problem seems to move to another part of the read_data_page_v2 function (logs):

        elif data_header2.encoding == parquet_thrift.Encoding.PLAIN:
            # PLAIN, but with nulls or not in-place conversion
            codec = cmd.codec if data_header2.is_compressed else "UNCOMPRESSED"
            raw_bytes = decompress_data(np.frombuffer(infile.read(size), "uint8"),
                                        uncompressed_page_size, codec)
            values = read_plain(raw_bytes,
                                cmd.type,
                                n_values,
                                width=se.type_length,
                                utf=se.converted_type == 0)
            if data_header2.num_nulls:
                if nullable:
>                   assign[num:num+data_header2.num_values][~nulls[row_filter]] = convert(values, se)[row_filter]
E                   IndexError: boolean index did not match indexed array along dimension 0; dimension is 3 but corresponding boolean dimension is 4

I have tried to correct this line as well, but then other test cases break, which makes me wonder whether the bug really is here, or whether some if branch is missing.
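The same mask-length mismatch appears to be at play here: read_plain returned only the non-null values, while row_filter still spans all logical rows. The sketch below (illustrative arrays) demonstrates the dimension error only, not a fix, since as noted a simple reindex here breaks other cases:

```python
import numpy as np

# Illustrative arrays: values excludes the NULL slot, so a 4-long
# row_filter cannot index the 3-long values array.
values = np.array([1.0, 3.0, 4.0])                 # 3 non-null values
nulls = np.array([False, False, True, False])      # 4 logical rows
row_filter = np.array([True, False, False, True])

try:
    values[row_filter]  # 4-long boolean mask on a 3-long array
except IndexError as exc:
    print(exc)

# A shape-consistent selector restricts row_filter to non-null slots:
print(values[row_filter[~nulls]])  # [1. 4.]
```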

Please, would you have any advice?

@yohplala yohplala closed this Sep 22, 2025
@yohplala yohplala deleted the row-filter-and-nulls branch September 22, 2025 07:32
@yohplala yohplala restored the row-filter-and-nulls branch September 22, 2025 07:35
@yohplala yohplala reopened this Sep 22, 2025