
Edge case with row filtering and null values #956

Open
yohplala wants to merge 8 commits into dask:main from yohplala:row-filter-and-nulls

Conversation

@yohplala commented Mar 26, 2025

Bug under investigation when combining row filtering and nulls in a DataFrame.
Description is in ticket #957

@martindurant (Member)

Thanks for posting. Are you hoping to also provide the fix?

@yohplala (Author)

> Thanks for posting. Are you hoping to also provide the fix?

I have already spent a couple of hours analyzing this, but I don't have a deep enough understanding of this part of the code. I will try to spend more time on it in the coming days, but I would gladly accept any help.
I fear that this specific case needs specific code.
Do you have any idea where this could come from?

If you already have the fix in mind, please, do not hesitate to post! :)

@yohplala (Author) commented Mar 27, 2025

I may be wrong, but I am starting to think that the bug may lie in read_data_page.

I have added a print statement in read_col, just after read_data_page is called, like this:

        defi, rep, val = read_data_page(infile, schema_helper, ph, cmd,
                                        skip_nulls, selfmade=selfmade)
        print(f"val: {val}")
        max_defi = schema_helper.max_definition_level(cmd.path_in_schema)
        if isinstance(row_filter, np.ndarray):
            print("Before filtering:")
            print("row_filter")
            print(row_filter)
            io = index_off + len(val)  # will be new index_off
            if row_filter[index_off:index_off+len(val)].sum() == 0:
                num += len(defi) if defi is not None else len(val)
                print("continue statement in row_filter management")
                continue

The outputs are (only showing for cat_col):

val: [1 3 4]
Before filtering:
row_filter
[False False False  True]
continue statement in row_filter management

My understanding:

  • Basically, we are hitting the continue before any value gets assigned.

  • We can see that the sum over the (sliced) row_filter is 0, which is why we hit the continue statement, and I suspect this is wrong.

  • The val we get from read_data_page has only 3 values, and I think it should have all 4 at this stage (why would the NaN not be in val?)
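To make the mismatch concrete, here is a small NumPy sketch using the arrays from the debug output above. The defi array is illustrative (0 marking the NULL row); the point is only that val is shorter than the logical row count, so slicing row_filter by len(val) drops the very row the filter keeps:

```python
import numpy as np

# Illustrative reconstruction: 4 logical rows, the last one NULL,
# so val carries only the 3 non-null values.
defi = np.array([1, 1, 1, 0])                      # 0 => NULL at that row
val = np.array([1, 3, 4])                          # non-null values only
row_filter = np.array([False, False, False, True])
index_off = 0

# Slicing row_filter by len(val) instead of the logical row count
# (len(defi)) misses the last row, so the premature `continue` fires:
print(row_filter[index_off:index_off + len(val)].sum())   # 0 -> page wrongly skipped
print(row_filter[index_off:index_off + len(defi)].sum())  # 1 -> a row should be kept
```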

@martindurant (Member)

Indeed, I came to the same conclusion :)

Fix is pushed: just remove the premature optimisation. The case of there being no good values, but the row-group as a whole still being included will be unusual, and it's not even worth the effort to test for it.

@martindurant (Member)

> The val we get from read_data_page has only 3 values, and I think it should have all 4 at this stage (why would the NaN not be in val?)

Actually this makes sense: defi holds the NULL/not-NULL mapping, but that was mistakenly not accounted for in this if.

@yohplala (Author)

Thanks a lot, Martin!
I will be able to work on this again at the beginning of next week. I see the branch with datapage v2 also needs a fix.
Best,

@yohplala (Author)

@martindurant, I have again spent some time on this test case, which does not pass with datapage v2, without making real progress. Any help would be great.

Error is (in core.py, function read_data_page_v2):

            if bit_width in [8, 16, 32] and selfmade:
                # special fastpath for cats
                outbytes = raw_bytes[pagefile.tell():]
                if len(outbytes) == assign[num:num+data_header2.num_values].nbytes:
                    assign[num:num+data_header2.num_values].view('uint8')[row_filter] = outbytes[row_filter]
                else:
                    if data_header2.num_nulls == 0:
                        assign[num:num+data_header2.num_values][row_filter] = outbytes[row_filter]
                    else:
                        if row_filter is Ellipsis:
                            assign[num:num+data_header2.num_values][~nulls] = outbytes
                        else:
>                           assign[num:num + data_header2.num_values][~nulls[row_filter]] = outbytes[~nulls * row_filter]
E                           IndexError: boolean index did not match indexed array along dimension 0; dimension is 3 but corresponding boolean dimension is 4

I think I had identified the problem and corrected this line to (change in this other testing branch):

                        assign[num:num + data_header2.num_values][~nulls[row_filter]] = outbytes[row_filter[~nulls]]
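Assuming outbytes holds only the non-null values (as the debug output above suggests), a minimal NumPy sketch with illustrative data shows why the original selector raises the IndexError and why the corrected one is shape-consistent:

```python
import numpy as np

# Illustrative data: 4 logical rows, one NULL, so outbytes has 3 entries.
nulls = np.array([False, False, True, False])
row_filter = np.array([False, True, False, True])
outbytes = np.array([10, 20, 30])

# Original RHS selector: ~nulls * row_filter is a 4-long boolean mask,
# but outbytes only has 3 elements -> IndexError.
try:
    outbytes[~nulls * row_filter]
except IndexError as exc:
    print(exc)

# Corrected RHS selector: restrict row_filter to the non-null slots
# first, giving a 3-long mask that matches outbytes:
print(outbytes[row_filter[~nulls]])  # [20 30]
```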

but now the problem seems to move to another part of the read_data_page_v2 function (logs):

        elif data_header2.encoding == parquet_thrift.Encoding.PLAIN:
            # PLAIN, but with nulls or not in-place conversion
            codec = cmd.codec if data_header2.is_compressed else "UNCOMPRESSED"
            raw_bytes = decompress_data(np.frombuffer(infile.read(size), "uint8"),
                                        uncompressed_page_size, codec)
            values = read_plain(raw_bytes,
                                cmd.type,
                                n_values,
                                width=se.type_length,
                                utf=se.converted_type == 0)
            if data_header2.num_nulls:
                if nullable:
>                   assign[num:num+data_header2.num_values][~nulls[row_filter]] = convert(values, se)[row_filter]
E                   IndexError: boolean index did not match indexed array along dimension 0; dimension is 3 but corresponding boolean dimension is 4

I have tried to correct this line as well, but then other test cases break, which makes me wonder whether the bug really is here, or whether some if branch is missing.
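The same mask-length mismatch appears to be at play here: read_plain returned only the non-null values, while row_filter still spans all logical rows. The sketch below (illustrative arrays) demonstrates the dimension error only, not a fix, since as noted a simple reindex here breaks other cases:

```python
import numpy as np

# Illustrative arrays: values excludes the NULL slot, so a 4-long
# row_filter cannot index the 3-long values array.
values = np.array([1.0, 3.0, 4.0])                 # 3 non-null values
nulls = np.array([False, False, True, False])      # 4 logical rows
row_filter = np.array([True, False, False, True])

try:
    values[row_filter]  # 4-long boolean mask on a 3-long array
except IndexError as exc:
    print(exc)

# A shape-consistent selector restricts row_filter to non-null slots:
print(values[row_filter[~nulls]])  # [1. 4.]
```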

Please, would you have any advice?

@yohplala yohplala closed this Sep 22, 2025
@yohplala yohplala deleted the row-filter-and-nulls branch September 22, 2025 07:32
@yohplala yohplala restored the row-filter-and-nulls branch September 22, 2025 07:35
@yohplala yohplala reopened this Sep 22, 2025