Edge case with row filtering and null values. #956
Thanks for posting. Are you hoping to also provide the fix?

I have already spent a couple of hours analyzing this, but I don't have a deep enough understanding of this part of the code. I will try to spend more time on it in the coming days, but I would gladly accept any help. If you already have a fix in mind, please do not hesitate to post! :)
I may be wrong, but I am starting to think that the bug may lie in this block. I have added print statements:

```python
defi, rep, val = read_data_page(infile, schema_helper, ph, cmd,
                                skip_nulls, selfmade=selfmade)
print(f"val: {val}")
max_defi = schema_helper.max_definition_level(cmd.path_in_schema)
if isinstance(row_filter, np.ndarray):
    print("Before filtering:")
    print("row_filter")
    print(row_filter)
    io = index_off + len(val)  # will be new index_off
    if row_filter[index_off:index_off+len(val)].sum() == 0:
        num += len(defi) if defi is not None else len(val)
        print("continue statement in row_filter management")
        continue
```

The outputs are (only showing for `cat_col`):

```
val: [1 3 4]
Before filtering:
row_filter
[False False False True]
continue statement in row_filter management
```

My understanding:
Indeed, I came to the same conclusion :) Fix is pushed: just remove the premature optimisation. The case of there being no good values while the row group as a whole is still included will be unusual, and it's not even worth the effort to test for it.
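For readers following along, here is a minimal numpy sketch of the behaviour after removing the optimisation (hypothetical helper, not fastparquet's actual code): the page is processed even when its slice of `row_filter` is all `False`, so the offset bookkeeping always advances by the page length.

```python
import numpy as np

# Hypothetical sketch of the per-page filtering step after removing the
# early "continue": every page is processed, even if its slice of
# row_filter is all False, so index_off always advances by len(val).
def filter_page(val, row_filter, index_off):
    page_mask = row_filter[index_off:index_off + len(val)]
    kept = val[page_mask]              # may be empty, which is fine
    return kept, index_off + len(val)  # new index_off

# Values from the debug output above: a 3-value page, filter keeps row 3
row_filter = np.array([False, False, False, True])
kept, index_off = filter_page(np.array([1, 3, 4]), row_filter, 0)
print(len(kept), index_off)  # 0 3
```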
Actually, this makes sense.

Thanks a lot Martin!
@martindurant, I have again spent some time on this test case, which is not passing with data page v2, without making real progress. Any help would be great. The error is:

```python
if bit_width in [8, 16, 32] and selfmade:
    # special fastpath for cats
    outbytes = raw_bytes[pagefile.tell():]
    if len(outbytes) == assign[num:num+data_header2.num_values].nbytes:
        assign[num:num+data_header2.num_values].view('uint8')[row_filter] = outbytes[row_filter]
    else:
        if data_header2.num_nulls == 0:
            assign[num:num+data_header2.num_values][row_filter] = outbytes[row_filter]
        else:
            if row_filter is Ellipsis:
                assign[num:num+data_header2.num_values][~nulls] = outbytes
            else:
>               assign[num:num + data_header2.num_values][~nulls[row_filter]] = outbytes[~nulls * row_filter]
E               IndexError: boolean index did not match indexed array along dimension 0; dimension is 3 but corresponding boolean dimension is 4
```

I think I had identified the problem, and corrected this row into (change in this other testing branch):

```python
assign[num:num + data_header2.num_values][~nulls[row_filter]] = outbytes[row_filter[~nulls]]
```

but now the problem seems to move to another part:

```python
elif data_header2.encoding == parquet_thrift.Encoding.PLAIN:
    # PLAIN, but with nulls or not in-place conversion
    codec = cmd.codec if data_header2.is_compressed else "UNCOMPRESSED"
    raw_bytes = decompress_data(np.frombuffer(infile.read(size), "uint8"),
                                uncompressed_page_size, codec)
    values = read_plain(raw_bytes,
                        cmd.type,
                        n_values,
                        width=se.type_length,
                        utf=se.converted_type == 0)
    if data_header2.num_nulls:
        if nullable:
>           assign[num:num+data_header2.num_values][~nulls[row_filter]] = convert(values, se)[row_filter]
E           IndexError: boolean index did not match indexed array along dimension 0; dimension is 3 but corresponding boolean dimension is 4
```

I have tried to correct this line as well, but then other test cases break, which makes me wonder whether the bug is really here or somewhere else. Please, would you have any advice?
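The shape mismatch in those tracebacks can be reproduced in isolation with a toy numpy example. The array contents below are taken from the debug output earlier in this thread (4 rows, one null, filter keeping only the last row); the variable names mirror the traceback, but the setup itself is illustrative:

```python
import numpy as np

# 4 rows in the page, one of them null, filter keeps only the last row
nulls = np.array([False, True, False, False])
row_filter = np.array([False, False, False, True])
values = np.array([1, 3, 4])  # only the 3 non-null values get decoded

# The failing expression builds a length-4 mask for a length-3 array:
try:
    values[~nulls * row_filter]
except IndexError as err:
    print(err)  # boolean mask has length 4, indexed array has length 3

# Restricting the filter to the non-null positions instead gives a
# length-3 mask, matching the shape of the decoded values:
mask = row_filter[~nulls]
print(mask, values[mask])
```

This matches the corrected indexing (`row_filter[~nulls]` on the right-hand side) tried in the testing branch above, and shows why the original `~nulls * row_filter` mask cannot be applied to the non-null values directly.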
Bug under investigation when combining row filtering and nulls in a DataFrame.
The description is in ticket #957.