-
Notifications
You must be signed in to change notification settings - Fork 3.2k
Compute min/max for too long orc string columns #11652
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Compute min/max for too long orc string columns #11652
Conversation
53ca66e
to
b38c22e
Compare
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
return (b & 0xC0) != 0x80; | ||
} | ||
|
||
private static byte[] calculateCharToAppend(int utf8CodePoint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have this logic in Slice / SliceUtf8 or somewhere.
94f8b84
to
f87e11c
Compare
@@ -74,7 +75,8 @@ public OrcWriterOptions() | |||
DEFAULT_MAX_STRING_STATISTICS_LIMIT, | |||
DEFAULT_MAX_COMPRESSION_BUFFER_SIZE, | |||
ImmutableSet.of(), | |||
DEFAULT_BLOOM_FILTER_FPP); | |||
DEFAULT_BLOOM_FILTER_FPP, | |||
false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the default? i think it should be true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm ok I can change this but I didn't want to change current behaviour for all connectors that are using trino-orc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new behavior is just better
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
throw new IllegalArgumentException("Provided byte array is not a valid utf8 string"); | ||
} | ||
|
||
private static boolean isUtfBlockStartChar(byte b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
! io.airlift.slice.SliceUtf8#isContinuationByte
instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a private method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right. let's copy it and mark with a comment it's copied from there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't it cleaner to have a method that checks the condition we want instead of having one which result has to be inverted/negated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this one was copied for orc-core
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wanted isContinuationByte
because it's a somewhat familar thing for our codebase (even if it's a private method in arilift)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, will do
d9d5109
to
9deaca9
Compare
if (maxBytes == 0) { | ||
return null; | ||
} | ||
int lastIndex = findLastCharacterInRange(slice, maxBytes); | ||
return slice.slice(0, lastIndex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can end up returning empty bytes array (empty slice), so why are we special-casing the maxBytes==0 case above
(eg consider an input being 4-byte utf8 sequence, and maxBytes=3; please add a test)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I had a check here for a situation when maxBytes is longer than provided byte array but you wrote that it is a dead code. It will obviously fell now for situations when slice is longer than maxBytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I had a check here for a situation when maxBytes is longer than provided byte array
yes, but that's not the situation i'm concerned about here
if (maxBytes == 0) { | ||
return null; | ||
} | ||
int firstRemovedCharacterIndex = findLastCharacterInRange(slice, maxBytes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will throw (IllegalArgumentException("Provided byte array is not a valid utf8 string")
) when input is eg 4-byte utf8 sequence, and maxBytes=3 (please add a test)
let's
- change so that
findLastCharacterInRange
returns-1
(or perhaps Optional.empty) in such case- return early when we get -1 here
- remove
if (maxBytes == 0)
above, as it becomes obsolete
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok but this will look very ugly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
propose something else?
CI looks related
|
CI is related and I already spent 2 hours trying to fix it ;) |
9deaca9
to
746e876
Compare
Description
it is a change to internal trino library
It allows to create better statistics for ORC files that contains column of type VARCHAR. In some specific cases it may slightly improve performance of reading orc tables.
Documentation
(.) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
(.) No release notes entries required.
( ) Release notes entries required with the following suggested text: