Compute min/max for too long orc string columns #11652

Conversation

@homar (Member) commented Mar 24, 2022

Description

Is this change a fix, improvement, new feature, refactoring, or other?
improvement
Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

It is a change to an internal Trino library.

How would you describe this change to a non-technical end user or system administrator?

It allows better statistics to be created for ORC files that contain columns of type VARCHAR. In some specific cases it may slightly improve the performance of reading ORC tables.

Documentation

(.) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(.) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Mar 24, 2022
@homar homar force-pushed the homar/compute_min_max_for_too_long_string_orc_column branch 2 times, most recently from 53ca66e to b38c22e Compare March 24, 2022 21:45
return (b & 0xC0) != 0x80;
}

private static byte[] calculateCharToAppend(int utf8CodePoint)
Member

I think we have this logic in Slice / SliceUtf8 or somewhere.
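
For reference, encoding a single code point as UTF-8 needs nothing beyond the JDK. The sketch below is illustrative only: the class `CodePointUtf8` and the method `encodeCodePoint` are hypothetical names, and this is neither the PR's `calculateCharToAppend` nor the airlift implementation.

```java
import java.nio.charset.StandardCharsets;

final class CodePointUtf8
{
    private CodePointUtf8() {}

    // Hypothetical helper: returns the UTF-8 bytes of one Unicode code point.
    static byte[] encodeCodePoint(int codePoint)
    {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Invalid code point: " + codePoint);
        }
        // Lone surrogates have no UTF-8 encoding, so reject them explicitly
        // instead of letting getBytes() substitute a replacement byte.
        if (codePoint >= Character.MIN_SURROGATE && codePoint <= Character.MAX_SURROGATE) {
            throw new IllegalArgumentException("Surrogate code point: " + codePoint);
        }
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    }
}
```

Going through `String` allocates more than a hand-rolled bit-shifting encoder would, so a hot path would probably prefer an existing library routine if one is available.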

@homar homar force-pushed the homar/compute_min_max_for_too_long_string_orc_column branch 2 times, most recently from 94f8b84 to f87e11c Compare March 28, 2022 09:28
@@ -74,7 +75,8 @@ public OrcWriterOptions()
  DEFAULT_MAX_STRING_STATISTICS_LIMIT,
  DEFAULT_MAX_COMPRESSION_BUFFER_SIZE,
  ImmutableSet.of(),
- DEFAULT_BLOOM_FILTER_FPP);
+ DEFAULT_BLOOM_FILTER_FPP,
+ false);
Member

Is this the default? i think it should be true

Member Author

hm ok I can change this but I didn't want to change current behaviour for all connectors that are using trino-orc

Member

The new behavior is just better

throw new IllegalArgumentException("Provided byte array is not a valid utf8 string");
}

private static boolean isUtfBlockStartChar(byte b)
Member

Use the negation of io.airlift.slice.SliceUtf8#isContinuationByte instead.

Member Author

this is a private method

Member

right. let's copy it and mark with a comment it's copied from there

Member Author

isn't it cleaner to have a method that checks the condition we want instead of having one whose result has to be inverted/negated?

Member Author

Also this one was copied from orc-core

Member

i wanted isContinuationByte because it's a somewhat familiar thing for our codebase (even if it's a private method in airlift)

Member Author

ok, will do
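
For readers following the thread, the two spellings differ only by negation. A minimal sketch, assuming a local copy of the check (the class `Utf8Bytes` and the helper `countCodePoints` are hypothetical, and `isContinuationByte` here is written out for illustration, not a call into airlift):

```java
final class Utf8Bytes
{
    private Utf8Bytes() {}

    // A UTF-8 continuation byte has the bit pattern 10xxxxxx.
    // The diff's (b & 0xC0) != 0x80 is exactly the negation of this.
    static boolean isContinuationByte(byte b)
    {
        return (b & 0xC0) == 0x80;
    }

    // Hypothetical usage: every byte that is not a continuation byte
    // starts a new code point, so counting those counts code points
    // (assuming the input is valid UTF-8).
    static int countCodePoints(byte[] utf8)
    {
        int count = 0;
        for (byte b : utf8) {
            if (!isContinuationByte(b)) {
                count++;
            }
        }
        return count;
    }
}
```

Call sites that want "is this the start of a character" then read `!isContinuationByte(b)`, which is the form the reviewer asks for.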

@homar homar force-pushed the homar/compute_min_max_for_too_long_string_orc_column branch 4 times, most recently from d9d5109 to 9deaca9 Compare March 31, 2022 08:49
Comment on lines 207 to 214
if (maxBytes == 0) {
return null;
}
int lastIndex = findLastCharacterInRange(slice, maxBytes);
return slice.slice(0, lastIndex);
Member

This can end up returning an empty byte array (empty slice), so why are we special-casing the maxBytes==0 case above?

(eg consider an input being 4-byte utf8 sequence, and maxBytes=3; please add a test)

Member Author

So I had a check here for a situation when maxBytes is longer than provided byte array, but you wrote that it is dead code. It will obviously fail now for situations when the slice is longer than maxBytes

Member

> So I had a check here for a situation when maxBytes is longer than provided byte array

yes, but that's not the situation i'm concerned about here

if (maxBytes == 0) {
return null;
}
int firstRemovedCharacterIndex = findLastCharacterInRange(slice, maxBytes);
Member

this will throw (IllegalArgumentException("Provided byte array is not a valid utf8 string")) when input is eg 4-byte utf8 sequence, and maxBytes=3 (please add a test)

let's

  • change so that findLastCharacterInRange returns -1 (or perhaps Optional.empty) in such case
    • return early when we get -1 here
  • remove if (maxBytes == 0) above, as it becomes obsolete
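
A minimal sketch of what the steps above could look like, using plain `byte[]` instead of `Slice` so it runs with the JDK alone. The class name `Utf8Truncation` and the method `truncate` are hypothetical, the input is assumed to be valid UTF-8, and the merged code may well differ:

```java
import java.util.Arrays;

final class Utf8Truncation
{
    private Utf8Truncation() {}

    private static boolean isContinuationByte(byte b)
    {
        return (b & 0xC0) == 0x80;
    }

    // Returns the byte length of the longest prefix that fits in maxBytes
    // and ends on a code point boundary, or -1 if no complete code point fits.
    static int findLastCharacterInRange(byte[] utf8, int maxBytes)
    {
        int position = Math.min(maxBytes, utf8.length);
        // If the cut would land inside a multi-byte sequence, back up to its first byte.
        while (position > 0 && position < utf8.length && isContinuationByte(utf8[position])) {
            position--;
        }
        return position > 0 ? position : -1;
    }

    // Returns the truncated prefix, or null when nothing fits.
    // No separate maxBytes == 0 check is needed: that case yields -1 as well.
    static byte[] truncate(byte[] utf8, int maxBytes)
    {
        int end = findLastCharacterInRange(utf8, maxBytes);
        if (end == -1) {
            return null;
        }
        return Arrays.copyOf(utf8, end);
    }
}
```

With a single 4-byte sequence as input and `maxBytes = 3`, the loop backs up to position 0, so `findLastCharacterInRange` returns -1 and `truncate` returns null instead of throwing.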

Member Author

Ok but this will look very ugly

Member

propose something else?

@findepi (Member) commented Mar 31, 2022

CI looks related

2022-03-31T09:41:46.3050094Z [ERROR] io.trino.plugin.hive.TestHiveFileFormats.testOrcOptimizedWriter[10, 1](3)  Time elapsed: 0.226 s  <<< FAILURE!
2022-03-31T09:41:46.3050852Z io.trino.spi.TrinoException: Malformed ORC file. Validation failed [/tmp/trino_test9451668033849576631ORC]
2022-03-31T09:41:46.3051889Z 	at io.trino.plugin.hive.orc.OrcFileWriter.commit(OrcFileWriter.java:204)
2022-03-31T09:41:46.3052558Z 	at io.trino.plugin.hive.AbstractTestHiveFileFormats.createTestFileTrino(AbstractTestHiveFileFormats.java:597)
2022-03-31T09:41:46.3053246Z 	at io.trino.plugin.hive.TestHiveFileFormats$FileFormatAssertion.assertRead(TestHiveFileFormats.java:1281)
2022-03-31T09:41:46.3054043Z 	at io.trino.plugin.hive.TestHiveFileFormats$FileFormatAssertion.isReadableByRecordCursor(TestHiveFileFormats.java:1238)
2022-03-31T09:41:46.3054722Z 	at io.trino.plugin.hive.TestHiveFileFormats.testOrcOptimizedWriter(TestHiveFileFormats.java:328)
2022-03-31T09:41:46.3055714Z 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2022-03-31T09:41:46.3056401Z 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
2022-03-31T09:41:46.3057065Z 	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2022-03-31T09:41:46.3057599Z 	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
2022-03-31T09:41:46.3058232Z 	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:104)
2022-03-31T09:41:46.3058749Z 	at org.testng.internal.Invoker.invokeMethod(Invoker.java:645)
2022-03-31T09:41:46.3059182Z 	at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:851)
2022-03-31T09:41:46.3059640Z 	at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1177)
2022-03-31T09:41:46.3060149Z 	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:129)
2022-03-31T09:41:46.3060670Z 	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:112)
2022-03-31T09:41:46.3061191Z 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2022-03-31T09:41:46.3061748Z 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2022-03-31T09:41:46.3062282Z 	at java.base/java.lang.Thread.run(Thread.java:829)
2022-03-31T09:41:46.3062754Z Caused by: io.trino.orc.OrcCorruptionException: Malformed ORC file. Validation failed [/tmp/trino_test9451668033849576631ORC]
2022-03-31T09:41:46.3063214Z 	at io.trino.orc.OrcReader.validateFile(OrcReader.java:431)
2022-03-31T09:41:46.3063660Z 	at io.trino.orc.OrcWriter.validate(OrcWriter.java:562)
2022-03-31T09:41:46.3064095Z 	at io.trino.plugin.hive.orc.OrcFileWriter.commit(OrcFileWriter.java:199)
2022-03-31T09:41:46.3064427Z 	... 17 more
2022-03-31T09:41:46.3064959Z Caused by: io.trino.orc.OrcCorruptionException: Malformed ORC file. Write validation failed: unexpected string range in Row group 0 in stripe at offset 3 column 15 statistics [/tmp/trino_test9451668033849576631ORC]
2022-03-31T09:41:46.3065669Z 	at io.trino.orc.OrcWriteValidation.validateColumnStatisticsEquivalent(OrcWriteValidation.java:420)
2022-03-31T09:41:46.3066468Z 	at io.trino.orc.OrcWriteValidation.validateColumnStatisticsEquivalent(OrcWriteValidation.java:364)
2022-03-31T09:41:46.3067111Z 	at io.trino.orc.OrcWriteValidation.validateRowGroupStatistics(OrcWriteValidation.java:331)
2022-03-31T09:41:46.3067669Z 	at io.trino.orc.OrcRecordReader.advanceToNextRowGroup(OrcRecordReader.java:461)
2022-03-31T09:41:46.3068157Z 	at io.trino.orc.OrcRecordReader.nextPage(OrcRecordReader.java:387)
2022-03-31T09:41:46.3068576Z 	at io.trino.orc.OrcReader.validateFile(OrcReader.java:424)
2022-03-31T09:41:46.3068871Z 	... 19 more
2022-03-31T09:41:46.3069486Z 	Suppressed: io.trino.orc.OrcCorruptionException: Malformed ORC file. Write validation failed: unexpected string range in file column 16 statistics [/tmp/trino_test9451668033849576631ORC]
2022-03-31T09:41:46.3070182Z 		at io.trino.orc.OrcWriteValidation.validateColumnStatisticsEquivalent(OrcWriteValidation.java:420)
2022-03-31T09:41:46.3070864Z 		at io.trino.orc.OrcWriteValidation.validateColumnStatisticsEquivalent(OrcWriteValidation.java:364)
2022-03-31T09:41:46.3071470Z 		at io.trino.orc.OrcWriteValidation.validateFileStatistics(OrcWriteValidation.java:220)
2022-03-31T09:41:46.3072069Z 		at io.trino.orc.OrcRecordReader.close(OrcRecordReader.java:372)
2022-03-31T09:41:46.3072484Z 		at io.trino.orc.OrcReader.validateFile(OrcReader.java:413)
2022-03-31T09:41:46.3072771Z 		... 19 more
2022-03-31T09:41:46.3072901Z 

@homar (Member, Author) commented Mar 31, 2022

> CI looks related
> [same stack trace as in the previous comment]

CI is related and I already spent 2 hours trying to fix it ;)

@homar homar force-pushed the homar/compute_min_max_for_too_long_string_orc_column branch from 9deaca9 to 746e876 Compare March 31, 2022 14:15
@findepi findepi merged commit ea2cc01 into trinodb:master Mar 31, 2022
@findepi findepi mentioned this pull request Mar 31, 2022
@github-actions github-actions bot added this to the 376 milestone Mar 31, 2022
2 participants