34749 added cache for avro coder to reduce memory footprint #34873
assign set of reviewers
@scwhittle This is a follow-up to our discussion in #34749 and #34750. I would like to run a test with real data on the Dataflow runner. Is there a way to get a snapshot build for this PR?
I think you should be able to run a pipeline with these changes by publishing the jars to your local Maven repository and then configuring your pipeline to use them.
In the repo with this PR:
    ./gradlew sdks:java:io:google-cloud-platform:compileJava
Find the local Maven path:
    mvn help:evaluate -Dexpression=settings.localRepository
In the project with your pipeline, modify pom.xml to use the local repo:
    <repository>
      <id>example-repo</id>
      <name>Example Repository</name>
      <url>file:///complete/path/to/.m2/repository</url>
    </repository>
    ...
    # Also modify the Beam version in the pom.xml to match the version published above, i.e. from 2.64.0 to the SNAPSHOT version (2.65.0-SNAPSHOT).
Then, if you launch the Dataflow pipeline as normal, it should pull in the locally published SDK changes. Let me know if that works or if you have problems with it.
Thanks!
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment |
...tensions/avro/src/main/java/org/apache/beam/sdk/extensions/avro/coders/AvroGenericCoder.java
@@ -354,8 +392,10 @@ public Schema get() {
// an inner coder.
private final EmptyOnDeserializationThreadLocal<BinaryDecoder> decoder;
private final EmptyOnDeserializationThreadLocal<BinaryEncoder> encoder;
private final EmptyOnDeserializationThreadLocal<DatumWriter<T>> writer;
Sorry I didn't think of this earlier, but I realized that this is going to change the Java serialization of this coder, which will lead to update incompatibility.
Options:
- Keep the existing fields alongside the new transient fields and just don't use the existing fields.
- Remove the new fields and change the existing fields to do the cache lookup in the initialValue method. I think you could remove readObject then. We would still be using unnecessary per-thread caching, though.
I think I'd lean towards the first option of keeping the existing fields but not using them.
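The first option can be sketched in isolation as follows. This is a minimal illustration, not the code from this PR: CompatCoderSketch, FakeWriter, WRITER_CACHE and legacyWriter are hypothetical names, and a plain String stands in for the schema. It assumes that only non-transient fields take part in default Java serialization, so keeping the old field and adding a transient one leaves the serialized form as it was.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a coder; this is not Beam's AvroCoder.
final class CompatCoderSketch implements Serializable {
  private static final long serialVersionUID = 1L;

  // Hypothetical stand-in for an expensive object we want to share.
  static final class FakeWriter {
    final String schema;

    FakeWriter(String schema) {
      this.schema = schema;
    }
  }

  // Process-wide cache shared by all coder instances with the same schema.
  private static final ConcurrentHashMap<String, FakeWriter> WRITER_CACHE =
      new ConcurrentHashMap<>();

  private final String schema;

  // Legacy field: still written by default serialization so the serialized
  // form keeps its old shape, but no code path reads it any more.
  @SuppressWarnings("unused")
  private final String legacyWriter;

  // New field: transient, so it is not part of the serialized form.
  private transient FakeWriter writer;

  CompatCoderSketch(String schema) {
    this.schema = schema;
    this.legacyWriter = "unused";
    this.writer = lookup(schema);
  }

  private static FakeWriter lookup(String schema) {
    return WRITER_CACHE.computeIfAbsent(schema, FakeWriter::new);
  }

  // Transient fields are null after deserialization, so rebuild them here.
  private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
    in.defaultReadObject();
    this.writer = lookup(schema);
  }

  FakeWriter writer() {
    return writer;
  }

  // Serializes and deserializes a coder in memory.
  static CompatCoderSketch roundTrip(CompatCoderSketch coder) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(coder);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (CompatCoderSketch) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }
}
```

After a round trip through roundTrip, the copy's writer() is repopulated from the cache even though the field itself was never serialized.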
@scwhittle I see, I guess this is to support pipeline upgrades?
Anyway, I restored the old fields.
The test failures look unrelated, but I will retrigger just to be safe.
Run Java PreCommit
This addresses issue #34749.
Results on heap allocation:
AvroDatumFactory
before: 32%
after: 0.036%
AvroCoder
before: 69%
after: 11%
As we expected, this improves not only user code but also SDK performance.
Example taken from another pipeline:
org.apache.beam.runners.dataflow.worker.StreamingModeExecutionContext.flushState()
before: 43%
after: 10%
See also PR #34750 for more details
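The PR's diff is not reproduced in this conversation, so the following is only a generic sketch of the technique the title names: one shared cache keyed by schema instead of one instance per coder or per thread. SharedInstanceCache and createdCount are hypothetical names, and the sketch assumes the cached values are safe to use from several threads at once.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical generic cache; not the class added by this PR.
final class SharedInstanceCache<K, V> {
  private final Map<K, V> cache = new ConcurrentHashMap<>();
  private final Function<K, V> factory;
  private final AtomicInteger created = new AtomicInteger();

  SharedInstanceCache(Function<K, V> factory) {
    this.factory = factory;
  }

  // Returns the cached value for the key, creating it at most once per key.
  V get(K key) {
    return cache.computeIfAbsent(
        key,
        k -> {
          created.incrementAndGet();
          return factory.apply(k);
        });
  }

  // Number of values actually constructed, for checking how much was shared.
  int createdCount() {
    return created.get();
  }
}
```

With a cache like this, the number of allocated instances grows with the number of distinct keys and no longer with the number of coders or threads asking for them.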