34749 added cache for avro coder to reduce memory footprint #34873

Open
wants to merge 6 commits into
base: master

Conversation

wollowizard
Contributor

@wollowizard wollowizard commented May 6, 2025

This addresses issue #34749.

Results on heap allocation:

AvroDatumFactory
before: 32%
after: 0.036%

AvroCoder
before: 69%
after: 11%

As expected, this improves not only user code but also SDK performance.
Example taken from another pipeline:
org.apache.beam.runners.dataflow.worker.StreamingModeExecutionContext.flushState()
before: 43%
after: 10%

Before

[3 images]

After

[3 images]

See also PR #34750 for more details

@wollowizard
Contributor Author

assign set of reviewers

@wollowizard
Contributor Author

wollowizard commented May 6, 2025

@scwhittle this is a follow up to our discussion in #34749 and #34750. I would like to run a test with real data in the dataflow runner, is there a way to get a snapshot build for this PR?

@scwhittle
Contributor

> @scwhittle this is a follow up to our discussion in #34749 and #34750. I would like to run a test with real data in the dataflow runner, is there a way to get a snapshot build for this PR?

I think that you should be able to run a pipeline with these changes by publishing the jars to your local maven and then configuring your pipeline to use them.

In the repo with this PR:

./gradlew sdks:java:io:google-cloud-platform:compileJava \
  && ./gradlew -Ppublishing -p sdks/java/io/google-cloud-platform publishToMavenLocal

Find the local Maven repository path:

mvn help:evaluate -Dexpression=settings.localRepository

In the project with your pipeline, modify pom.xml to use the local repo:

<repositories>
  <repository>
    <id>example-repo</id>
    <name>Example Repository</name>
    <url>file:///complete/path/to/.m2/repository</url>
  </repository>
  ...
</repositories>

Also modify the Beam version in the pom.xml to match the version published above, i.e. change the release version (2.64.0) to the SNAPSHOT version (2.65.0-SNAPSHOT).

Then, if you launch the Dataflow pipeline as normal from your project, it should pull in the locally published SDK changes. Let me know if that works or if you run into problems.

Contributor

@scwhittle scwhittle left a comment

Thanks!

Contributor

github-actions bot commented May 7, 2025

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@wollowizard wollowizard requested a review from scwhittle May 7, 2025 19:02
@@ -354,8 +392,10 @@ public Schema get() {
// an inner coder.
private final EmptyOnDeserializationThreadLocal<BinaryDecoder> decoder;
private final EmptyOnDeserializationThreadLocal<BinaryEncoder> encoder;
private final EmptyOnDeserializationThreadLocal<DatumWriter<T>> writer;
Contributor

Sorry I didn't think of this earlier, but I realized that this is going to change the Java serialization of this coder, which will lead to update incompatibility.

Options:

  • Keep the existing fields, add the new transient fields, and just don't use the existing fields.
  • Remove the new fields and change the existing fields to do the cache lookup in the initialValue method. I think you could remove readObject then. We would still be doing unnecessary per-thread caching in that case, though.

I think I'd lean towards the first option of keeping the existing fields but not using them.
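
The first option can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `SketchCoder`, `FakeWriter`, `legacyWriterHolder`, and the static `CACHE` are hypothetical names standing in for the coder, the expensive per-schema objects, the pre-existing serialized fields, and the new cache. It only shows that a `transient` field is left out of default Java serialization and can be refilled lazily from a shared cache after deserialization, while the old non-transient field stays in place.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class CoderCompatSketch {

  /** Hypothetical stand-in for an expensive per-schema object (for example a writer). */
  static final class FakeWriter {
    final String schema;

    FakeWriter(String schema) {
      this.schema = schema;
    }
  }

  /** Hypothetical coder; it only mimics the field layout discussed above. */
  static final class SketchCoder implements Serializable {
    private static final long serialVersionUID = 1L;

    // Process-wide cache shared by all coder instances that use the same schema.
    private static final ConcurrentMap<String, FakeWriter> CACHE = new ConcurrentHashMap<>();

    private final String schema;

    // Pre-existing field: kept so the set of serialized fields is unchanged, but never read.
    @SuppressWarnings("unused")
    private final String legacyWriterHolder;

    // New field: transient, so default Java serialization does not write it.
    private transient FakeWriter writer;

    SketchCoder(String schema) {
      this.schema = schema;
      this.legacyWriterHolder = "unused";
    }

    /** Looks the writer up lazily; after deserialization the transient field starts as null. */
    FakeWriter writer() {
      FakeWriter w = writer;
      if (w == null) {
        w = CACHE.computeIfAbsent(schema, FakeWriter::new);
        writer = w;
      }
      return w;
    }

    boolean writerFieldIsSet() {
      return writer != null;
    }
  }

  /** Serializes and deserializes the coder in memory, wrapping checked exceptions. */
  static SketchCoder roundTrip(SketchCoder coder) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(coder);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (SketchCoder) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    SketchCoder original = new SketchCoder("user-schema");
    FakeWriter first = original.writer();
    SketchCoder copy = roundTrip(original);
    System.out.println(copy.writerFieldIsSet()); // false: transient field was not serialized
    System.out.println(copy.writer() == first);  // true: refilled from the shared cache
  }
}
```

In this sketch the deserialized copy has no writer until `writer()` is first called, at which point it picks up the instance already held in the shared cache for that schema.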

Contributor Author

@scwhittle I see, I guess this is to support pipeline upgrades?
Anyway, I restored the old fields.

@wollowizard wollowizard requested a review from scwhittle May 9, 2025 11:32
@scwhittle
Contributor

The test failures look unrelated. I will retrigger just to be safe, though.

@scwhittle
Contributor

Run Java PreCommit


2 participants