Java: Pulling in protobuf's faster UTF-8 encoder. #5035

omalley · 2018-11-13T21:28:23Z

This change pulls the Utf8 Java class from Protobuf, which has a much faster encoder than Java's. In my benchmarks, I see a ~45% speedup in FlatBuffers serialization. The speedup on deserialization is only ~14%.

googlebot · 2018-11-13T21:28:26Z

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.

What to do if you already signed the CLA

Individual signers

It's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data and verify that your email is set on your git commits.

Corporate signers

Your company has a Point of Contact who decides which employees are authorized to participate. Ask your POC to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the Google project maintainer to go/cla#troubleshoot (Public version).
The email used to register you as an authorized contributor must be the email used for the Git commit. Check your existing CLA data and verify that your email is set on your git commits.
The email used to register you as an authorized contributor must also be attached to your GitHub account.

omalley · 2018-11-13T21:32:26Z

I signed it!

googlebot · 2018-11-13T21:32:28Z

CLAs look good, thanks!

shivendra14 · 2018-11-14T16:02:33Z

Is this Java specific? Anything similar for cpp?

omalley · 2018-11-14T16:47:04Z

I haven't looked at the flatbuffer C++ implementation and how it is converting strings to & from UTF-8, so I don't know. Java stores strings in UTF-16, so you always have to convert. In C++, the encoding for std::string isn't defined by the library, so there probably isn't matching functionality.

The Java code was using the standard Java library for such encoding and creating a new byte buffer with the UTF-8 string and then copying it to the final buffer. By using the UTF-8 encoder from Protobuf, the new code can calculate the size of the UTF-8 bytes ahead of time and directly write them into the final buffer. Both the size calculation and translation have a fast path for ASCII characters.
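
To make the idea above concrete, here is a minimal sketch of a two-pass encoder: one pass to compute the UTF-8 length, one pass to write straight into the destination ByteBuffer, each with a leading ASCII fast path. The class name `Utf8Sketch` and its method names are hypothetical and chosen only for illustration; this is not the Protobuf or FlatBuffers code.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration; not the Protobuf or FlatBuffers implementation.
final class Utf8Sketch {
    /** Returns the number of UTF-8 bytes needed for s, without allocating. */
    static int encodedLength(CharSequence s) {
        int n = s.length();
        int i = 0;
        // Fast path: leading ASCII chars need exactly one byte each.
        while (i < n && s.charAt(i) < 0x80) i++;
        int bytes = i;
        for (; i < n; i++) {
            char c = s.charAt(i);
            if (c < 0x80) bytes += 1;
            else if (c < 0x800) bytes += 2;
            else if (Character.isSurrogate(c)) {
                requirePair(s, i);
                bytes += 4; // a surrogate pair encodes one 4-byte code point
                i++;
            } else bytes += 3;
        }
        return bytes;
    }

    /** Writes s as UTF-8 at out's current position; no temporary byte[] is created. */
    static void encode(CharSequence s, ByteBuffer out) {
        int n = s.length();
        int i = 0;
        // Fast path: copy leading ASCII chars directly.
        for (; i < n && s.charAt(i) < 0x80; i++) out.put((byte) s.charAt(i));
        for (; i < n; i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                out.put((byte) c);
            } else if (c < 0x800) {
                out.put((byte) (0xC0 | (c >>> 6)));
                out.put((byte) (0x80 | (c & 0x3F)));
            } else if (Character.isSurrogate(c)) {
                requirePair(s, i);
                int cp = Character.toCodePoint(c, s.charAt(++i));
                out.put((byte) (0xF0 | (cp >>> 18)));
                out.put((byte) (0x80 | ((cp >>> 12) & 0x3F)));
                out.put((byte) (0x80 | ((cp >>> 6) & 0x3F)));
                out.put((byte) (0x80 | (cp & 0x3F)));
            } else {
                out.put((byte) (0xE0 | (c >>> 12)));
                out.put((byte) (0x80 | ((c >>> 6) & 0x3F)));
                out.put((byte) (0x80 | (c & 0x3F)));
            }
        }
    }

    // Rejects a high surrogate without a following low surrogate, and a lone low surrogate.
    private static void requirePair(CharSequence s, int i) {
        if (!Character.isHighSurrogate(s.charAt(i)) || i + 1 >= s.length()
                || !Character.isLowSurrogate(s.charAt(i + 1))) {
            throw new IllegalArgumentException("unpaired surrogate at index " + i);
        }
    }
}
```

A caller would first reserve `encodedLength(s)` bytes in the target buffer and then call `encode(s, buffer)`, so the intermediate copy described above is avoided.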

aardappel · 2018-11-15T22:36:50Z

@omalley:
This is generally awesome. Since string conversion can be Java's bottleneck, this is very valuable.

A few concerns:

The comments mention this implements a relatively strict decoder. I'd be worried that someone's use of FlatBuffers in the field suddenly breaks because they have sloppy UTF-8 data stored. What is the likelihood of this? Any idea how it might differ in behavior vs the Java SDK one? If invalid UTF-8 is found, how does that affect the String returned?
Also uneasy about the use of sun.misc.Unsafe; there are rumors it will be removed (https://www.javaworld.com/article/2952869/java-platform/understanding-sun-misc-unsafe.html). Is it even supported on all platforms (e.g. all versions of Android? https://gitter.im/scala-android/sbt-android/archives/2018/01/09)? Would it simply fall back on the slower path?
- If it is going to fall back, it would be good that this happens without class warnings.
- There are also logging statements in the code; would prefer to not have those.
- How fast is the non-unsafe fallback path? If it is slower than the Java SDK code, that would be a problem. If it is almost as fast as the unsafe code, we could greatly simplify this code by just always using this path.
- Why is unsafe even needed? UTF-8 is byte-based parsing, surely it can't be that slow to just read bytes out of a byte[]? UnsafeUtil is full of scalar accessors that are not needed.
Generally, while I understand that there's an advantage to copying these files as-is from Protobuf just in case we'd want to update them in the future, it is also a LOT of code, quite a bit of which doesn't seem necessary for the FlatBuffers use case. If possible, I'd prefer reduced, special-case code with fewer dependencies.

@shivendra14: The C++ implementation currently does no UTF-8 encoding or decoding; UTF-8 is its native format.

omalley · 2018-11-16T19:16:19Z

The comments mention this implements a relatively strict decoder. I'd be worried that someone's use of FlatBuffers in the field suddenly breaks because they have sloppy UTF-8 data stored. What is the likelihood of this?

The chance is very very low. Many of us have used protobuf for years and I've never talked to anyone who has had a problem with their UTF-8 encoder. I don't think the Java encoder generates such sequences.
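
For readers wondering what "strict" versus the Java SDK behavior can mean in practice, the sketch below contrasts the two modes using only JDK classes. It does not show how the Protobuf decoder reports errors; the class name `DecodeModes` is hypothetical, and the strict variant here is simply the JDK's own CharsetDecoder configured to report malformed input.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration using JDK classes only.
final class DecodeModes {
    /** Lenient: malformed bytes are replaced with U+FFFD and no exception is thrown. */
    static String lenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Strict: malformed bytes cause a CharacterCodingException. */
    static String strict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }
}
```

Data that round-trips through a lenient decoder with replacement characters would start failing under a strict one, which is the compatibility risk raised in the question.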

Also uneasy about the use of sun.misc.Unsafe; there are rumors it will be removed (https://www.javaworld.com/article/2952869/java-platform/understanding-sun-misc-unsafe.html). Is it even supported on all platforms (e.g. all versions of Android? https://gitter.im/scala-android/sbt-android/archives/2018/01/09)? Would it simply fall back on the slower path?

Sigh. Yes, Java is trying to remove Unsafe, but they haven't yet proposed a workable solution for projects that get significant speed boosts by using it.

The code detects whether unsafe is available and automatically falls back to the safe path.
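
As a rough sketch of how such detection can work, the snippet below looks up sun.misc.Unsafe reflectively and silently reports failure instead of logging. The class `UnsafeProbe` is hypothetical and is not the detection code from the pull request; it also assumes the OpenJDK field name `theUnsafe`, so a runtime that names the field differently would simply be treated as having no Unsafe.

```java
import java.lang.reflect.Field;

// Hypothetical illustration; not the detection code used in the pull request.
final class UnsafeProbe {
    /** Returns the sun.misc.Unsafe instance, or null if it cannot be obtained. */
    static Object tryGetUnsafe() {
        try {
            // Reflection avoids a compile-time dependency on sun.misc.Unsafe.
            Class<?> clazz = Class.forName("sun.misc.Unsafe");
            Field field = clazz.getDeclaredField("theUnsafe"); // assumed OpenJDK field name
            field.setAccessible(true);
            return field.get(null);
        } catch (Throwable t) {
            // Missing class, renamed field, or denied access: fall back quietly.
            return null;
        }
    }

    /** Computed once at class load. */
    static final boolean HAS_UNSAFE = tryGetUnsafe() != null;

    /** Names the code path a caller would select. */
    static String encoderName() {
        return HAS_UNSAFE ? "unsafe" : "safe";
    }
}
```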

There are also logging statements in the code; would prefer to not have those.

Ok.

Why is unsafe even needed? UTF-8 is byte-based parsing, surely it can't be that slow to just read bytes out of a byte[]?

I'll test out the performance of the safe vs unsafe code paths.

Generally, while I understand that there's an advantage to copying these files as-is from Protobuf just in case we'd want to update them in the future, it is also a LOT of code, quite a bit of which doesn't seem necessary for the FlatBuffers use case. If possible, I'd prefer reduced, special-case code with fewer dependencies.

Taking the entire pair of classes is a trade-off:

- more code, which is bad
- less chance of introducing bugs, which is good
- easier to apply future fixes from protobuf, which is good

It all depends on which you value more highly.

aardappel · 2018-11-16T19:48:38Z

It also depends on the speed. If we need Unsafe to reach good speeds, then using the files mostly as-is makes sense. If it is not a big difference with the safe path then getting rid of the unsafe path would be a huge simplification, and worth doing, imho.

omalley · 2018-11-20T15:20:39Z

Ok, I've factored out the faster encoders into an API and three implementations:

- One that uses the built-in Java encoder
- One that uses the protobuf "safe" encoder
- One that uses the protobuf "unsafe" encoder

This code is a first pass and really should have some unit tests. Note that the safe encoder has two variants: one for array-backed ByteBuffers and one for non-array ones. The unsafe encoder has three variants: array-backed ByteBuffers, direct ByteBuffers, and neither. The neither case uses the same code as the safe non-array variant.
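
The array-backed versus non-array split can be sketched as a dispatch on `ByteBuffer.hasArray()`. The class `BufferDispatch` below is hypothetical and only illustrates the two "safe" variants; it assumes the caller has already reserved enough space in the buffer.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of the two "safe" variants; not the code from the pull request.
final class BufferDispatch {
    /** Encodes s at out's position and returns the name of the path taken. */
    static String encode(String s, ByteBuffer out) {
        if (out.hasArray()) {
            // Array-backed: index the backing byte[] directly.
            int start = out.arrayOffset() + out.position();
            int end = encodeToArray(s, out.array(), start);
            out.position(out.position() + (end - start));
            return "array";
        }
        // Direct or read-only buffers: go through ByteBuffer.put.
        encodeWithPut(s, out);
        return "put";
    }

    private static int encodeToArray(String s, byte[] a, int j) {
        for (int i = 0; i < s.length(); ) {
            int cp = codePoint(s, i);
            i += Character.charCount(cp);
            if (cp < 0x80) {
                a[j++] = (byte) cp;
            } else if (cp < 0x800) {
                a[j++] = (byte) (0xC0 | (cp >>> 6));
                a[j++] = (byte) (0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                a[j++] = (byte) (0xE0 | (cp >>> 12));
                a[j++] = (byte) (0x80 | ((cp >>> 6) & 0x3F));
                a[j++] = (byte) (0x80 | (cp & 0x3F));
            } else {
                a[j++] = (byte) (0xF0 | (cp >>> 18));
                a[j++] = (byte) (0x80 | ((cp >>> 12) & 0x3F));
                a[j++] = (byte) (0x80 | ((cp >>> 6) & 0x3F));
                a[j++] = (byte) (0x80 | (cp & 0x3F));
            }
        }
        return j;
    }

    private static void encodeWithPut(String s, ByteBuffer out) {
        for (int i = 0; i < s.length(); ) {
            int cp = codePoint(s, i);
            i += Character.charCount(cp);
            if (cp < 0x80) {
                out.put((byte) cp);
            } else if (cp < 0x800) {
                out.put((byte) (0xC0 | (cp >>> 6))).put((byte) (0x80 | (cp & 0x3F)));
            } else if (cp < 0x10000) {
                out.put((byte) (0xE0 | (cp >>> 12))).put((byte) (0x80 | ((cp >>> 6) & 0x3F)))
                   .put((byte) (0x80 | (cp & 0x3F)));
            } else {
                out.put((byte) (0xF0 | (cp >>> 18))).put((byte) (0x80 | ((cp >>> 12) & 0x3F)))
                   .put((byte) (0x80 | ((cp >>> 6) & 0x3F))).put((byte) (0x80 | (cp & 0x3F)));
            }
        }
    }

    // Reads one code point and rejects unpaired surrogates.
    private static int codePoint(String s, int i) {
        int cp = s.codePointAt(i);
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            throw new IllegalArgumentException("unpaired surrogate at index " + i);
        }
        return cp;
    }
}
```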

On my benchmark, the results I get are:

generated     direct             orig  avgt   10  32.133 ± 1.755  ms/op
generated     direct             safe  avgt   10  12.973 ± 0.189  ms/op
generated     direct           unsafe  avgt   10  11.745 ± 0.231  ms/op
generated      array             orig  avgt   10  45.376 ± 1.546  ms/op
generated      array             safe  avgt   10  14.136 ± 0.286  ms/op
generated      array           unsafe  avgt   10  12.482 ± 0.158  ms/op

So direct ByteBuffer with the unsafe encoder is the fastest. Do you care about the extra 10% to have the unsafe encoder?

aardappel · 2018-11-26T21:28:09Z

My gut feeling says that for the sake of simplicity, testing and dependencies, it would be nice to go with just the safe encoder. It's a huge improvement over the original already, and that extra 10% (or 3%, from the perspective of the original) is not worth the extra complexity.

Thanks for testing this!
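
The "10%" and "3%" figures can be checked against the direct ByteBuffer rows of the benchmark: the time saved by going from safe to unsafe is about 9.5% of the safe time and about 3.8% of the original time. The small helper below, with the hypothetical name `SpeedupMath`, just does that arithmetic.

```java
// Hypothetical helper for checking the percentages quoted in the thread.
final class SpeedupMath {
    /** Time saved by going from slower to faster, as a percentage of baseline. */
    static double savedPercent(double slower, double faster, double baseline) {
        return 100.0 * (slower - faster) / baseline;
    }

    public static void main(String[] args) {
        // ms/op for the direct ByteBuffer rows of the benchmark table.
        double orig = 32.133, safe = 12.973, unsafe = 11.745;
        System.out.println("unsafe vs safe, relative to safe: " + savedPercent(safe, unsafe, safe));
        System.out.println("unsafe vs safe, relative to orig: " + savedPercent(safe, unsafe, orig));
        System.out.println("safe vs orig, relative to orig:   " + savedPercent(orig, safe, orig));
    }
}
```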

omalley · 2018-12-15T00:58:26Z

Ok, I've removed the unsafe code and the numbers look good. I left the Java UTF-8 encoder as an option in case someone wants to revert to the original encoder.

aardappel · 2018-12-17T19:55:20Z

pom.xml

  <groupId>com.google.flatbuffers</groupId>
  <artifactId>flatbuffers-java</artifactId>
-  <version>1.10.0</version>
+  <version>1.10.1-SNAPSHOT</version>


Why this change?

omalley · 2018-12-17

You do need to change the version string before the next release, but I changed it here mostly for my own upstream testing to make sure that I got the new artifact with my change and not the cached 1.10.0.

In my projects, I change the version string just after a release. In your case, I would have changed it to "1.11.0-SNAPSHOT". Just before the release, I would take off the "-SNAPSHOT".

aardappel · 2018-12-17

Ok.. I guess we haven't been doing that so far, but ok for now.

aardappel · 2018-12-17T20:00:23Z

This looks great! Thanks for factoring out the unsafe stuff. I can merge, though not sure if we need to be changing the version number between releases.

aardappel · 2018-12-17T21:54:35Z

Merged! Let's see if it has a (positive) impact on Java users :)

aardappel · 2019-01-31T16:57:31Z

@omalley looks like the use of lambdas creates problems with Java 1.7, any way we can work around that?

flatbuffers/FlatBufferBuilder.java:180: error: default methods are not supported in -source 1.7
    default void releaseByteBuffer(ByteBuffer bb) {
    ^
  (use -source 8 or higher to enable default methods)

flatbuffers/java/com/google/flatbuffers/Utf8Old.java:46: error: lambda expressions are not supported in -source 1.7
    ThreadLocal.withInitial(() -> new Cache());
    ^
  (use -source 8 or higher to enable lambda expressions)
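
One possible way to express both constructs so they compile with -source 1.7 is sketched below: an anonymous ThreadLocal subclass instead of `ThreadLocal.withInitial` with a lambda, and an abstract class with a no-op method instead of an interface default method. The names `Java7Compat`, `Cache` and `BufferFactorySketch` are hypothetical stand-ins; this is not necessarily how the project resolved the breakage.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of Java 7 compatible equivalents.
final class Java7Compat {
    /** Stand-in for the per-thread state held by the real cache. */
    static final class Cache {
        final StringBuilder scratch = new StringBuilder();
    }

    // Java 8+: ThreadLocal.withInitial(() -> new Cache())
    // Java 7:  override initialValue() in an anonymous subclass.
    static final ThreadLocal<Cache> CACHE = new ThreadLocal<Cache>() {
        @Override
        protected Cache initialValue() {
            return new Cache();
        }
    };

    // Java 8+: interface with "default void releaseByteBuffer(ByteBuffer bb) {}"
    // Java 7:  abstract class whose optional hook has an empty body.
    abstract static class BufferFactorySketch {
        abstract ByteBuffer newByteBuffer(int capacity);

        void releaseByteBuffer(ByteBuffer bb) {
            // No-op by default; pooling factories can override this.
        }
    }
}
```

The trade-off of the abstract-class form is that implementers can no longer supply the factory as a lambda or combine it with another superclass.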

related breakage: #4914

* Pulling in protobuf's faster UTF-8 encoder. * Remove Utf8 unsafe code.

aardappel · 2019-06-24T23:57:26Z

@omalley
Seems Utf8Old was broken; I had to change this to make it work: ff1a22a

omalley added 2 commits December 14, 2018 16:17

Pulling in protobuf's faster UTF-8 encoder.

21986c8

Remove Utf8 unsafe code.

cce8241

omalley force-pushed the utf-8 branch from 4594739 to cce8241 Compare December 15, 2018 00:52

aardappel reviewed Dec 17, 2018

aardappel merged commit cb99116 into google:master Dec 17, 2018

aardappel mentioned this pull request Jan 31, 2019

Java - Add ByteBufferFactory#releaseByteBuffer - enable pooling of ByteBuffers #4914

Merged

zchee pushed a commit to zchee/flatbuffers that referenced this pull request Feb 14, 2019

Java: Pulling in protobuf's faster UTF-8 encoder. (google#5035)

16d222c

* Pulling in protobuf's faster UTF-8 encoder. * Remove Utf8 unsafe code.
