Conversation

@omalley
Contributor

@omalley omalley commented Nov 13, 2018

This change pulls in the Utf8 Java class from Protobuf, which has a much faster encoder than Java's. In my benchmarks, I see a ~45% speedup in FlatBuffer serialization. The speedup on deserialization is only ~14%.

@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.



@omalley
Contributor Author

omalley commented Nov 13, 2018

I signed it!

@googlebot

CLAs look good, thanks!

@shivendra14
Contributor

Is this Java-specific? Is there anything similar for C++?

@omalley
Contributor Author

omalley commented Nov 14, 2018

I haven't looked at the FlatBuffers C++ implementation or how it converts strings to and from UTF-8, so I don't know. Java stores strings in UTF-16, so you always have to convert. In C++, the encoding for std::string isn't defined by the library, so there probably isn't matching functionality.

The existing Java code used the standard Java library to encode each string into a new temporary byte buffer and then copied that buffer into the final one. With the UTF-8 encoder from Protobuf, the new code can calculate the size of the UTF-8 bytes ahead of time and write them directly into the final buffer. Both the size calculation and the translation have a fast path for ASCII characters.
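
To make the two-step approach concrete, here is a minimal sketch of a length calculation followed by a direct write, each with an ASCII fast path. It is not the Protobuf code: the class name `Utf8Sketch` and both method names are assumptions for illustration, and the sketch rejects unpaired surrogates by throwing, which may differ from what the real encoder does.

```java
import java.nio.ByteBuffer;

/** Illustrative only; the class and method names are hypothetical. */
final class Utf8Sketch {
    private Utf8Sketch() {}

    /** Returns the number of bytes needed to encode s as UTF-8. */
    static int encodedLength(CharSequence s) {
        int n = s.length();
        int i = 0;
        // Fast path: a leading run of ASCII chars costs one byte each.
        while (i < n && s.charAt(i) < 0x80) {
            i++;
        }
        int bytes = i;
        for (; i < n; i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;
            } else if (c < 0x800) {
                bytes += 2;
            } else if (Character.isHighSurrogate(c) && i + 1 < n
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4; // a surrogate pair encodes one supplementary code point
                i++;
            } else if (Character.isSurrogate(c)) {
                throw new IllegalArgumentException("Unpaired surrogate at index " + i);
            } else {
                bytes += 3;
            }
        }
        return bytes;
    }

    /** Writes s as UTF-8 at out.position(); the caller reserves encodedLength(s) bytes. */
    static void encode(CharSequence s, ByteBuffer out) {
        int n = s.length();
        int i = 0;
        // Fast path: copy the leading ASCII run without any branching on width.
        for (; i < n && s.charAt(i) < 0x80; i++) {
            out.put((byte) s.charAt(i));
        }
        for (; i < n; i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                out.put((byte) c);
            } else if (c < 0x800) {
                out.put((byte) (0xC0 | (c >>> 6)));
                out.put((byte) (0x80 | (c & 0x3F)));
            } else if (Character.isHighSurrogate(c) && i + 1 < n
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                int cp = Character.toCodePoint(c, s.charAt(++i));
                out.put((byte) (0xF0 | (cp >>> 18)));
                out.put((byte) (0x80 | ((cp >>> 12) & 0x3F)));
                out.put((byte) (0x80 | ((cp >>> 6) & 0x3F)));
                out.put((byte) (0x80 | (cp & 0x3F)));
            } else if (Character.isSurrogate(c)) {
                throw new IllegalArgumentException("Unpaired surrogate at index " + i);
            } else {
                out.put((byte) (0xE0 | (c >>> 12)));
                out.put((byte) (0x80 | ((c >>> 6) & 0x3F)));
                out.put((byte) (0x80 | (c & 0x3F)));
            }
        }
    }
}
```

Knowing the length up front is what removes the temporary buffer: the caller can reserve exactly `encodedLength(s)` bytes in the destination and then call `encode` once.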

@aardappel
Collaborator

@omalley:
This is generally awesome. String conversion can be the bottleneck in Java, so this is very valuable.

A few concerns:

  • The comments mention that this implements a relatively strict decoder. I'd be worried that someone's use of FlatBuffers in the field suddenly breaks because they have sloppy UTF-8 data stored. What is the likelihood of this? Any idea how its behavior might differ from the Java SDK one? If invalid UTF-8 is found, how does that affect the String returned?
  • I'm also uneasy about the use of sun.misc.Unsafe; there are rumors it will be removed (https://www.javaworld.com/article/2952869/java-platform/understanding-sun-misc-unsafe.html). Is it even supported on all platforms (e.g. all versions of Android? https://gitter.im/scala-android/sbt-android/archives/2018/01/09) Would it simply fall back on the slower path?
    • If it is going to fall back, it would be good if that happened without class warnings.
    • There are also logging statements in the code; I would prefer not to have those.
    • How fast is the non-unsafe fallback path? If it is slower than the Java SDK code, that would be a problem. If it is almost as fast as the unsafe code, we could greatly simplify this code by always using this path.
    • Why is unsafe even needed? UTF-8 parsing is byte based; surely it can't be that slow to just read bytes out of a byte[]? UnsafeUtil is full of scalar accessors that are not needed.
  • Generally, while I understand that there's an advantage to copying these files as-is from Protobuf in case we want to update them in the future, it is also a LOT of code, quite a bit of which doesn't seem necessary for the FlatBuffers use case. If possible, I'd prefer reduced, special-case code with fewer dependencies.

@shivendra14: The C++ implementation currently does no UTF-8 encoding or decoding, UTF-8 is its native format.

@omalley
Contributor Author

omalley commented Nov 16, 2018

The comments mention that this implements a relatively strict decoder. I'd be worried that someone's use of FlatBuffers in the field suddenly breaks because they have sloppy UTF-8 data stored. What is the likelihood of this?

The chance is very very low. Many of us have used protobuf for years and I've never talked to anyone who has had a problem with their UTF-8 encoder. I don't think the Java encoder generates such sequences.
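
As a rough illustration of what "strict" versus lenient decoding can mean, the sketch below contrasts two standard JDK behaviors. `new String(bytes, UTF_8)` replaces malformed input with U+FFFD, while a `CharsetDecoder` configured with `CodingErrorAction.REPORT` throws. The strict method here only stands in for a strict decoder; the thread does not say which exception the Protobuf decoder raises, and the class name `DecodeStrictness` is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Illustrative only; contrasts lenient and strict UTF-8 decoding in the JDK. */
final class DecodeStrictness {
    private DecodeStrictness() {}

    /** Lenient: malformed sequences are replaced with U+FFFD. */
    static String lenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Strict: malformed sequences raise a CharacterCodingException. */
    static String strict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }
}
```

With input such as `{'a', (byte) 0xFF, 'b'}`, the lenient call returns a string containing a replacement character, whereas the strict call fails. Data that was silently repaired before would surface as an error under a strict decoder.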

I'm also uneasy about the use of sun.misc.Unsafe; there are rumors it will be removed (https://www.javaworld.com/article/2952869/java-platform/understanding-sun-misc-unsafe.html). Is it even supported on all platforms (e.g. all versions of Android? https://gitter.im/scala-android/sbt-android/archives/2018/01/09) Would it simply fall back on the slower path?

Sigh. Yes, Java is trying to remove Unsafe, but its maintainers have not yet proposed a workable alternative for projects that get significant speed boosts by using it.

The code detects whether unsafe is available and automatically falls back to the safe path.
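
A minimal sketch of that kind of detection, assuming a reflective lookup of the `theUnsafe` field on `sun.misc.Unsafe` (the usual field name on HotSpot JVMs; other runtimes may differ). The class `UnsafeProbe` and its methods are hypothetical, not taken from the PR, and failures are swallowed silently so no warnings or log lines are produced.

```java
import java.lang.reflect.Field;

/** Illustrative only; probes for sun.misc.Unsafe without a compile-time dependency. */
final class UnsafeProbe {
    private UnsafeProbe() {}

    /** Returns the Unsafe instance as an Object, or null when it cannot be obtained. */
    static Object tryGetUnsafe() {
        try {
            Class<?> clazz = Class.forName("sun.misc.Unsafe");
            Field field = clazz.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            return field.get(null);
        } catch (Throwable t) {
            // Missing class, renamed field, or denied access: report "unavailable".
            return null;
        }
    }

    /** Picks an encoder name based on availability; the names are placeholders. */
    static String chooseEncoder() {
        return tryGetUnsafe() != null ? "unsafe" : "safe";
    }
}
```

Catching `Throwable` is deliberate in this sketch, since the lookup can fail with errors as well as exceptions depending on the platform.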

There are also logging statements in the code; I would prefer not to have those.

Ok.

Why is unsafe even needed? UTF-8 parsing is byte based; surely it can't be that slow to just read bytes out of a byte[]?

I'll test out the performance of the safe vs unsafe code paths.

Generally, while I understand that there's an advantage to copying these files as-is from Protobuf in case we want to update them in the future, it is also a LOT of code, quite a bit of which doesn't seem necessary for the FlatBuffers use case. If possible, I'd prefer reduced, special-case code with fewer dependencies.

Taking the entire pair of classes is a trade-off:

  • more code, which is bad
  • less chance of introducing bugs, which is good
  • easier to apply future fixes from protobuf, which is good

It all depends on which you value more highly.

@aardappel
Collaborator

It also depends on the speed. If we need Unsafe to reach good speeds, then using the files mostly as-is makes sense. If it is not a big difference with the safe path then getting rid of the unsafe path would be a huge simplification, and worth doing, imho.

@omalley
Contributor Author

omalley commented Nov 20, 2018

Ok, I've factored out the faster encoders into an API and three implementations:

  • One that uses the built-in Java encoder
  • One that uses the protobuf "safe" encoder
  • One that uses the protobuf "unsafe" encoder

This code is a first pass and really should have some unit tests. Note that the safe encoder has two variants: one for array-backed ByteBuffers and one for non-array ones. The unsafe encoder has three variants: array-backed ByteBuffers, direct ByteBuffers, and neither. The neither case uses the same code as the safe non-array variant.
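
A rough sketch of how such an API could be shaped, with one interface, a JDK-backed implementation, and a safe implementation that branches on `ByteBuffer.hasArray()`. All type and method names here (`StringEncoder`, `JdkEncoder`, `SafeEncoder`, `putCodePoint`) are assumptions for illustration and are not the names used in the PR. The array-backed path assumes the caller has already reserved enough space.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical API; writes s as UTF-8 at out.position() and advances it. */
interface StringEncoder {
    void encode(CharSequence s, ByteBuffer out);
}

/** Built-in Java encoder: allocates a temporary byte[] and copies it. */
final class JdkEncoder implements StringEncoder {
    @Override
    public void encode(CharSequence s, ByteBuffer out) {
        out.put(s.toString().getBytes(StandardCharsets.UTF_8));
    }
}

/** "Safe" encoder: no sun.misc.Unsafe, one variant per kind of buffer. */
final class SafeEncoder implements StringEncoder {
    @Override
    public void encode(CharSequence s, ByteBuffer out) {
        if (out.hasArray()) {
            // Array-backed variant: write straight into the backing array.
            byte[] dst = out.array();
            int start = out.arrayOffset() + out.position();
            int end = start;
            for (int i = 0; i < s.length(); ) {
                int cp = Character.codePointAt(s, i);
                end += putCodePoint(cp, dst, end);
                i += Character.charCount(cp);
            }
            out.position(out.position() + (end - start));
        } else {
            // Non-array variant (for example direct buffers): go through put().
            byte[] scratch = new byte[4];
            for (int i = 0; i < s.length(); ) {
                int cp = Character.codePointAt(s, i);
                out.put(scratch, 0, putCodePoint(cp, scratch, 0));
                i += Character.charCount(cp);
            }
        }
    }

    /** Encodes one code point at dst[off] onward and returns the byte count. */
    private static int putCodePoint(int cp, byte[] dst, int off) {
        if (cp < 0x80) {
            dst[off] = (byte) cp;
            return 1;
        }
        if (cp < 0x800) {
            dst[off] = (byte) (0xC0 | (cp >>> 6));
            dst[off + 1] = (byte) (0x80 | (cp & 0x3F));
            return 2;
        }
        if (cp < 0x10000) {
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                throw new IllegalArgumentException("Unpaired surrogate");
            }
            dst[off] = (byte) (0xE0 | (cp >>> 12));
            dst[off + 1] = (byte) (0x80 | ((cp >>> 6) & 0x3F));
            dst[off + 2] = (byte) (0x80 | (cp & 0x3F));
            return 3;
        }
        dst[off] = (byte) (0xF0 | (cp >>> 18));
        dst[off + 1] = (byte) (0x80 | ((cp >>> 12) & 0x3F));
        dst[off + 2] = (byte) (0x80 | ((cp >>> 6) & 0x3F));
        dst[off + 3] = (byte) (0x80 | (cp & 0x3F));
        return 4;
    }
}
```

Keeping the JDK-backed implementation behind the same interface gives callers a way to switch back if the new encoder misbehaves on their data.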

On my benchmark, the results I get are:

generated     direct             orig  avgt   10  32.133 ± 1.755  ms/op
generated     direct             safe  avgt   10  12.973 ± 0.189  ms/op
generated     direct           unsafe  avgt   10  11.745 ± 0.231  ms/op
generated      array             orig  avgt   10  45.376 ± 1.546  ms/op
generated      array             safe  avgt   10  14.136 ± 0.286  ms/op
generated      array           unsafe  avgt   10  12.482 ± 0.158  ms/op

So a direct ByteBuffer with the unsafe encoder is the fastest. Is the extra ~10% worth keeping the unsafe encoder?
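
The ratios behind that question can be checked from the table above. This small helper, whose class and method names are made up for illustration, computes how much time the unsafe encoder saves relative to the safe one and relative to the original. Using the reported means, the saving works out to roughly 9 to 12% of the safe encoder's time and a little under 4% of the original time.

```java
import java.util.Locale;

/** Illustrative arithmetic on the benchmark means reported above (ms/op). */
final class BenchmarkRatios {
    private BenchmarkRatios() {}

    /** Time saved going from one encoder to another, as a percentage of base. */
    static double savedPercent(double from, double to, double base) {
        return 100.0 * (from - to) / base;
    }

    /** How many times faster "after" is than "before". */
    static double speedup(double before, double after) {
        return before / after;
    }

    /** Formats one summary line for a buffer type. */
    static String report(String label, double orig, double safe, double unsafe) {
        return String.format(Locale.ROOT,
                "%s: safe is %.2fx faster than orig; unsafe saves %.1f%% of safe, %.1f%% of orig",
                label,
                speedup(orig, safe),
                savedPercent(safe, unsafe, safe),
                savedPercent(safe, unsafe, orig));
    }
}
```

For example, `BenchmarkRatios.report("direct", 32.133, 12.973, 11.745)` and `BenchmarkRatios.report("array", 45.376, 14.136, 12.482)` summarize the two buffer types.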

@aardappel
Collaborator

My gut feeling says that for the sake of simplicity, testing, and dependencies, it would be nice to go with just the safe encoder. It's already a huge improvement over the original, and that extra 10% (or 3%, from the perspective of the original) is not worth the extra complexity.

Thanks for testing this!

@omalley
Contributor Author

omalley commented Dec 15, 2018

Ok, I've removed the unsafe code and the numbers look good. I left the Java UTF-8 encoder as an option in case someone wants to revert to the original encoder.

  <groupId>com.google.flatbuffers</groupId>
  <artifactId>flatbuffers-java</artifactId>
- <version>1.10.0</version>
+ <version>1.10.1-SNAPSHOT</version>
Collaborator

Why this change?

Contributor Author

You do need to change the version string before the next release, but I changed it here mostly for my own upstream testing to make sure that I got the new artifact with my change and not the cached 1.10.0.

In my projects, I change the version string just after a release. In your case, I would have changed it to "1.11.0-SNAPSHOT". Just before the release, I would take off the "-SNAPSHOT".

Collaborator

Ok.. I guess we haven't been doing that so far, but ok for now.

@aardappel
Collaborator

This looks great! Thanks for factoring out the unsafe stuff. I can merge, though I'm not sure we need to change the version number between releases.

@aardappel aardappel merged commit cb99116 into google:master Dec 17, 2018
@aardappel
Collaborator

Merged! Let's see if it has a (positive) impact on Java users :)

@aardappel
Collaborator

aardappel commented Jan 31, 2019

@omalley it looks like the use of lambdas and default methods creates problems with Java 1.7. Is there any way we can work around that?

flatbuffers/FlatBufferBuilder.java:180: error: default methods are not supported in -source 1.7
    default void releaseByteBuffer(ByteBuffer bb) {
    ^
  (use -source 8 or higher to enable default methods)

flatbuffers/java/com/google/flatbuffers/Utf8Old.java:46: error: lambda expressions are not supported in -source 1.7
    ThreadLocal.withInitial(() -> new Cache());
    ^
  (use -source 8 or higher to enable lambda expressions)

related breakage: #4914
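
One possible Java 1.7-compatible shape for both constructs, sketched under assumptions: the lambda passed to `ThreadLocal.withInitial` can be replaced by an anonymous subclass overriding `initialValue()`, and an interface default method can become a concrete method on an abstract class. The thread does not show what the actual fix looked like, and the names `Java7Compat`, `BufferFactory`, `newByteBuffer`, and the body of `Cache` are hypothetical.

```java
import java.nio.ByteBuffer;

/** Illustrative only; compiles with -source 1.7 because it avoids Java 8 syntax. */
final class Java7Compat {
    private Java7Compat() {}

    /** Placeholder for the per-thread cache; its real contents are not shown in the thread. */
    static final class Cache {
        final StringBuilder scratch = new StringBuilder();
    }

    // Java 8 form: ThreadLocal.withInitial(() -> new Cache());
    // Java 7 form: override initialValue() in an anonymous subclass.
    private static final ThreadLocal<Cache> CACHE = new ThreadLocal<Cache>() {
        @Override
        protected Cache initialValue() {
            return new Cache();
        }
    };

    static Cache cache() {
        return CACHE.get();
    }

    /** Abstract class instead of an interface with a default method. */
    abstract static class BufferFactory {
        abstract ByteBuffer newByteBuffer(int capacity);

        /** No-op by default; subclasses may override to recycle buffers. */
        void releaseByteBuffer(ByteBuffer bb) {
        }
    }
}
```

The trade-off of the abstract-class route is that implementers can no longer extend another class, which matters only if existing code relied on the type being an interface.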

zchee pushed a commit to zchee/flatbuffers that referenced this pull request Feb 14, 2019
* Pulling in protobuf's faster UTF-8 encoder.

* Remove Utf8 unsafe code.
@aardappel
Collaborator

@omalley
It seems Utf8Old was broken; I had to change this to make it work: ff1a22a
