Java: Pulling in protobuf's faster UTF-8 encoder. #5035
|
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here.
|
|
I signed it! |
|
CLAs look good, thanks! |
|
Is this Java specific? Anything similar for cpp? |
|
I haven't looked at the flatbuffer C++ implementation and how it is converting strings to & from UTF-8, so I don't know. Java stores strings in UTF-16, so you always have to convert. In C++, the encoding for std::string isn't defined by the library, so there probably isn't matching functionality. The Java code was using the standard Java library for such encoding and creating a new byte buffer with the UTF-8 string and then copying it to the final buffer. By using the UTF-8 encoder from Protobuf, the new code can calculate the size of the UTF-8 bytes ahead of time and directly write them into the final buffer. Both the size calculation and translation have a fast path for ASCII characters. |
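As an illustration of the approach described above, here is a minimal two-pass sketch: one pass computes the UTF-8 size, a second writes bytes directly into the destination buffer, and both skip quickly over leading ASCII. This is not the Protobuf or FlatBuffers code; the class name `Utf8TwoPassSketch` and its method names are hypothetical.

```java
import java.nio.ByteBuffer;

/** Illustrative only: a hypothetical class, not the Protobuf or FlatBuffers Utf8 class. */
final class Utf8TwoPassSketch {
    private Utf8TwoPassSketch() {}

    /** Pass 1: count the UTF-8 bytes needed for s without allocating anything. */
    static int encodedLength(CharSequence s) {
        int n = s.length();
        int i = 0;
        // ASCII fast path: every leading char below 0x80 is exactly one byte.
        while (i < n && s.charAt(i) < 0x80) {
            i++;
        }
        int bytes = i;
        for (; i < n; i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;
            } else if (c < 0x800) {
                bytes += 2;
            } else if (Character.isSurrogate(c)) {
                requirePair(s, i);
                bytes += 4; // a surrogate pair is one supplementary code point
                i++;        // skip the low surrogate
            } else {
                bytes += 3;
            }
        }
        return bytes;
    }

    /** Pass 2: write the UTF-8 bytes into out, starting at its current position. */
    static void encode(CharSequence s, ByteBuffer out) {
        int n = s.length();
        int i = 0;
        // ASCII fast path: copy leading one-byte characters directly.
        for (; i < n && s.charAt(i) < 0x80; i++) {
            out.put((byte) s.charAt(i));
        }
        for (; i < n; i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                out.put((byte) c);
            } else if (c < 0x800) {
                out.put((byte) (0xC0 | (c >>> 6)));
                out.put((byte) (0x80 | (c & 0x3F)));
            } else if (Character.isSurrogate(c)) {
                requirePair(s, i);
                int cp = Character.toCodePoint(c, s.charAt(++i));
                out.put((byte) (0xF0 | (cp >>> 18)));
                out.put((byte) (0x80 | ((cp >>> 12) & 0x3F)));
                out.put((byte) (0x80 | ((cp >>> 6) & 0x3F)));
                out.put((byte) (0x80 | (cp & 0x3F)));
            } else {
                out.put((byte) (0xE0 | (c >>> 12)));
                out.put((byte) (0x80 | ((c >>> 6) & 0x3F)));
                out.put((byte) (0x80 | (c & 0x3F)));
            }
        }
    }

    /** Rejects a surrogate at index i unless it is a high surrogate followed by a low one. */
    private static void requirePair(CharSequence s, int i) {
        boolean ok = Character.isHighSurrogate(s.charAt(i))
                && i + 1 < s.length()
                && Character.isLowSurrogate(s.charAt(i + 1));
        if (!ok) {
            throw new IllegalArgumentException("Unpaired surrogate at index " + i);
        }
    }
}
```

A caller would size the destination with `encodedLength(s)` and then call `encode(s, buffer)`, so no intermediate byte array is created and copied.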
|
@omalley: A few concerns:
@shivendra14: The C++ implementation currently does no UTF-8 encoding or decoding; UTF-8 is its native format. |
The chance is very very low. Many of us have used protobuf for years and I've never talked to anyone who has had a problem with their UTF-8 encoder. I don't think the Java encoder generates such sequences.
*sigh* Yes, Java is trying to remove Unsafe, but so far no workable alternative has been proposed for projects that get significant speed boosts by using it. The code detects whether Unsafe is available and automatically falls back to the safe path.
Ok.
I'll test out the performance of the safe vs unsafe code paths.
Taking the entire pair of classes is a trade off:
It all depends on which you value more highly. |
|
It also depends on the speed. If we need |
|
Ok, I've factored out the faster encoders into an API and three implementations:
This code is a first pass and really should have some unit tests. Note that the safe encoder has two variants: one for array-backed ByteBuffers and one for non-array ones. The unsafe encoder has three variants: array-backed ByteBuffers, direct ByteBuffers, and neither. The neither case uses the same code as the safe non-array variant. On my benchmark, the results I get are: So direct ByteBuffer with the unsafe encoder is the fastest. Do you care about the extra 10% to have the unsafe encoder? |
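To make the two safe variants concrete, here is a rough sketch of dispatching on `ByteBuffer.hasArray()`: array-backed buffers are written through the backing `byte[]`, everything else goes through `put`. The class `BufferVariantSketch` and its methods are hypothetical names for illustration, not the code in this PR.

```java
import java.nio.ByteBuffer;

/** Illustrative only: hypothetical names, not the encoder classes from this PR. */
final class BufferVariantSketch {
    private BufferVariantSketch() {}

    /** Picks a variant based on whether the buffer exposes a backing array. */
    static void encode(CharSequence s, ByteBuffer out) {
        if (out.hasArray()) {
            int start = out.arrayOffset() + out.position();
            int end = encodeToArray(s, out.array(), start);
            // Advance the position by the bytes written; throws if that passes the limit.
            out.position(out.position() + (end - start));
        } else {
            encodeWithPut(s, out);
        }
    }

    /** Array variant: plain indexed stores into the backing array. Returns the end index. */
    static int encodeToArray(CharSequence s, byte[] dst, int j) {
        for (int i = 0; i < s.length(); ) {
            int cp = nextCodePoint(s, i);
            i += Character.charCount(cp);
            if (cp < 0x80) {
                dst[j++] = (byte) cp;
            } else if (cp < 0x800) {
                dst[j++] = (byte) (0xC0 | (cp >>> 6));
                dst[j++] = (byte) (0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                dst[j++] = (byte) (0xE0 | (cp >>> 12));
                dst[j++] = (byte) (0x80 | ((cp >>> 6) & 0x3F));
                dst[j++] = (byte) (0x80 | (cp & 0x3F));
            } else {
                dst[j++] = (byte) (0xF0 | (cp >>> 18));
                dst[j++] = (byte) (0x80 | ((cp >>> 12) & 0x3F));
                dst[j++] = (byte) (0x80 | ((cp >>> 6) & 0x3F));
                dst[j++] = (byte) (0x80 | (cp & 0x3F));
            }
        }
        return j;
    }

    /** Non-array variant (for example direct buffers): relative put calls only. */
    static void encodeWithPut(CharSequence s, ByteBuffer out) {
        for (int i = 0; i < s.length(); ) {
            int cp = nextCodePoint(s, i);
            i += Character.charCount(cp);
            if (cp < 0x80) {
                out.put((byte) cp);
            } else if (cp < 0x800) {
                out.put((byte) (0xC0 | (cp >>> 6)));
                out.put((byte) (0x80 | (cp & 0x3F)));
            } else if (cp < 0x10000) {
                out.put((byte) (0xE0 | (cp >>> 12)));
                out.put((byte) (0x80 | ((cp >>> 6) & 0x3F)));
                out.put((byte) (0x80 | (cp & 0x3F)));
            } else {
                out.put((byte) (0xF0 | (cp >>> 18)));
                out.put((byte) (0x80 | ((cp >>> 12) & 0x3F)));
                out.put((byte) (0x80 | ((cp >>> 6) & 0x3F)));
                out.put((byte) (0x80 | (cp & 0x3F)));
            }
        }
    }

    /** Reads the code point at index i and rejects unpaired surrogates. */
    private static int nextCodePoint(CharSequence s, int i) {
        int cp = Character.codePointAt(s, i);
        if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
            throw new IllegalArgumentException("Unpaired surrogate at index " + i);
        }
        return cp;
    }
}
```

The duplication between the two loops is deliberate in this sketch: each variant is shaped around how its kind of buffer is accessed, which is the motivation for keeping separate code paths.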
|
My gut feeling says that for the sake of simplicity, testing, and dependencies, it would be nice to go with just the safe encoder. It's a huge improvement over the original already, and that extra 10% (or 3%, from the perspective of the original) is not worth the extra complexity. Thanks for testing this! |
|
Ok, I've removed the unsafe code and the numbers look good. I left the Java UTF-8 encoder as an option in case someone wants to revert to the original encoder. |
  <groupId>com.google.flatbuffers</groupId>
  <artifactId>flatbuffers-java</artifactId>
- <version>1.10.0</version>
+ <version>1.10.1-SNAPSHOT</version>
Why this change?
You do need to change the version string before the next release, but I changed it here mostly for my own upstream testing to make sure that I got the new artifact with my change and not the cached 1.10.0.
In my projects, I change the version string just after a release. In your case, I would have changed it to "1.11.0-SNAPSHOT". Just before the release, I would take off the "-SNAPSHOT".
Ok, I guess we haven't been doing that so far, but it's fine for now.
|
This looks great! Thanks for factoring out the unsafe stuff. I can merge, though not sure if we need to be changing the version number between releases. |
|
Merged! Let's see if it has a (positive) impact on Java users :) |
|
@omalley It looks like the use of lambdas creates problems with Java 1.7. Is there any way we can work around that? Related breakage: #4914 |
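The thread does not say where the lambdas appear, so the following is only a generic sketch of one common workaround: replacing a lambda-based `ThreadLocal.withInitial(...)` (a Java 8 API) with an anonymous subclass that overrides `initialValue()`, which compiles for Java 7. The `Scratch` type and the class name are hypothetical.

```java
/** Illustrative only: a hypothetical per-thread cache, not code from this PR. */
final class Java7ThreadLocalSketch {
    private Java7ThreadLocalSketch() {}

    /** Hypothetical per-thread scratch state. */
    static final class Scratch {
        final StringBuilder sb = new StringBuilder();
    }

    // Java 8+ form, which does not compile with -source 1.7:
    //   ThreadLocal.withInitial(() -> new Scratch());
    //
    // Java 7-compatible form: an anonymous subclass instead of a lambda.
    static final ThreadLocal<Scratch> SCRATCH = new ThreadLocal<Scratch>() {
        @Override
        protected Scratch initialValue() {
            return new Scratch();
        }
    };
}
```

Both forms give each thread its own lazily created instance; the anonymous class is only more verbose.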
* Pulling in protobuf's faster UTF-8 encoder.
* Remove Utf8 unsafe code.
This change pulls the Utf8 Java class from Protobuf, which has a much faster encoder than Java's. In my benchmarks, I see a ~45% speed up in FlatBuffer serialization. The speed up on deserialization is only ~14%.