ARROW-1744: [Plasma] Provide TensorFlow operator to transfer Tensors between Plasma and TensorFlow #2046
Conversation
Force-pushed from bd9aa43 to 5d0174a
Codecov Report
@@ Coverage Diff @@
## master #2046 +/- ##
==========================================
+ Coverage 86.28% 86.31% +0.03%
==========================================
Files 242 244 +2
Lines 41042 41192 +150
==========================================
+ Hits 35413 35556 +143
- Misses 5629 5636 +7
Continue to review full report at Codecov.
Force-pushed from 4a09f87 to 3cc3362
Just a bit curious: this sounds like a translation layer between TensorFlow data and Plasma objects. Is the purpose of this change to let TensorFlow use Plasma as its store?

@zhijunfu This op is meant to transfer data between Plasma and TensorFlow. Supporting machine learning workloads seamlessly is very important for us, and we want the interop between TensorFlow and Plasma to be as high-performance as possible. The op will be fully optional: a flag that is off by default determines whether it gets built and included, so users who don't want the op will have no dependency on TensorFlow. You are right that the op interacts with Plasma through the client interface. [This is work in progress] Also cc'ing @concretevitamin, who is the author of the op.

@pcmoritz Got it. Thanks for the explanation.
Force-pushed from 8644be8 to ae2e7ca
Force-pushed from e3c44db to 6a44a5b
Force-pushed from 6a44a5b to c705d53
endif()
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -D_GLIBCXX_USE_CXX11_ABI=0")
include_directories(${TensorFlow_INCLUDE_DIRS})
include_directories(${TensorFlow_INCLUDE_DIRS}/external/nsync/public)
Hmm, (1) is nsync not included in the tf.sysconfig output? (2) Orthogonal to the last question, is this needed here? I never needed it when compiling by hand.
This is only needed for Python 2.7, and yes, it's not included in the tf.sysconfig output. See sadeepj/crfasrnn_keras#19, which describes the workaround I'm putting in.
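For reference, a minimal CMake sketch of the setup being discussed. The execute_process query and the variable name are illustrative (not necessarily what this PR's CMake does); tf.sysconfig.get_include() is the real TensorFlow API:

```cmake
# Query TensorFlow's include directory at configure time
# (assumes the python on PATH has TensorFlow installed).
execute_process(
  COMMAND python -c "import tensorflow as tf; print(tf.sysconfig.get_include())"
  OUTPUT_VARIABLE TensorFlow_INCLUDE_DIRS
  OUTPUT_STRIP_TRAILING_WHITESPACE)
include_directories(${TensorFlow_INCLUDE_DIRS})
# The nsync headers are not part of tf.sysconfig's output, so they are added
# explicitly; per the comment above, only needed for Python 2.7 builds.
include_directories(${TensorFlow_INCLUDE_DIRS}/external/nsync/public)
```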
cpp/src/plasma/tf_plasma_op.cc
Outdated
}
}

// #define PLASMA_CLIENT_DOES_NOT_EXIST 3
This commented-out block can be removed entirely.
doing this now
REGISTER_KERNEL_BUILDER(Name("TensorToPlasma").Device(DEVICE_CPU),
                        TensorToPlasmaOp<CPUDevice>);
REGISTER_KERNEL_BUILDER(Name("TensorToPlasma").Device(DEVICE_GPU),
Please wrap these two GPU registrations in a GOOGLE_CUDA guard.
doing this now
REGISTER_KERNEL_BUILDER(Name("PlasmaToTensor").Device(DEVICE_CPU),
                        PlasmaToTensorOp<CPUDevice>);
REGISTER_KERNEL_BUILDER(Name("PlasmaToTensor").Device(DEVICE_GPU),
ditto
doing this now
cpp/src/plasma/tf_plasma_op.cc
Outdated
const int64_t size_in_bytes = object_buffer.data->size();
TensorShape shape({size_in_bytes / sizeof(float)});
// LOG(INFO) << "Output TensorShape: " << shape.DebugString();
All commented-out code can be removed.
doing this now
cpp/src/plasma/tf_plasma_op.cc
Outdated
OP_REQUIRES_ASYNC(context, success,
                  errors::Internal("D2H memcpy failed to be enqueued."), done);
}
// TODO(zongheng): does std::move() give better performance?
can remove
yep
cpp/src/plasma/tf_plasma_op.cc
Outdated
#include "tensorflow/core/framework/shape_inference.h"
#include "tensorflow/core/platform/logging.h"
#include "tensorflow/core/platform/mutex.h"
#ifdef GOOGLE_CUDA
Group this with the gpu_event_mgr include.
ok
robertnishihara left a comment
Nice work @pcmoritz @concretevitamin!
std::string plasma_store_socket_name_;
std::string plasma_manager_socket_name_;

mutex mu_;
Please document all of the private fields. In particular, mu_ guards more than just client_, right?
sess.run(to_plasma)
# NOTE(zongheng): currently it returns a flat 1D tensor.
# So reshape manually.
If we serialize data as Arrow tensors, it should be easy to get rid of this limitation.
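A minimal numpy sketch of the reshape workaround the test has to resort to (numpy arrays stand in for TensorFlow tensors; the shapes are illustrative):

```python
import numpy as np

# The op currently hands back a flat 1D float32 vector, so the caller must
# remember the original shape and reshape manually. Serializing as an Arrow
# tensor would carry the shape along and remove this step.
original = np.arange(12, dtype=np.float32).reshape(3, 4)
flat = original.ravel()        # what the op currently returns
restored = flat.reshape(3, 4)  # manual reshape on the caller's side
assert np.array_equal(restored, original)
```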
if (std::is_same<Device, CPUDevice>::value) {
  for (int i = 0; i < num_tensors; ++i) {
    const auto& input_tensor = context->input(i);
    std::memcpy(static_cast<void*>(data + offsets[i] / sizeof(float)),
Let's serialize data as Arrow tensors.
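To illustrate the pointer arithmetic above: offsets are accumulated in bytes, but the destination pointer is float*, so each offset is divided by sizeof(float) before indexing. A small numpy sketch (variable names are illustrative; numpy slicing stands in for std::memcpy):

```python
import numpy as np

SIZEOF_FLOAT = 4  # sizeof(float) in the C++ code

# One byte offset per input tensor, accumulated like the C++ code does.
tensors = [np.ones(3, dtype=np.float32), np.full(2, 2.0, dtype=np.float32)]
offsets = [0]
for t in tensors:
    offsets.append(offsets[-1] + t.nbytes)

# Convert each byte offset to a float-element index before copying.
dest = np.empty(sum(t.size for t in tensors), dtype=np.float32)
for t, off in zip(tensors, offsets):
    start = off // SIZEOF_FLOAT
    dest[start:start + t.size] = t  # stands in for std::memcpy
assert dest.tolist() == [1.0, 1.0, 1.0, 2.0, 2.0]
```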
plasma_manager_socket_name="")

def FromPlasma():
  return plasma.tf_plasma_op.plasma_to_tensor(
Can you add a page documenting the API and giving code examples for how to use it?
cpp/src/plasma/tf_plasma_op.cc
Outdated
const int64_t size_in_bytes = object_buffer.data->size();
TensorShape shape({size_in_bytes / sizeof(float)});
// LOG(INFO) << "Output TensorShape: " << shape.DebugString();
// LOG(INFO) << "size_in_bytes of the plasma object: " << size_in_bytes;
Let's remove all the dead code.
}

const int64_t size_in_bytes = object_buffer.data->size();
TensorShape shape({size_in_bytes / sizeof(float)});
We should make sure we support different dtypes and preserve the tensor type.
This code looks like it might fail for tensors whose dtype is smaller than 4 bytes.
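The concern is easy to demonstrate with numpy: dividing the byte size by sizeof(float) only recovers the element count when the stored dtype is 4 bytes wide (the dtypes below are illustrative):

```python
import numpy as np

arr16 = np.zeros(6, dtype=np.float16)  # 2-byte dtype, 12 bytes total
arr32 = np.zeros(6, dtype=np.float32)  # 4-byte dtype, 24 bytes total

# Shape inferred as size_in_bytes / sizeof(float), mirroring the op:
assert arr32.nbytes // 4 == 6  # correct for float32
assert arr16.nbytes // 4 == 3  # wrong: the float16 tensor has 6 elements
```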
It'd be nice to ship GPU and CPU versions of the op with pyarrow. Is that feasible?
TF's core may change across versions. For example, I've had to rebuild some of my ops and Horovod after a TF upgrade. Therefore it's probably not feasible to ship the op as a binary like the rest of pyarrow; it'll have to be built at install time.
I guess we can build against CUDA on Travis but not test with a real GPU. We should ensure that we can build all artefacts within Travis, otherwise it will be hard to make any binary releases. I'm also wondering whether this functionality would be better off as a separate Python package, since it may often need to be compiled from scratch. I would expect that we should be able to ship most parts of
As we ship wheels with
Replaced by #2104.
…between Plasma and TensorFlow

This is based off of #2046 from @pcmoritz and @concretevitamin, who wrote the original TensorFlow op with help from @ericl. In addition to the previous PR, this supports serialization of tensors of arbitrary type.

To get it working, Arrow has to be compiled in the following way:

```
cmake -DARROW_PLASMA=on -DARROW_TENSORFLOW=on -DARROW_PYTHON=on ..
make -j8
sudo make install
```

And pyarrow like this:

```
PYARROW_WITH_PLASMA=1 PYARROW_WITH_TENSORFLOW=1 python setup.py develop
```

The PR also includes utilities that should be generally useful for converting between TensorFlow and Arrow. It is currently meant for transferring weights in and out of TensorFlow, but hopefully in the future there will also be support for data tensors.

Author: Philipp Moritz <[email protected]>
Author: Peter Schafhalter <[email protected]>

Closes #2104 from pschafhalter/tensor-serialization and squashes the following commits: 50c6635 <Philipp Moritz> don't import tensorflow on manylinux1 3b7bbc7 <Philipp Moritz> skip tests that use tensorflow inside of manylinux1 container 097d48d <Philipp Moritz> Merge branch 'master' into tensor-serialization df44982 <Philipp Moritz> Merge branch 'master' into tensor-serialization 13e4db3 <Philipp Moritz> remove setting the deleted variable to nullptr b79c710 <Philipp Moritz> more fixes 200ad89 <Philipp Moritz> move build flags earlier b4b4134 <Philipp Moritz> tensorflow namespace b0731ad <Philipp Moritz> fixes 10e369a <Philipp Moritz> more fixes 01e580c <Philipp Moritz> more cleanups 027d683 <Philipp Moritz> fix e494f28 <Philipp Moritz> clean up code 1ef52d0 <Philipp Moritz> fix 90757c1 <Philipp Moritz> fix gpu code d5d3740 <Philipp Moritz> fix d137fcf <Philipp Moritz> compile pyarrow without cxx-11 ABI too 81f34ff <Philipp Moritz> fix b4e02e1 <Philipp Moritz> fix test 749e21e <Philipp Moritz> cleanups and test fb8786e <Philipp Moritz> fix bc3a5bb <Philipp Moritz> cleanups 56c9cf1 <Philipp Moritz> fix c6066ab <Philipp Moritz> fix 2fe3523 <Philipp Moritz> cleanups 4dc4ff3 <Philipp Moritz> fix TensorFlow test 3264efe <Philipp Moritz> test acc1f0b <Philipp Moritz> fix test 372551f <Philipp Moritz> pkg-config 64ee050 <Philipp Moritz> fix test d83b9ae <Philipp Moritz> fix test 2385e7c <Philipp Moritz> fix af4eb2f <Philipp Moritz> cleanups 6a34556 <Philipp Moritz> move things around afa5dba <Philipp Moritz> fix b29bf39 <Philipp Moritz> linting 14c9e50 <Philipp Moritz> fix linting eb1a576 <Philipp Moritz> fix 47a21c9 <Philipp Moritz> add license 1c8c5fa <Philipp Moritz> Merge branch 'master' into tensor-serialization 1442be0 <Philipp Moritz> test more dtypes ead1a2e <Philipp Moritz> test af68371 <Philipp Moritz> debug2 a2131ad <Philipp Moritz> debug 531f4a7 <Philipp Moritz> fix be83a20 <Philipp Moritz> try datatype 077266b <Philipp Moritz> fixes 1164d5a <Philipp Moritz> fixes 4f641c3 <Philipp Moritz> fix warnings a674720 <Philipp Moritz> fix 140a2b9 <Philipp Moritz> needs debugging 8b1dd8c <Philipp Moritz> get write path working d7ead2b <Philipp Moritz> fix d291f64 <Philipp Moritz> fix f461b38 <Philipp Moritz> fix linting 3b01abe <Philipp Moritz> fix 232ed6f <Philipp Moritz> add tensor utilities f7466de <Peter Schafhalter> Formatting 3ed436b <Peter Schafhalter> Formatting f4ad8e5 <Peter Schafhalter> Check that all tensors have the same dtype f601bb0 <Peter Schafhalter> Convert Tensor type to Arrow 4167d1d <Peter Schafhalter> Fix tensor serialization e261f2c <Philipp Moritz> fixes for the op d381ff3 <Philipp Moritz> fixes c705d53 <Philipp Moritz> fix for python 2.7 22b5564 <Philipp Moritz> fix wheel building ae2e7ca <Philipp Moritz> adapt code for arrow f6cef78 <Philipp Moritz> add initial tensorflow op written by @concretevitamin