Add Apache Arrow Support to TensorFlow Dataset #23002
This is currently a WIP, but I wanted to open the PR to start discussion. Still needed:
@yongtang I saw that you are adding a Parquet Dataset in #19461. There is a lot of overlap since the parquet-cpp reader is based on Arrow. I believe that the parquet-cpp reader can produce Arrow record batches that could then leverage the base classes here that iterate over the batches. It would be nice to have a common Arrow layer in TensorFlow Dataset that is easy to extend for specific Ops. Would you be interested in collaborating?
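To illustrate the overlap mentioned above, here is a small sketch (the Parquet file name is a hypothetical placeholder) showing that the parquet-cpp reader, exposed in Python through pyarrow.parquet, already surfaces Arrow record batches, the same unit a shared Arrow base layer would iterate over:

```python
# Sketch only: reading Parquet via parquet-cpp yields Arrow record batches.
import pyarrow.parquet as pq

table = pq.read_table("example.parquet")  # hypothetical file path
for batch in table.to_batches():          # pyarrow.RecordBatch objects
    print(batch.num_rows, batch.schema)
```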
@BryanCutler Yes, it would be great to have a common layer for Arrow. PR #19461 has been approved and is awaiting the bazel update (to 0.17.1) in #22449 to be merged. I think the Arrow part could be consolidated once that is in.
Adding @dmitrievanthony for the I/O-related discussion.
Thanks @yongtang. Hi @BryanCutler, the PR looks very promising! I see you have implemented only POSIX sockets so far; let me suggest having a look at the Apache Ignite Dataset (#22210). In that dataset I implemented Windows sockets as well as POSIX, so it might be helpful.
Great, thanks @yongtang! I will keep an eye out for #19461 being merged, and then we can think about consolidating it in this PR or a follow-up, if this one gets approved.
Thanks @dmitrievanthony! I would have liked to use Boost.Asio, but I'm not sure it's worth adding the dependency just for that. If it's been done for Windows too in the Apache Ignite Dataset, I'll take a look there and try to do the same.
@BryanCutler PR #19461 has the boost dependency in place, so adding asio might be less of an issue. Note, though, that the boost dependency only works with bazel 0.17.1+ (which is also the reason the merge of #19461 is blocked right now).
As a suggestion, you could consider making this a separate Python package that depends on both TensorFlow and Apache Arrow to build. The tooling around these kinds of Python packages with binary dependencies is still very new, but it is doable. TensorFlow does export its dataset header files. They are not under the API stability guarantees, but they have been fairly stable from what I have observed (two method signatures changed trivially in 1.10, I believe). Of course, @mrry knows more, as the author. I myself have made my own Dataset in an external project, using CMake. I don't have a ready-made example, but I know that Arrow uses CMake as well, so it is easily doable to have this live separately from the tensorflow repo. You can follow up with me about that. I mention this because tensorflow/contrib is going away.

Super cool work, by the way! I remember chatting with a Databricks engineer at the Spark Summit about how Arrow could be a great complement to Project Hydrogen.
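To make the out-of-tree suggestion above concrete, here is a minimal sketch of how an external package can compile a custom dataset op against TensorFlow's exported headers using the stock tf.sysconfig helpers; the file names arrow_dataset_ops.cc/.so are hypothetical placeholders, not files from this PR:

```python
# Minimal sketch (not from this PR): build an out-of-tree dataset op against
# TensorFlow's exported headers. arrow_dataset_ops.cc/.so are hypothetical.
import subprocess
import tensorflow as tf

compile_flags = tf.sysconfig.get_compile_flags()  # include dir, ABI defines
link_flags = tf.sysconfig.get_link_flags()        # lib dir, -ltensorflow_framework

subprocess.check_call(
    ["g++", "-std=c++11", "-shared", "-fPIC", "arrow_dataset_ops.cc",
     "-o", "arrow_dataset_ops.so"] + compile_flags + link_flags)

# The resulting shared library can then be loaded at runtime.
arrow_module = tf.load_op_library("./arrow_dataset_ops.so")
```

A CMake-based build, as mentioned in the comment above, would simply pass the same include and link flags to its own targets.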
@BryanCutler See related discussion about …
Force-pushed from e42861a to faee235
@BryanCutler In the short term, I guess it may depend on the length of the transition period. If 2.0 is imminent, then the PR may not be accepted. I don't know the planned date of 2.0 yet. Maybe @martinwicke has some insight about the 2.0 schedule?
We announced we wouldn't accept new projects into contrib when we announced 2.0. I think this is an obvious candidate for sig-io. @ewilderj how far are we with sig-io? I think we're looking for a date, right? Moving there will delay this for at least a few weeks, as we have to set up the sig-io repo and integration. But if we were to accept this here, it would simply have to be moved to sig-io anyway, and I don't think there will be another TF release that could include this before sig-io is off the ground, so I think the extra work is not too useful.
I created the mailing list for SIG IO (announcement on the developers@ list) and am recruiting interested folks for that group. Once we've got a few people signed up, we can kick off the SIG with a conference call and get things moving.
Thanks for the replies, @ewilderj and @martinwicke. I signed up for the SIG and am looking forward to the discussions!
Force-pushed from faee235 to 3d94ae5 (commit: "… batches. Define 3 ops to read record batches: 1) from memory, 2) from Feather files, 3) from an input stream/socket")
Nagging Reviewer @mrry: You have been added as a reviewer to this pull request. Please add your review or reassign. It has been 39 days with no activity.
I removed myself as reviewer, since it looks like this is going to go through the SIG IO process instead.
I am not involved in sig-io.
This PR has been moved to tensorflow/io#36 and can be closed.
Apache Arrow is a standard format for in-memory columnar data. It provides a cross-language platform for systems to communicate and operate on data efficiently.
Adding Arrow support to TensorFlow Dataset will allow systems to interface with TensorFlow in a well-defined way, without the need to develop custom converters, serialize data, or write to specialized files.
This PR defines an Arrow base layer that creates a TensorFlow Dataset with an iterator over Arrow record batches, producing Tensor values for each column. This base layer is then extended to implement three Dataset Ops that consume Arrow record batches: 1) from Python memory / Pandas DataFrames, 2) from Arrow Feather files, 3) from an input stream, with a socket client that connects to a server streaming Arrow record batches. These are implemented now because they are straightforward and provide good initial functionality, but the design should allow for more Arrow Ops in the future.
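As background on the data model described above, the sketch below uses only stock pyarrow and tf.data APIs (not the ops added by this PR) to show how the columns of an Arrow record batch map onto per-column Tensor values; the proposed ops would perform the equivalent iteration natively inside the Dataset kernels:

```python
# Conceptual sketch, not the API added by this PR: map the columns of an Arrow
# record batch onto per-column Tensor values with plain pyarrow + tf.data.
import pandas as pd
import pyarrow as pa
import tensorflow as tf

df = pd.DataFrame({"label": [0, 1, 1], "feature": [0.1, 0.2, 0.3]})
batch = pa.RecordBatch.from_pandas(df, preserve_index=False)

# Pull each column out as a numpy array; the ops in this PR would read the
# record batch directly in C++ instead of copying through Python.
columns = tuple(batch.column(i).to_pandas().values
                for i in range(batch.num_columns))

# Each dataset element is one row: a (label, feature) pair of scalar Tensors.
dataset = tf.data.Dataset.from_tensor_slices(columns)
```

The Feather and stream ops follow the same pattern; only the source of the record batches differs.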
fixes #23001